While programming, its a real pain if the data that you want to process is present in xls spreadsheets. If its only one xls file, then you can open it using MS Excel or Open Office and save it as a csv file. Once you get the data in csv format, life becomes simple. Almost all the programming languages have libraries for parsing csv files. With scripting languages like Python or Ruby, you can literally play around with csv files with just a few lines of code.
But what if the data is contained in many xls files? This is a typical situation that a programmer may face when he is working with non-programmers. For example, in the
research project that I am working on, we receive data on plant species from Botanists at the Smithsonian Institute. The data that they send is usually spread across 200 or so xls spreadsheets. In such situations, its great to have a command line tool that can convert xls to csv.
There are two command line tools that I found that do a pretty good job in converting xls to csv. They both have their advantages and disadvantages (which I will be talking about). Here they are:
1.
xls2csv (by V.B.Wagner)
Here is the
project webpage and here is the
direct link to download the source.
After installing it, here is how I ran it:
$ xls2csv myfile.xls > myfile.csv
xls2csv prints the output to stdout which can then be redirected to a file
The good news:
1. Installation is straightforward.
2. It works great, even when the xls file has multiple worksheets. In such a case (xls file with multiple worksheets), xls2csv will print all the worksheets onto the csv file with the contents of each worksheet separated by a form-feed character (^L)
3. In any line, even if the first "n" columns are empty and data is present in the "(n+1)th" column, xls2csv will recognize that line and include it in the generated csv file
The bad news:
1. xls2csv has problems with dates. If the xls file has dates, like "03/31/85" or "31-Mar-1985" etc, then xls2csv will not reproduce the date as is. Instead it will convert the date into a number while generating the csv file. The man page for xls2csv mentions that using the -f flag to specify the date format will solve this problem. But I could never got it work; several others have also reported this problem in discussion boards.
So, if your xls files do not have dates, this is the thing for you. You will be happy with the results.
Here is the second tool:
2.
xls2csv (by Ken Prows)
This tool is also called as xls2csv. But this is a different implementation by a different author. This is in Perl by Ken Prows whereas the first one was in C by V.B.Wagner (Note: Both these tools install to /usr/local/bin. So, if you will be installing both, make sure to configure them so that they have separate installation paths).
Here is the
project webpage and here is the
direct link to download the source.
Installing this can be a pain if you don't have the required Perl modules that xls2csv requires for its installation. However installing the required modules using CPAN is very easy. If you are having problems with the installation, here is a very nice
tutorial on installing xls2csv.
After installing it, here is how I ran it:
$ xls2csv -x myfile.xls -c myfile.csv
The good news:
1. Unlike the 1st tool, this works great with dates. The generated csv file will have the dates in the exact same format as the original xls file.
The bad news:
1. Installation can become a bit tricky if you don't have the required Perl modules.
2. If the xls has multiple worksheets, by default, it will convert only the first worksheet to csv. However, it does support a -w flag wherein you can specify the sheet name that you would like to convert:
$ xls2csv -x myfile.xls -c myfile.csv -w worksheet_name
It also supports a -W flag using which you can list all the worksheets in the xls file. Usage:
$ xls2csv -W -x myfile.xls
3. If the first column is empty, the entire row is ignored. So if you have a xls file in which the first column is blank, but has tons of data from second column onwards, xls2csv will generate an empty csv file... That's bad !!
Well, if you have xls files which have lots of dates (and hopefully single worksheets per xls file), this is the tool for you.
Now lets batch-convert...
Here is my Ruby script to batch-convert xls files to csv. It takes 2 arguments: 1) the source directory that has all the xls files and 2) the target directory where you want to save the generated csv files. So, the usage is:
$ ruby xls2csv.rb "/home/vinay/myxls" "/home/vinay/mycsv"
where "myxls" is the directory which has all the xls files and "mycsv" is the directory into which all the csv files will be generated.
Here is the Ruby script:
# Author: Vinay Kumar Bettadapura
if ARGV.length != 2
puts "usage: \"ruby xls2csv.rb source_dir target_dir\""
exit -1
end
source_dir = ARGV[0]
target_dir = ARGV[1]
if !File.exists?(target_dir)
puts "target_dir \"#{target_dir}\" is not a valid directory"
exit -1
end
source_entries = []
begin
source_entries = Dir.entries(source_dir).sort
# To remove the first two array elements which
# will be "." and ".."
source_entries.shift
source_entries.shift
rescue Exception => e
puts "source_dir \"#{source_dir}\" is not a valid directory"
exit -1
end
if source_entries.empty?
puts "source_dir \"#{source_dir}\" is empty"
exit 0
end
source_entries.each{|file|
source_file = source_dir + "/" + file
target_file = target_dir + "/" + file.gsub(".xls", ".csv")
puts "Converting \"#{source_file}\" to \"#{target_file}\""
`xls2csv \"#{source_file}\" > \"#{target_file}\"`
}
puts "Done..."
The Ruby script uses the 1st xls2csv tool. If you want to use the 2nd xls2csv tool, then replace
`xls2csv \"#{source_file}\" > \"#{target_file}\"`
on line number 39 with
`xls2csv -x \"#{source_file}\" -c \"#{target_file}\"`
in the Ruby script.