Split and Join: Using Ruby arrays and CSV files as databases 1

Posted by science on November 09, 2007

Splitting wood

If you mess around with CSV or other text delimited files and you use Ruby, you should know all about String::split and Array::join. If you don’t, here’s a quick primer. Split allows you to break up a string into elements of an array based on a delimiter (it returns an array). Join will merge an array into a single string, using a delimiter to separate the array elements.

Let’s take a “|” (pipe) delimited file as an example (I prefer them to CSV b/c they are less common in input streams and prefer them to TAB b/c it’s easier to work with). I’m going to assume that you don’t have to escape your delimiter field to simplify things but it’s relatively simple to work around that in the real world.

Our example task will be to take an input file that has 5 columns, and transform it into a file that has only 3 of those columns. Here’s the input file:

id|col1|col2|col3|col4
1|val1.1|val1.2|val1.3|val1.4
2|val2.1|val2.2|val2.3|val2.4
3|val3.1|val3.2|val3.3|val3.4
4|val4.1|val4.2|val4.3|val4.4

We want to remove “col2″ and “col4″ from this file and save the results. Here’s some ruby code that does that. I’m going to assume that you DON’T know the order of the columns, which will demonstrate some pretty useful header line management using arrays. Furthermore we will transform the output order to:

id|col3|col1

Here’s the code that does the job. Pretty cool. Drop me a line if something doesn’t make sense.

infile = File::open('infile.txt', 'r')
outfile = File::open('outfile.txt', 'w')

header = infile.gets
hdr_cols = header.split('|')

id_idx = hdr_cols.index('id') || (raise Exception, 'id')
col1_idx = hdr_cols.index('col1') || (raise Exception, '1')
col3_idx = hdr_cols.index('col3') || (raise Exception, '3')

# write the output header
outfile.puts('id|col3|col1')
# write each line with only columns we want
infile.each do |line|
  cols = line.split('|')
  line = []
  line << cols[id_idx].chomp
  line << cols[col3_idx].chomp
  line << cols[col1_idx].chomp
  outfile.puts(line.join('|'))
end

outfile.close
infile.close
Trackbacks

Use this link to trackback from your own site.

Comments

Leave a response

  1. jon Thu, 03 Apr 2008 08:43:23 EDT

    The other way round you could use a home-made CSV parsing regex to extract the CSV data again. Just my 2 cent.

    http://snippets.dzone.com/posts/show/4430

Comments