Split and Join: Using Ruby arrays and CSV files as databases 5

Posted by science on November 09, 2007

Splitting wood If you mess around with CSV or other text delimited files and you use Ruby, you should know all about String::split and Array::join. If you don’t, here’s a quick primer. Split allows you to break up a string into elements of an array based on a delimiter (it returns an array). Join will merge an array into a single string, using a delimiter to separate the array elements. Let’s take a “|” (pipe) delimited file as an example (I prefer them to CSV b/c they are less common in input streams and prefer them to TAB b/c it’s easier to work with). I’m going to assume that you don’t have to escape your delimiter field to simplify things but it’s relatively simple to work around that in the real world. Our example task will be to take an input file that has 5 columns, and transform it into a file that has only 3 of those columns. Here’s the input file:

id|col1|col2|col3|col4
1|val1.1|val1.2|val1.3|val1.4
2|val2.1|val2.2|val2.3|val2.4
3|val3.1|val3.2|val3.3|val3.4
4|val4.1|val4.2|val4.3|val4.4

We want to remove “col2″ and “col4″ from this file and save the results. Here’s some ruby code that does that. I’m going to assume that you DON’T know the order of the columns, which will demonstrate some pretty useful header line management using arrays. Furthermore we will transform the output order to:

id|col3|col1

Here’s the code that does the job. Pretty cool. Drop me a line if something doesn’t make sense.

infile = File::open('infile.txt', 'r')
outfile = File::open('outfile.txt', 'w')

header = infile.gets
hdr_cols = header.split('|').collect {|h| h.strip}

id_idx = hdr_cols.index('id') || (raise Exception, 'id')
col1_idx = hdr_cols.index('col1') || (raise Exception, '1')
col3_idx = hdr_cols.index('col3') || (raise Exception, '3')

# write the output header
outfile.puts('id|col3|col1')
# write each line with only columns we want
infile.each do |line|
  cols = line.split('|')
  line = []
  line << cols[id_idx].chomp
  line << cols[col3_idx].chomp
  line << cols[col1_idx].chomp
  outfile.puts(line.join('|'))
end

outfile.close
infile.close
Trackbacks

Use this link to trackback from your own site.

Comments

Leave a response

  1. jon Thu, 03 Apr 2008 08:43:23 UTC

    The other way round you could use a home-made CSV parsing regex to extract the CSV data again. Just my 2 cent.

    http://snippets.dzone.com/posts/show/4430

  2. science Sun, 02 Nov 2008 20:55:09 UTC

    Hey Jon – your code looks like it works but feels a lot like Perl to me (hard to read and maintain). The goal of my sample code above is to demonstrate how easy it is to write super readable code in Ruby that does useful work.. But each to his own! – Science

  3. phil Tue, 13 Jan 2009 07:45:56 UTC

    Hello
    The above code works well – but, when I tried to modify code to select id and col4 it failed with Exception error. No doubt it’s down to the fact I am a newbie with Ruby so – was wondering how the code looks when changed by someone who knows what they are doing.

    Regards
    Phil

  4. science Tue, 13 Jan 2009 08:45:31 UTC

    Phil: I’ll contact you off-line and hopefully we can figure it out!

    Science

  5. science Thu, 15 Jan 2009 17:13:02 UTC

    Update – Phil provided some sample data that showed up a bug in the original code. The OP now shows new code. The change is due to the somewhat unintuitive fact that when you split a line of text in Ruby, Ruby preserves the newline in the last array cell. This makes some sense when you’re splitting and joining in-line.

    But with more complex work, like the example above this can bite you. I had “chomped” the newlines correctly on the output, but not for the headers. As a result the last header row would be captured as “col4\n” but we want only “col4″ – so I added the code:

    
    hdr = hdr_row.split('|').collect {|h| h.strip}
    

    I actually chose to “strip” each header item which let’s the header row have spaces between the delimiters, as well as trimming the new line. If you only want to remove the new lines (in the unlikely event you have significant leading/trailing whitespace in your header names) then use “chomp” instead of “strip.”

Comments