All I get to do at work these days is go to meetings and read email. Most any technical stuff these days is on my own personal time. Today was one of those times when I actually debugged and fixed something. I maintain the archives of a flyfishing listserve at http://archives.flyfishlist.org which includes postings going back to 1990! I used to harvest and index the files monthly, but it’s gotten to be more like quarterly. Recently, I indexed the postings from April thru August. I noticed, though, that when I retrieved the articles, I was getting an extra blank line between every line of content, which made things much harder to read.
I looked systematically, wondering where the extra line was originating. I download the archive files via email from the server at the University of Kentucky, and then I strip out the email headers and run a PERL program against the file to put it into a format that the WAIS retrieval engine can index. The problem could have come from the server at UKY, from Gmail, from the browser I was using to save the archive files, from updates to PERL on my Mac or other things I’d not thought of…likely not the Linux VM since that’s been pretty static.
Using the “od” (octal dump) utility (found in Linux and MacOSX) I found that the problem seemed to be that the archive files, as I saved them from the server, had lines separated with LF/CR (hex 0a 0d). The “good” archives from earlier times were using only the Unix line separator LF (hex 0a). I looked at the PERL program that I was using to reformat the archive for indexing, and decided that it was a good place to insert code to strip out the CR.
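For anyone who would rather not read a byte dump by eye, the same check can be roughed out in PERL. This is only a sketch, not part of the original program: the helper name count_line_endings and the script name endings.pl are made up for illustration, and the pattern is a simple heuristic that looks for two-byte pairs before single bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: tally each line-ending style found in a string of raw bytes.
sub count_line_endings {
    my ($data) = @_;
    my %count = (CRLF => 0, LFCR => 0, LF => 0, CR => 0);
    # Two-byte pairs come first in the alternation so that a pair
    # is not counted as two separate single-byte endings.
    while ($data =~ /(\r\n|\n\r|\n|\r)/g) {
        my $ending = $1;
        if    ($ending eq "\r\n") { $count{CRLF}++ }
        elsif ($ending eq "\n\r") { $count{LFCR}++ }
        elsif ($ending eq "\n")   { $count{LF}++ }
        else                      { $count{CR}++ }
    }
    return %count;
}

# Usage (file name is a placeholder): perl endings.pl archive.txt
if (@ARGV) {
    my $file = $ARGV[0];
    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";
    my $data = do { local $/; <$fh> };   # slurp the whole file as raw bytes
    close($fh);
    $data = '' unless defined $data;
    my %count = count_line_endings($data);
    print "$_: $count{$_}\n" for sort keys %count;
}
```

A “good” archive should report only LF; anything in the other three rows means stray carriage returns are present.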
Time to try to remember how to write a regular expression! Pulled out the PERL book with a cheat sheet of examples 🙂 and came up with the fix, adding one line of code (plus a comment):
# added next line 9/2/12, apparently gmail saves now with LF/CR
$logLine =~ s/\r//; # delete a carriage return
That’s it…but it was fun to figure out!
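For the curious, here is a small stand-alone sketch of the same idea that can be run without the rest of the indexing program. The helper name strip_cr and the two sample lines are invented for illustration. It also uses the /g flag, which is an assumption beyond the one-line fix: without /g the substitution removes only the first carriage return on a line, which is enough when each line carries just one.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the line with carriage returns removed.
sub strip_cr {
    my ($line) = @_;
    $line =~ s/\r//g;   # /g removes every CR on the line, not just the first
    return $line;
}

# Made-up sample lines ending in LF followed by CR (hex 0a 0d).
my @sample = (
    "Subject: hatch report\n\r",
    "Caddis were coming off at dusk.\n\r",
);

for my $logLine (@sample) {
    print strip_cr($logLine);   # each line now ends with a plain LF
}
```

Piping the output through od is a quick way to confirm that no 0d bytes remain.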