Parsing Comma Separated Value (CSV) Files with PHP

Learn PHPI hate trying to parse CSV with PHP. I get PO’d ASAP because everyone should be using TSV instead. I know, it can be as difficult to comprehend all of the acronyms as it is to figure out the comma separated values. Not impossible, just very time consuming.

I have had multiple occasions where someone gives me a file and asks me to parse it. The issue is its in CSV format. CSV Sucks! The problem is when you get someone trying to be cute so they put a comma (,) in one of the records.

So let me back up. CSV started out as a basic format to make data portable. A CSV file is basically a database in a way. Each line in the file represents a record. Each field within the record is delimited by a comma. Or so it is suppose to work. So typical CSV looks like this:


r1f1,r1f2,r1f3,r1f4

r2f1,r2f2,r2f3,r2f4

r3f1,r3f2,r3f3,r3f4

So you can see by the each row how the record (r) represents the record number and the field (f) is the field number. So as long as we get a file like this, we are fine. Everything parses nice and easy. But now, let’s talk about what happens when someone decides to use a comma inside an actual field.

Take the following set of records:


first name,last name,address1,address2,city,state,zip

first name1, last name1, "address1,apt",address2,city,state,zip

Do you see that fiendish little comma between address1 and apt? Do you notice the double quotes? That complicates things. The double quotes says, “Hey, everything in here is a record. Including any commas.” Swell! So ok, we have to figure out how to parse the records on every comma except a comma in between quotes.

This is ok until you see something like:


first name,last name,address1,address2,city,state,zip

first name1, last name1, "address1, ""of"" apt",address2,city,state,zip

Now what?! Well, in order to include double quotes, they have to be escaped so they are recognized as double quotes. This means that two double quotes consecutively placed are ignored as a delimiter, but it also needs to be stripped back to one double quote when the record is parsed.

So. How do we do all of this in PHP. Well look no further. I have a great solution that not only works on all of these cases, but it is also RFC compliant. The RFC 4180 is the specification for “Common Format and MIME Type for CSV Files“, published in 1997. Yeah I know. Archaic. Even worse is that it has been obsoleted by RFC 4234. The problem is, there are still CSV files floating around that need parsing.

Keep in mind this is just a quick thrown together script. Nothing polished. I will eventually turn this into a class and add it to my Google Code repository. But if you are in a hurry and need something immediate, here is the code.


function convertCSVtoTSV($line)
 {
 // Declare the new_line variable to make sure it starts empty
 $new_line = '';

 // replacement array for field data preparation
 $replacement_reg = array(
 '/[\n|\r|\t|\v|\f]/',            // remove all invisible return, newline, tabs, etc. from each individual field
 '/^(\s*|"*)(.*?)(\s*|"*)$/m',    // remove all white space from beginning and end of field
 '/^("*)(.*?)("*)$/m',            // remove all " from beginning and end of field
 '/""/',                            // replace all "" with " rfc4180-2.7
 );
 $replacement = array(
 "\\2",                            // replace all special characters
 "\\2",                            // replace all spaces at beginning and end of fields
 "\\2",                            // replace all " at beginning and end of fields
 '"',                            // reduce "" to " (if they exist)
 );

 // split the fields out into an array by the delimiter specified for CSV RFC
 $fields = preg_split('/,(?!(?:[^",]|[^"],[^"])+")/', $line);

 // process each fields cleansing superfluous chracters (spaces, control characters, and delimiting quotes)
 foreach($fields as $result){
 $result = preg_replace($replacement_reg, $replacement, $result);
 // add a tab to each new line
 $new_line .= $result."\t";
 }
 // replace the last tab with a new line feed
 $new_line = preg_replace('/\t$/', "\n", $new_line);
 return $new_line;
 }

Basically you read in a file a line at a time, pass it to this function line at a time, and it will return tab separated values for the record.

So if you want to give it a spin, here is some additional code to see it in action:


$mydata = array(
'testing,this,csv,option',
'"with a, in it",somewhere,in,here',
'also trying,"with a "" to",see ,what happens',
'also adding,additional test,"a "", with a comma",to see'
);

foreach ($mydata as $line)
{
 echo convertCSVtoTSV($line);
}

Conclusion

If you want to avoid the mess caused by CSV files, use TSV (Tab Sebarated Values). It is less likely that someone will use a tab in a field. It makes it much more convenient for parsing and manipulating the data as well. If you have any feedback, please post it. I would love to hear what you think.

Happy coding!

Like it? Post to your favorite location and share.
  • Digg
  • del.icio.us
  • Facebook
  • LinkedIn
  • Reddit
  • StumbleUpon
  • Twitter

Leave a Comment

You must be to post a comment.