Sweet Stuff

Stuff to Check out:

*NEW Restaurant Review: "Lazziz Persian Restaurant"- great new place to check out!

*NEW: article: "How To Block Apps on Your Facebook Wall"

*Movie review: Romantic comedies are not normally my interest, but this one was good, "Leap Year"

*NEW: article: "The Spiritual Discipline of Abiding in the Vine"

*WORSHIP LEADERS: What do new people encounter in your services? This article about song selection are thoughts you might consider.

*Contraversial Article "Has Modern Worship Become Corrupt?"

*"Why Pastor's Should Consider Quitting" - one of our most popular articles.

* Kim reviews "the Shack"

Be sure to subscribe to the RSS of your choice to stay up-to-date.

« MasterWriter- Songwriting software (Kim Gentes Worship Tech Blog) | Main | How To Blog & Podcast Your Life Away!! (Kim Gentes) »

Regex Pattern for Parsing CSV files with Embedded commas, double quotes and line breaks

While you have stumbled on KimGentes.com, you might be coming for a few different reasons. Some of you are interested in articles and resources on Christianity, music, worship and such. Others of you are interested in technology information related to church worship settings. Some other folks are programmers who are looking for helpful information on technical challenges. This particular post is a bleed over from some of my technical work in programming. Specifically, this is a post to present a solution to parsing CSV files.

Programmers understand that CSV files are simply text data files that have information stored in value fields in the file. Each of the fields is separated by commas to delimit when one value/field ends and the next begins. This is why they are called "Comma Delimited Value" files (CSV for short). Anyone who is new to this concept or programming might think that writing a program to extract data from files wherein the commas separate the data fields, should be an easy task. And if that was the total sum of it, it would be quick and simple in virtually any language you could choose to do it in. But that is not the end of it. CSV files are written by a host of popular applications and read by thousands of programs as well, including almost every spreadsheet program in existence, including Microsoft Excel. When the first CSV file user started outputting values to fields and reading them in another destination, they quickly realized a limitation- if you wanted to include the literal character of a comma (,) inside of a field value itself, this could not be done, since it would be interpreted as a field separator and its value wouldn't be understood (as well as the field in which it appeared being literally chopped in half).

To overcome this problem, it's assumed that some Neanderthal software developers (back in the Jurassic era of programming) came up with an idea to allow programs to insert and read commas inside of comma separated fields. They would allow fields to be encased in double quotes as a signature that the value inside this field should be read literally (including commas) from the first double quote to the ending double quote. This worked fine and commas could now be embedded in CSV field values. But, as you can guess, these cause further problems for programs- now, the commas of the world had safe haven usage inside of comma separated values, however, double quotes now could not be included inside of a double quote encased field value. Programmers quickly realized that they couldn't keep adding special characters to allow for current special characters to be escaped (which is a way of saying interpreted as literal data without functional consequence in the interpretation of the data).

So, to avoid using other characters to escape current special meaning characters, CSV file progenitors harkened that users could escape double quotes inside of double quote encased CSV fields by placing two double quotes together in the text. This would the standard way of escaping a double quote character ("), by simply placing to double quote characters next to each other, as in "".

All this is fine for the people and programs writing the data- its simple straightforward programming to output such information. But reading CSV files that have embedded double quotes, commas and can include embedded line breaks is a complicated concept. Such is the life of a programmer :). To meet this challenge, we often use a pattern parsing language called Regex (which stands for Regular Expressions).

Regex maybe the most popular language in the programming world. It is used in literally every high level programming language we know of in the world, including Visual Basic, C#, Javascript,  Java, PHP, Perl, Ruby and dozens more. It is included in several utilities such as search functions inside of UltraEdit and Ace Text. And it is included in most revisions of Unix (and other) OSes in command lines functions such as grep, Windows utilities powerGrep and so forth. Technically speaking Regex isn't a programming language on its own. It's a pattern matching engine that is often embedded inside of other languages. It became widely popular due to its inclusion primary in the Unix/Linux command line function of grep and the early web standard language of Perl. Now, most programmers can't conceive of a language that doesn't include some flavor of regex.

That all said, I have chosen to write a regex pattern that can handle parsing the fields of a CSV with all the conditions I mentioned above. There are plenty of other examples of CSV parsers around, but none seem to do the trick I was looking for, which is grandly frustrating when Excel can import and export a CSV with all the listed nuances quickly and easily. So, not finding a good solution, I have written a short CSV parsing pattern. It is below.

CSV-parser (regex pattern below)

 

^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$

 

*YOU ARE FREE to use this algorithm in any application (commercial or personal or whatever). It comes with no warrantees.  If you DO end up using this REGEX pattern, please do so with the following considerations:

 

  • make this notation in your source code:  (c)2008 Kim Anthony Gentes - FREE TO USE ANYWHERE.
  • Please post a response on this blog entry below (you do that by clicking on the "Comments" link at the bottom of this entry), saying you found this and are using it. I'd just like to know if its helping people and how people are using it.

 

When using the regex, some important things to know:

Options (turned on in your language/utility): ^ and $ match at line breaks

Description: below is a textual description of the regex pattern that may be helpful to programmers who want to understand what is happening in the regex.

 

Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match the regular expression below and capture its match into backreference number 1 «(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)»
   Match the regular expression below and capture its match into backreference number 2 «("(?:[^"]|"")*"|[^,]*)»
      Match either the regular expression below (attempting the next alternative only if this one fails) «"(?:[^"]|"")*"»
         Match the character “"” literally «"»
         Match the regular expression below «(?:[^"]|"")*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
            Match either the regular expression below (attempting the next alternative only if this one fails) «[^"]»
               Match any character that is NOT a “"” «[^"]»
            Or match regular expression number 2 below (the entire group fails if this one fails to match) «""»
               Match the characters “""” literally «""»
         Match the character “"” literally «"»
      Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^,]*»
         Match any character that is NOT a “,” «[^,]*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the regular expression below and capture its match into backreference number 3 «(,("(?:[^"]|"")*"|[^,]*))*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
      Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «*»
      Match the character “,” literally «,»
      Match the regular expression below and capture its match into backreference number 4 «("(?:[^"]|"")*"|[^,]*)»
         Match either the regular expression below (attempting the next alternative only if this one fails) «"(?:[^"]|"")*"»
            Match the character “"” literally «"»
            Match the regular expression below «(?:[^"]|"")*»
               Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
               Match either the regular expression below (attempting the next alternative only if this one fails) «[^"]»
                  Match any character that is NOT a “"” «[^"]»
               Or match regular expression number 2 below (the entire group fails if this one fails to match) «""»
                  Match the characters “""” literally «""»
            Match the character “"” literally «"»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^,]*»
            Match any character that is NOT a “,” «[^,]*»
               Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the end of a line (at the end of the string or before a line break character) «$»

PrintView Printer Friendly Version

EmailEmail Article to Friend

Reader Comments (12)

All hail regular expressions! I might have to do actual work without them.

October 15, 2008 | Unregistered CommenterMatt Self

ha! I hear ya Matt... ya.. they are good.. its nice that you can get something like PowerGrep for the PC and use it locally.. btw, if you ever need a good Regex tool (to help test patterns), check out Regex Buddy.. it's great at debugging them.

October 15, 2008 | Registered CommenterKim Gentes

Couldn't get this to work in .net :(

December 1, 2008 | Unregistered Commenterjason

Jason,

be sure to check whether your regex .net flavor has the Options turned on in your language/utility for ^ and $ to match to line breaks.

also, the core part of the pattern that matches a single field is this
("(?:[^"]|"")*"|[^,]*)

if I had a 3 field pattern all the time, I would do the following
^("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*),("(?:[^"]|"")*"|[^,]*)$

some patterns are better set as predetermined scope (such as 3 reps like above than using the unknown * iterator if you aren't sure how to unwind the stack once you find the pattern).

Kim

December 2, 2008 | Registered CommenterKim Gentes

you mentioned the .net implementation... if I used VB.Net, i would write my 3 field iterator as follows:

Try
Dim RegexObj As New Regex("^(""(?:[^""]|"""")*""|[^,]*),(""(?:[^""]|"""")*""|[^,]*),(""(?:[^""]|"""")*""|[^,]*)$", RegexOptions.Multiline)
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
' matched text: MatchResults.Value
' match start: MatchResults.Index
' match length: MatchResults.Length
MatchResults = MatchResults.NextMatch()
End While
Catch ex As ArgumentException
'Syntax error in the regular expression
End Try

December 2, 2008 | Registered CommenterKim Gentes

I use biterscripting a lot for parsing of CSV files. They have a sample script at http://www.biterscripting.com/Download/SS_CSV.txt . You can parse by string, line, word (separated by commas), character. Further, you can change the delimiter values, even between lines of the same file, even depending on previous lines, by changing variables $wsep and $lsep.

Patrick

January 9, 2009 | Unregistered CommenterPatrick Mc

Kim--

Using your Regex for some in-house code. Made the following changes for it to work on C#:

string pattern = @"(""(?:[^""]|"""")*""|[^,]*)(?:,(""(?:[^""]|"""")*""|[^,]*))*";

Note removed the start and end line anchors since I'm processing line-by-line. Easy enough to add them back in.

Thanks for posting your Regex. The english-breakdown helped!

James

July 27, 2009 | Unregistered CommenterJames

I found your parser, and was about to use it (Thanks!) until I found this site:
http://www.programmersheaven.com/user/Jonathan/blog/73-Splitting-CSV-with-regex/
whose author, along with Guillaume Roderick, came up with the wonderful:

Ruby: string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

Perl: $fields = preg_split("/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/", $string)

Splitting a string (or line) means putting every piece into an array (or list), which is almost always what you want to do. These lookahead regexps are explained on Jonathan's site. They are fully general.

July 30, 2009 | Unregistered CommenterC. Jack Bamford

Thanks for the original idea.
I could not make the repeating group work in C#, so have created this modified one and use Matches to find the values. use with the ExplicitCapture flag.
Groups[1] contains the result for each match.

(((?<x>(?=[,\r\n]+))|"(?<x>([^"]|"")+)"|(?<x>[^,\r\n]+)),?)

Explanation:

Match type 1: a lookahead checks if we have an empty element - if it finds a comma or a newline. this results in an empty match.
or
Match type 2: if 1 fails, check for a quoted string
or
Match type 3: then for a non-quoted string. this is taken from the original regex.

Finally, consume a comma if there is one.

Behavior:

- newline characters inside "" will become part of the value.
- empty lines outside "" result in one empty value
- escaped quotes inside quoted strings ("") need to be replaced afterwards!

note: if the last line is empty AND the previous line had a comma at the end, no extra result is given. I have no fix for this yet.

C# code to create arrays from csv lines: (uses linq)


string[] SplitCsvLine(string line)
{
return (from System.Text.RegularExpressions.Match m in System.Text.RegularExpressions.Regex.Matches(line,
@"(((?<x>(?=[,\r\n]+))|""(?<x>([^""]|"""")+)""|(?<x>[^,\r\n]+)),?)",
System.Text.RegularExpressions.RegexOptions.ExplicitCapture)
select m.Groups[1].Value).ToArray();
}

Good luck.

November 24, 2009 | Unregistered Commenterh. brouwer

I had become intrigued by the use of Regular Expressions as a way to parse CSV files. I began a project to test these formulas as soon as I learned of their potential as a solution. I was hoping to find some sample code or even a distributable DLL that could be employed for this purpose. So far the most definitive work I’ve seen was the regular expression formulas shared on this blog.

The formula provided by the original author of this blog for some reason does not work for me. When I use this formula on a single string record, it returns the entire record back to me in the first member of the collection created by the MS Regx process. I’m using the Microsoft vbscript.dll version 5.5 in a VB6 program. I’m referring to the following formula:

^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$

I then tried the shorter version offered by Kim and had much better results.

("(?:[^"]|"")*"|[^,]*)

The problem with this second formula is that it creates an extra empty string field between each data column. I could live with this except I wasn’t sure if this would be consistent for any type of CSV format or if there was something about my test data that was causing this. I tried running more exotic string combinations and unfortunately the formula didn’t work. Here is the exotic string to which I was referring:

"000523"",1",",HOLTSVILLE"",""40.81518"",""-73.0455"",""5"",""25"",""103"",""Y"",""V13916"",""U"",""""",,A

(note: The above two lines should be a single line)

The correct and actual field breaks (as can be verified with Excel) create four fields and occur as follows:

000523",1
,HOLTSVILLE","40.81518","-73.0455","5","25","103","Y","V13916","U",""
(empty field)
A

I realize this is a tough string to parse but of course, that was the intention. In any event, so far I’ve yet to find a RegX formula that would pass the test. And yet, Excel has no problem with it at all, which still amazes me.

I finally decided to write my own DLL which I intend to use whenever again faced with trying to read and process a correctly formatted CVS file. I realize XML is the chosen way to go for data exchange these days and in many ways find XML superior to flat text files, but there are still legacy processes around and it is nice to know I can handle the situation should it become necessary. Also, there are some files so large that I find XML to be impractical. I’m referring to flat files that or 10 or 15 gig in size. I’ve never tried to process an XML file this large and wonder whether or not it is possible.

Testing my new DLL, I learned that this process is highly susceptible to attain self-serving results. That is why I’ve tried to develop some very unfriendly data to challenge the logic.

As a way to maintain sanity while testing, I chose a simple approach. My “proof of process” was that if Excel can parse the record correctly then I should be able to parse it in exactly the same way with the same results. So I used Excel to verify every string I’ve subjected to my parsing routine. If I attained the same results, I took it to mean that my process was correct.

I’ve used test media from a number of sources, including some I've created using Excel. These were chosen with the sole purpose of trying to confound the logic with highly improbably data. Nevertheless I've no doubt there are probably perfectly valid files out there that my program might not be able to handle. I certainly have not exercised all the conditions that have been included in the logic. If anyone would care to contribute some complicated CSV format files for me to test, I would appreciate hearing from you. You may email them to me but please limit the size of the file(s) to under 25K. Again, I stress, these data must be able to pass the “Excel” test for me to consider them as a viable challenge.

As time permits, I will return the results and if the DLL proves viable, it can be shared should there still be anyone out there still using VB6. Eventually, I will make this a VB.Net program after I am sure it is working.

December 27, 2009 | Unregistered CommenterVisual Basic Programmer

Thanks for your notes all.. Especially all those who have chosen to post your updates and share your code here so others could benefit. VBasic, if you feel free to, we'd love to see what you came up with too... Thanks for posting your thoughts and notes.. Feel free to add your own flavors and updates all, as you have them for your specific apps.

Kim

December 28, 2009 | Registered CommenterKim Gentes

Thanks so much for posting this, and all the helpful comments evrybody's contributed. It's just what I needed and saved me a load of time!

January 7, 2010 | Unregistered CommenterMike

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>