CSV parsing joy

Hi there,

Was wondering if there was an elegant solution to the following requirement.

I have data arriving in one big string, with the following format:

"abc""def""ghi","pqr""stu""vwx"

That is

* Each 'row' of data is separated with commas

* Each row has the same number of individual values

* Each value is separated with a single tab character

* Each value is wrapped in double quotes

The simplest approach would be to use split like:

for (String row : data.split(",")) {
    for (String value : row.split("\\t")) {
        // etc..
    }
}

But, a value can contain a comma. So I have had to resort to:

for (String row : data.split(getRowSplitRegex())) {
    // etc..
}

private String getRowSplitRegex() {
    // Match a comma only when it sits directly between two double quotes
    String positiveLookBehindForQuote = "(?<=\")";
    String positiveLookAheadForQuote = "(?=\")";
    return positiveLookBehindForQuote + "," + positiveLookAheadForQuote;
}

That is, split on comma, so long as the characters before and after were both double quotes.

However, this does not deal with the pathological case where a value is itself the string ",". In that case, example data would look like:

"abc"",""ghi","pqr""stu""vwx"

And the fancy regex won't work for this.
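
To make the failure concrete, here is a small sketch that runs the same lookbehind/lookahead pattern over both sample strings. The class name SplitDemo and the helper rows() are made up for illustration, and the tabs between values are written as \t on the assumption that the real data contains them as described above.

```java
class SplitDemo {
    // Same pattern as getRowSplitRegex() above: a comma sitting between two quotes.
    static final String ROW_SPLIT = "(?<=\"),(?=\")";

    static String[] rows(String data) {
        return data.split(ROW_SPLIT);
    }

    public static void main(String[] args) {
        String ok = "\"abc\"\t\"def\"\t\"ghi\",\"pqr\"\t\"stu\"\t\"vwx\"";
        String bad = "\"abc\"\t\",\"\t\"ghi\",\"pqr\"\t\"stu\"\t\"vwx\"";
        System.out.println(rows(ok).length);   // 2: one piece per row
        System.out.println(rows(bad).length);  // 3: the "," value is cut in half
    }
}
```

The comma inside the "," value also has a quote on each side, so the pattern cannot tell it apart from a real row separator.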

Since the number of values in a row is constant (based on a header section), perhaps I could build up a regex for repeated rows, and use Matcher.find?
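
A rough sketch of what that could look like, assuming the column count is already known from the header and that no value contains a double quote. FixedWidthRowParser, rowPattern and parse are names invented for this example, not anything from an existing library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedWidthRowParser {
    // Builds a pattern that matches one whole row of `columns` quoted values.
    static Pattern rowPattern(int columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns; i++) {
            if (i > 0) sb.append('\t');   // values are separated by a single tab
            sb.append("\"(.*?)\"");       // reluctant, so each value stays as short as possible
        }
        sb.append("(?:,|\\z)");           // a row ends at a comma or at the end of the input
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    static List<String[]> parse(String data, int columns) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = rowPattern(columns).matcher(data);
        while (m.find()) {
            String[] row = new String[columns];
            for (int i = 0; i < columns; i++) {
                row[i] = m.group(i + 1);  // capture groups are numbered from 1
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Calling FixedWidthRowParser.parse(data, 3) on the pathological string should give two rows, with "," as the second value of the first row. One caveat: find() silently skips input that does not fit the pattern, so malformed data would go unnoticed unless the match positions are checked.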

Any more elegant solutions?

Thanks, Neil

[2423 byte] By [toolkita] at [2007-11-27 11:48:33]
# 1

I suppose I could use ANTLR or something similar and define my tokens and parsing rules appropriately?

toolkita at 2007-7-29 18:19:56
# 2

I bet there's a regular expression that will do it for you. But I'm no regex expert! Stick around for a bit, there are a few regulars with black belts in regex who will undoubtedly be happy to help.

georgemca at 2007-7-29 18:19:56
# 3

> * Each row has the same number of individual values

Couldn't you substring it by counting the whitespace characters, with each new substring starting where the previous one ended?

mark07a at 2007-7-29 18:19:56
# 4

> CSV parsing joy

This is not a CSV format. There are rules that need to be followed and using a "," to denote a new line is not one of the rules. So if you want to use standard CSV parsers then you need to change the format of your data. Then search the forums for standard CSV parsers.

Otherwise you need to parse the string one character at a time. Every time you find a quote you set a variable, let's say, inQuote. The next time you find a quote you toggle the variable. Then when you find a "," you check the variable to determine whether you are in a quote or not.
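
Something along these lines, as a minimal sketch rather than a finished parser. It assumes values never contain a double quote themselves, and QuoteToggleParser is just a name picked for the example.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteToggleParser {
    static List<List<String>> parse(String data) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder value = new StringBuilder();
        boolean inQuote = false;
        for (char c : data.toCharArray()) {
            if (c == '"') {
                if (inQuote) {            // closing quote: the value is complete
                    row.add(value.toString());
                    value.setLength(0);
                }
                inQuote = !inQuote;       // toggle on every quote
            } else if (inQuote) {
                value.append(c);          // commas and tabs inside quotes are data
            } else if (c == ',') {        // a comma outside quotes ends the row
                rows.add(row);
                row = new ArrayList<>();
            }
            // anything else outside quotes (the tab separators) is skipped
        }
        if (!row.isEmpty()) {
            rows.add(row);                // the last row has no trailing comma
        }
        return rows;
    }
}
```

With this, a value that is itself "," is handled like any other, because the comma is only treated as a row break while inQuote is false.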

camickra at 2007-7-29 18:19:56