String.split issue using regular expressions

Hello All,

First time posting, so please forgive me if I violate any etiquette....

I have a standard csv text file that I am reading line by line. I thought each line was formatted as follows:

FieldA, FieldB, FieldC, ......, FieldX

and I was using str.split(",") to separate into tokens.

However, I have found that some lines contain commas that are not supposed to be part of the parsing. For example, one line may look like

FieldA, FieldB, "Field C has some commas, commas, and more commas in it", ...., FieldX

Whenever a line contains non-separating commas, the author is very careful to enclose the entire field containing them in double quotes. So what I would like to do is create a split expression that splits the line on commas that are not inside double quotes, but I have no idea how to do it. I have looked at the regex section of the tutorial and tried

str.split(",&&[^\".,.\"]");

but it does not work.
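
A likely reason, sketched below: in Java regexes, && acts as an intersection operator only inside a character class. Outside one it is just two literal ampersands, so the pattern above looks for a comma followed by "&&" and one more character that is not a quote, dot, or comma. The class and method names here (AmpersandDemo, splitWithAttempt) are made up for illustration.

```java
import java.util.Arrays;

class AmpersandDemo {
    // The pattern from the post, unchanged.
    static final String ATTEMPT = ",&&[^\".,.\"]";

    static String[] splitWithAttempt(String line) {
        return line.split(ATTEMPT);
    }

    public static void main(String[] args) {
        String line = "FieldA, FieldB, \"Field C, with commas\", FieldX";
        // The line contains no ",&&", so split() finds no delimiter
        // and returns the whole line as a single element.
        System.out.println(splitWithAttempt(line).length);                 // prints 1
        // Only a literal ",&&" plus one further character acts as a delimiter.
        System.out.println(Arrays.toString(splitWithAttempt("x,&&y,z")));  // prints [x, ,z]
    }
}
```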

Any help appreciated!!

[1075 byte] By [j_embrya] at [2007-11-27 0:36:05]
# 1
For example, you can split on a plain "," and then join the fragments that belong together back into one string, starting at the fragment that begins with " and stopping at the fragment that ends with " :)
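
A minimal sketch of that idea, under the assumption that quoted fields never contain escaped quotation marks and that a quote only ever appears at the start or end of a field. The names SplitAndRejoin, parse and unquote are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndRejoin {
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder pending = null; // non-null while we are inside a quoted field
        // The -1 limit keeps trailing empty fields instead of dropping them.
        for (String piece : line.split(",", -1)) {
            if (pending == null) {
                String t = piece.trim();
                boolean opens = t.startsWith("\"");
                boolean closes = t.length() > 1 && t.endsWith("\"");
                if (opens && !closes) {
                    pending = new StringBuilder(piece); // quoted field continues in later pieces
                } else {
                    fields.add(unquote(t));
                }
            } else {
                // Put back the comma that split() removed.
                pending.append(',').append(piece);
                if (piece.trim().endsWith("\"")) {
                    fields.add(unquote(pending.toString().trim()));
                    pending = null;
                }
            }
        }
        if (pending != null) {
            // Unterminated quote: keep the text as it is rather than lose it.
            fields.add(pending.toString().trim());
        }
        return fields;
    }

    // Strips one pair of enclosing double quotes, if present.
    static String unquote(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        String line = "FieldA, FieldB, \"Field C has some commas, commas, and more commas in it\", FieldX";
        for (String field : parse(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

With the sample line from the question this should give four fields, the third one with its commas intact and its quotes removed.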
Michael.Nazarov@sun.coma at 2007-7-11 22:45:15
# 2

The best way to parse CSV data is to use a dedicated tool, like the ones listed in [url=http://en.wikipedia.org/wiki/CSV_application_support]this article[/url]. If you have to use regexes, or just want to learn how, a positive matching approach (matching the fields themselves rather than the separators between them) is preferable to split(). The following code, a modification of some sample code[1] that appears in [url=http://www.oreilly.com/catalog/regex3/index.html]The Book[/url], assumes that quoted fields in your data may contain escaped (doubled) quotation marks in addition to commas, but may not contain line separators.

import java.util.*;
import java.util.regex.*;

public class Test
{
    public static void main(String... args)
    {
        String str =
            "FieldA, FieldB, \"Field C with commas, commas, and more commas\", , FieldX";
        List<String> fields = parseCsvLine(str);
        int i = 0;
        for (String s : fields)
        {
            System.out.printf("%nField %d: [%s]%n", i++, s);
        }
    }

    public static List<String> parseCsvLine(String line)
    {
        String regex =
            "(?<=^|,)[ \t]*+" +                 // Optional leading whitespace,
            "(?:" +                             // followed by either...
            "\"([^\"]*+(?:\"\"[^\"]*+)*+)\"" +  // ...a quoted field...
            "|" +                               // ...or...
            "([^\",]*+)" +                      // ...some non-quoted text,
            ")[ \t]*+";                         // and optional trailing whitespace.

        // Create a matcher for CSV fields, using the regex above.
        Matcher mMain = Pattern.compile(regex).matcher(line);

        // Create a matcher for doubled double-quotes.
        Matcher mQuote = Pattern.compile("\"\"").matcher("");

        List<String> result = new ArrayList<String>();
        while (mMain.find())
        {
            // If the field was not quoted (group 2 took part in the match),
            // take it as it is; if it was quoted (group 1), unescape any
            // embedded quotation marks.
            String field = (mMain.start(2) != -1)
                ? mMain.group(2).trim()
                : mQuote.reset(mMain.group(1)).replaceAll("\"");
            result.add(field);
        }
        return result;
    }
}

[1] http://regex.info/listing.cgi?ed=3&p=401
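
For comparison, here is a sketch of what a split()-based version might look like, since that is what the original question asked for. It treats a comma as a separator only when an even number of double quotes follows it up to the end of the line. The names LookaheadSplit, SEPARATOR and split are invented for this example, and it assumes every quoted field is properly closed. The lookahead rescans the rest of the line for each comma, and the quotes still have to be stripped afterwards, which is part of why matching the fields directly tends to be the cleaner route.

```java
import java.util.ArrayList;
import java.util.List;

class LookaheadSplit {
    // A comma, provided the rest of the line holds an even number of double quotes.
    static final String SEPARATOR = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        // The -1 limit keeps trailing empty fields.
        for (String piece : line.split(SEPARATOR, -1)) {
            String t = piece.trim();
            if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
                // Drop the enclosing quotes and unescape doubled quotes.
                t = t.substring(1, t.length() - 1).replace("\"\"", "\"");
            }
            fields.add(t);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "FieldA, FieldB, \"Field C with commas, commas, and more commas\", , FieldX";
        int i = 0;
        for (String field : split(line)) {
            System.out.printf("Field %d: [%s]%n", i++, field);
        }
    }
}
```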

uncle_alicea at 2007-7-11 22:45:15