String.split issue using regular expressions
Hello All,
First time posting, so please forgive me if I violate any etiquette....
I have a standard csv text file that I am reading line by line. I thought each line was formatted as follows:
FieldA, FieldB, FieldC, ......, FieldX
and I was using str.split(",") to separate into tokens.
However, I have found that some lines contain commas that are not supposed to be part of the parsing. For example, one line may look like
FieldA, FieldB, "Field C has some commas, commas, and more commas in it", ...., FieldX
Anytime that the line contains non-separating commas, the author is very careful to enclose the entire field containing the ignorable commas in double quotes. So, what I would like to do is to create a split expression that will split the line based on commas that are not inside of double quotes, but I have no idea how to do it. I have looked at the regex area of the tutorial and tried
str.split(",&&[^\".,.\"]");
but it does not work.
Any help appreciated!!
[1075 byte] By [j_embrya] at [2007-11-27 0:36:05]

The best way to parse CSV data is to use a dedicated tool, like the ones listed in [url=http://en.wikipedia.org/wiki/CSV_application_support]this article[/url]. If you have to use regexes, or just want to learn how, a positive matching approach is preferable to split(). The following code, a modification of some sample code[1] that appears in [url=http://www.oreilly.com/catalog/regex3/index.html]The Book[/url], assumes quoted fields in your data may contain escaped quotation marks in addition to commas, but may not contain line separators.

import java.util.*;
import java.util.regex.*;

public class Test
{
  public static void main(String... args)
  {
    String str =
      "FieldA, FieldB, \"Field C with commas, commas, and more commas\", , FieldX";
    List<String> fields = parseCsvLine(str);
    int i = 0;
    for (String s : fields)
    {
      System.out.printf("%nField %d: [%s]%n", i++, s);
    }
  }

  public static List<String> parseCsvLine(String line)
  {
    // Group 1 captures the content of a quoted field,
    // group 2 captures an unquoted field.
    String regex =
      "(?<=^|,)[ \t]*+"+                      // Optional leading whitespace,
      "(?:"+                                  // followed by either...
        "\"([^\"]*+(?:\"\"[^\"]*+)*+)\"" +    // ...a quoted field...
      "|"+                                    // ...or...
        "([^\",]*+)" +                        // ...some non-quoted text,
      ")[ \t]*+";                             // and optional trailing whitespace.

    // Create a matcher for CSV fields, using the regex above.
    Matcher mMain = Pattern.compile(regex).matcher(line);

    // Create a matcher for doubled double-quotes.
    Matcher mQuote = Pattern.compile("\"\"").matcher("");

    List<String> result = new ArrayList<String>();
    while (mMain.find())
    {
      // If the field was not quoted, take it as it is (minus surrounding
      // whitespace); if it was quoted, unescape any embedded quotation marks.
      String field = (mMain.start(2) != -1) ? mMain.group(2).trim()
                   : mQuote.reset(mMain.group(1)).replaceAll("\"");
      result.add(field);
    }
    return result;
  }
}
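
To see what the quoted-field part of the regex is meant to accept, here is a small standalone sketch that isolates a sub-pattern of the same shape and runs it against fields containing doubled quotation marks. The class name QuotedFieldDemo and the helper unquote() are made up for this illustration, and the sketch assumes the usual CSV convention that a quotation mark inside a quoted field is written as two quotation marks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedFieldDemo
{
    // Hypothetical helper pattern: an opening quote, then runs of non-quote
    // characters separated by doubled quotes, then the closing quote.
    private static final Pattern QUOTED =
        Pattern.compile("\"([^\"]*+(?:\"\"[^\"]*+)*+)\"");

    // Returns the unescaped content of a quoted field, or null if the
    // whole input is not a well-formed quoted field.
    public static String unquote(String field)
    {
        Matcher m = QUOTED.matcher(field);
        if (!m.matches())
        {
            return null;
        }
        // Each doubled quote stands for one literal quote.
        return m.group(1).replace("\"\"", "\"");
    }

    public static void main(String[] args)
    {
        // Prints: He said "hi", then left
        System.out.println(unquote("\"He said \"\"hi\"\", then left\""));
        // Prints: ends with a quote"
        System.out.println(unquote("\"ends with a quote\"\"\""));
    }
}
```

The second call is the interesting one: the escaped quote sits right before the closing quote, so the run of non-quote characters after it is empty and the pattern has to allow that.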
[1] http://regex.info/listing.cgi?ed=3&p=401
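
For comparison, if you really want to stay with split(), one commonly used trick is to split only on commas that are followed by an even number of double quotes. What follows is a rough sketch of that idea, not a full CSV parser: the names SplitCsvSketch and splitCsvLine() are made up for this example, and it assumes the quotes on each line are balanced and that fields do not contain line separators. The lookahead rescans the rest of the line for every comma, so it will probably be slower than the matching approach on long lines.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCsvSketch
{
    // A comma counts as a separator only if an even number of double quotes
    // follows it on the line, i.e. the comma is not inside a quoted field.
    private static final String SEPARATOR =
        ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static List<String> splitCsvLine(String line)
    {
        List<String> result = new ArrayList<String>();
        // A limit of -1 keeps trailing empty fields.
        for (String token : line.split(SEPARATOR, -1))
        {
            String field = token.trim();
            if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\""))
            {
                // Drop the enclosing quotes and unescape doubled quotes.
                field = field.substring(1, field.length() - 1).replace("\"\"", "\"");
            }
            result.add(field);
        }
        return result;
    }

    public static void main(String[] args)
    {
        String line =
            "FieldA, FieldB, \"Field C with commas, commas, and more commas\", , FieldX";
        // Prints five bracketed fields, one per line; the fourth is empty.
        for (String field : splitCsvLine(line))
        {
            System.out.println("[" + field + "]");
        }
    }
}
```

Doubled quotes inside a quoted field do not upset the count, because they always add two quotes at a time.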