Weird (well for me anyway) SAX parsing problem

Hi,

I have to use SAX to parse huge XML files (the biggest is around 8 GB; genomic data) and load the contents into MySQL. I get an ArrayIndexOutOfBoundsException when I try to parse a perfectly valid string from an element after a few thousand entries.

Let's say that my string is "G/T". I use split("/") to get the string array and read item[0] and item[1]. The error always comes up when I try to read item[1]. The weird part is that if I remove the part of the file that has already been read, parsing works perfectly for another few thousand elements and then fails again with the same error.

Any insights would be greatly appreciated.

Sylvain Foisy, Ph. D.

Bio-informatician

Inflammgen.org

Montreal Cardiology Institute

Montreal,Qc

[748 byte] By [foisysa] at [2007-11-26 16:44:16]
# 1

If you have a string "G/T" and you call split("/") you will always get an array with two elements in it. The fact that the string was produced by a SAX parser makes no difference.

So it seems that you don't actually have that string; you are only assuming you do.

Often when people have "weird SAX problems" it is because they assume that a single call to the characters() method always delivers an entire text node from the document. However, the parser is allowed to break the text node into pieces and call the characters() method once for each piece.

This may be your problem: for example, the parser may have called characters() with "G/" and then called it again with "T".
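To illustrate why a partial chunk would fail on item[1] specifically, here is a minimal sketch of what split("/") returns for a whole string and for a truncated one. The class name SplitDemo and the helper alleles() are made up for this example; only String.split() comes from the original post. Note that split() drops trailing empty strings, so "G/" yields a one-element array.

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: splits one chunk of text on "/".
    static String[] alleles(String chunk) {
        return chunk.split("/");
    }

    public static void main(String[] args) {
        // Whole text node: two elements, so item[1] exists.
        System.out.println(Arrays.toString(alleles("G/T"))); // [G, T]

        // Truncated chunk: the trailing empty string is dropped,
        // so the array has length 1 and item[1] is out of bounds.
        System.out.println(Arrays.toString(alleles("G/")));  // [G]
        System.out.println(alleles("G/").length);            // 1
    }
}
```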

DrClapa at 2007-7-8 23:11:34 > top of Java-index,Enterprise & Remote Computing,Enterprise Technologies...
# 2

Hi,

In the characters() method from DefaultHandler, it seems that every once in a while the string passed in has not been read in full by the parser. The string for this element can only be something like "G/T". When the parser breaks, I get the string "G/" instead. The logic of the parser goes like this:

-look for the tag OBSERVED;

-when found, get the string attached to it (something like "G/T");

-split it into two chars: G and T;

-assign the chars to variables.

I know that the problem comes from the parser getting only part of the string rather than the whole of it, hence my ArrayIndexOutOfBoundsException.

I am looking for a way to force the parser to get the whole string.

A+

Sylvain Foisy, Ph. D.

Bio-informatician

Inflammgen.org

Montreal Cardiology Institute

Montreal,Qc

foisysa at 2007-7-8 23:11:34
# 3

> In the characters() method from DefaultHandler, it seems that every once in a while the string returned is not read in full by the parser.

Yes, that's what I said. That is a documented feature of SAX. You may find that characters() is called two or more times for a single text node.

> I am looking for a way to force the parser to get the whole string.

1. In the startElement() method, create a new StringBuffer.

2. In the characters() method, append the data to the StringBuffer, using the three-parameter append(char[], int, int) method with the same array, start and length that characters() received.

3. In the endElement() method you have all the data in the StringBuffer.
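The three steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the class name ObservedHandler, the alleles list and the parse() helper are invented here, and the element name OBSERVED is taken from the earlier post. The real handler would write to MySQL instead of collecting results in a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ObservedHandler extends DefaultHandler {
    // Non-null only while the parser is inside an OBSERVED element.
    private StringBuffer text;
    // Hypothetical result store; one String[] per OBSERVED element.
    final List<String[]> alleles = new ArrayList<String[]>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("OBSERVED".equals(qName)) {
            text = new StringBuffer();          // step 1: fresh buffer
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);     // step 2: may run several times
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("OBSERVED".equals(qName)) {
            // step 3: the buffer now holds the complete text, e.g. "G/T"
            alleles.add(text.toString().trim().split("/"));
            text = null;
        }
    }

    // Hypothetical convenience method: parses an XML string with this handler.
    static List<String[]> parse(String xml) throws Exception {
        ObservedHandler handler = new ObservedHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.alleles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<snps><OBSERVED>G/T</OBSERVED><OBSERVED>A/C</OBSERVED></snps>";
        for (String[] pair : parse(xml)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

Because the split happens only in endElement(), it no longer matters how many pieces the parser delivers the text in.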

DrClapa at 2007-7-8 23:11:34
# 4

Hi,

It works beautifully ;-)

Thanks for bearing with my thickness...

Sylvain

foisysa at 2007-7-8 23:11:34