XMLStreamReader byte location mismatch

Hello,

I am trying to parse an XML file and store the byte positions of the start and end tags, so that I can later use RandomAccessFile to read a particular XML element directly.

This is meant for very large XML files. I want to be able to pull out an arbitrary XML element given only its XPath expression from the root element.

But I am having some trouble with it.

I tried to use XMLStreamReader to parse the XML and getLocation() to get the byte offset, but the position I get when a START_ELEMENT event occurs is not the start of the XML tag! It is somewhere after that. I also tried the positions reported for the previous and next events, but none of them matches the start of the tag.

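import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// "is" is the InputStream of the XML file
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(is);
while (xsr.hasNext()) {
    // advance to the next event; without this call the loop never moves forward
    xsr.next();
    if (xsr.isStartElement()) {
        int currentPos = xsr.getLocation().getCharacterOffset();
        System.out.println("current position: " + currentPos);
    }
}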
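To make the behaviour easier to reproduce, here is a minimal self-contained sketch of the same loop that records the element name together with the offset reported at each START_ELEMENT. The class name OffsetProbe and the method probe are made up for this illustration, and the sample input is the unformatted test document from this thread. The StAX API documentation describes the information in a Location as optional, and getCharacterOffset() as a byte offset only when the source is a byte stream, so the printed values should be treated as implementation-specific rather than as guaranteed tag start positions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class OffsetProbe {
    // The unformatted test document from this thread (hypothetical constant name).
    public static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<first><second><third><forth>Text1a</forth><forth>Text1b</forth>"
        + "<forth>Text1c</forth></third></second><second><third>Text2</third>"
        + "</second></first>";

    // Returns one "name@offset" entry per START_ELEMENT event.
    public static List<String> probe(byte[] xml) throws XMLStreamException {
        List<String> hits = new ArrayList<>();
        XMLStreamReader xsr = XMLInputFactory.newInstance()
            .createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT) {
                    // The offset is whatever the parser implementation reports
                    // at this event; it may be -1 if no offset is available.
                    int offset = xsr.getLocation().getCharacterOffset();
                    hits.add(xsr.getLocalName() + "@" + offset);
                }
            }
        } finally {
            xsr.close();
        }
        return hits;
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] xml = SAMPLE.getBytes(StandardCharsets.UTF_8);
        for (String hit : probe(xml)) {
            System.out.println(hit);
        }
    }
}
```

Comparing the printed offsets with the real index of each "<" in the input shows how far the reported position is from the tag start for a given parser.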

It would be very helpful if somebody could give me a hint as to how to solve my problem, or where to find a better way to do this.

Thanks a lot!

[1402 byte] By [Flossya] at [2007-11-27 11:27:35]
# 1

I tried some more and it seems that the position of the XMLEvent changes when the layout of the XML changes.

For example, using the same code to retrieve the byte positions of a specific tag, I get different results.

Test XML files:

XML 1:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<first>

<second>

<third>

<forth>Text1a</forth>

<forth>Text1b</forth>

<forth>Text1c</forth>

</third>

</second>

<second>

<third>Text2</third>

</second>

</first>

XML 2 (same XML, but unformatted):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><first><second><third><forth>Text1a</forth><forth>Text1b</forth><forth>Text1c</forth></third></second><second><third>Text2</third></second></first>

When querying the positions for the start and stop tags of the first occurrence of the "forth" element and then using RandomAccessFile to read in those positions, I get the following results:

XML 1:

"<forth>Text1a</forth>"

XML 2:

"th>Text1a</forth><fort"
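The read step described above can be sketched roughly as follows. This is only an illustration: the class name RangeReader and the method readRange are made up, the range is treated as start inclusive and end exclusive, and the file is assumed to be UTF-8 encoded. In the demo the offsets are computed with indexOf on an ASCII-only string, so they are known to be correct byte positions. With offsets taken from the parser, the same read returns whatever bytes those offsets happen to cover.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class RangeReader {
    // Reads the bytes in [start, end) from the file and decodes them as UTF-8.
    public static String readRange(File file, long start, long end) throws IOException {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range: " + start + ".." + end);
        }
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            byte[] buf = new byte[(int) (end - start)];
            raf.seek(start);     // jump directly to the start offset
            raf.readFully(buf);  // throws EOFException if the range runs past the file end
            return new String(buf, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small ASCII-only sample, so character index equals byte offset.
        String xml = "<first><forth>Text1a</forth></first>";
        File tmp = File.createTempFile("range", ".xml");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), xml.getBytes(StandardCharsets.UTF_8));

        int start = xml.indexOf("<forth>");
        int end = xml.indexOf("</forth>") + "</forth>".length();
        System.out.println(readRange(tmp, start, end)); // prints <forth>Text1a</forth>
    }
}
```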

I would have expected that the layout of the XML file has no impact.

Is this a bug or can anybody explain what I must have overlooked?

Thanks!

Flossya at 2007-7-29 16:17:02