how to get byte offset

Hello,

I am trying to parse an XML file and want to store the byte positions of the start and closing tags so that I can later use RandomAccessFile to fetch a particular XML element. But I am having some trouble with it...

I haven't found any XML parser API that would allow me to do this, so I tried to parse the file myself.

I tried to use CountingInputStream to keep track of the byte position. But when I wrap it in an InputStreamReader and read a single character, the position reported by the CountingInputStream is not that of the first character but rather the end of what seems to be a buffer. So even if I read char by char, I get the same byte position for all of them until the next chunk is buffered.

I couldn't find a way to switch off the buffering of the reader.

It would be very helpful if somebody could give me a hint as to how to solve my problem.

Thanks a lot!

[898 byte] By [Flossya] at [2007-11-27 11:24:24]
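
The read-ahead described in the question can be reproduced with a minimal sketch. ByteCounter is a hypothetical stand-in for CountingInputStream (it is not the class the poster used), and the sample document is made up. InputStreamReader decodes from an internal byte buffer, so asking for one character usually pulls many bytes from the underlying stream; how many is an implementation detail.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadAheadDemo {

    /** Hypothetical stand-in for CountingInputStream: counts bytes handed to the caller. */
    static class ByteCounter extends FilterInputStream {
        long count = 0;

        ByteCounter(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) count++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    /** Reads exactly one character through a Reader and reports how many bytes the Reader pulled. */
    static long bytesPulledForOneChar(byte[] data) {
        ByteCounter counter = new ByteCounter(new ByteArrayInputStream(data));
        try (InputStreamReader reader = new InputStreamReader(counter, StandardCharsets.UTF_8)) {
            reader.read(); // only one character is requested here
            return counter.count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<root><a>1</a><b>2</b></root>".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes pulled after one char: " + bytesPulledForOneChar(xml));
    }
}
```

Because the counter sits below the decoder, it can only report how far the decoder has read ahead, not which byte the current character came from. Counting has to happen at the level where the bytes are actually interpreted.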
# 1

What's your exact goal... would you just like to transform the XML ?

DikkeDouwea at 2007-7-29 15:58:05 > top of Java-index,Core,Core APIs...
# 2

I want to be able to pull arbitrary elements (start tag to end tag) out of an XML document based on their XPath from the root element, without having to read the whole document (and, if possible, independently of the character encoding used). Therefore I want to parse the document once and create an index of the start and stop positions of all elements.

Basically I want to access and extract parts of the XML given an XPath expression.

I thought I could use RandomAccessFile to retrieve the data, but that is based on byte positions:

RandomAccessFile raf = new RandomAccessFile(file, "r");

raf.seek(startPos);

// readFully, unlike read, keeps reading until the whole range has arrived
raf.readFully(byteArray, 0, (int) (stopPos - startPos));

Hope that helps. I can explain in more detail if necessary...

Thanks!

Flossya at 2007-7-29 15:58:05
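
The retrieval step from the post above can be sketched as a small helper. The class name RangeReader, the method names, and the sample document are assumptions made for illustration; the range is treated as half-open, [startPos, stopPos).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RangeReader {

    /** Returns the bytes in [startPos, stopPos) of the given file. */
    static byte[] readRange(Path file, long startPos, long stopPos) {
        if (startPos < 0 || stopPos < startPos || stopPos - startPos > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("bad range: " + startPos + ".." + stopPos);
        }
        byte[] chunk = new byte[(int) (stopPos - startPos)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(startPos);
            raf.readFully(chunk); // throws EOFException instead of returning a short read
            return chunk;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a small sample file and pulls one element back out by its byte offsets. */
    static String demo() {
        try {
            Path file = Files.createTempFile("sample", ".xml");
            try {
                Files.write(file, "<root><a>1</a><b>2</b></root>".getBytes(StandardCharsets.UTF_8));
                // In this sample, <b> starts at byte 14 and </b> ends just before byte 22.
                return new String(readRange(file, 14, 22), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The bytes are only turned into a String after the range has been read, so the stored offsets stay pure byte positions and the decoding charset is chosen by the caller.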
# 3

getFilePointer()

read the API

masijade.a at 2007-7-29 15:58:05
# 4

Well, maybe I haven't expressed myself clearly:

I assume you mean getFilePointer() from RandomAccessFile?

I do know how to read from a specified position in the file with RandomAccessFile. After I have positioned the pointer with seek, I can check this position with getFilePointer... and I can read from that position, no problem!

BUT: My problem is getting the byte offsets of the XML tags in the first place. The positions I get from getLocation of the XMLStreamReader don't seem to be correct... and I would be happy if anybody could help me get the right ones.

Flossya at 2007-7-29 15:58:05
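
For reference, this is roughly how offsets are obtained from getLocation of an XMLStreamReader. It is a sketch only: the class and method names are made up, and it makes no claim about which position within a tag a given parser reports. As far as the javax.xml.stream.Location documentation goes, getCharacterOffset() may be a byte offset or a character offset depending on the input source, and may be -1 when no offset is available, so the values should be checked against the file before they are trusted as seek positions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxOffsets {

    /** Collects "name@offset" for every START_ELEMENT, using whatever offset the parser reports. */
    static List<String> startElementOffsets(byte[] xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // The Location is only valid until the next call to next().
                    int offset = reader.getLocation().getCharacterOffset();
                    out.add(reader.getLocalName() + "@" + offset);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] xml = "<root><a>1</a><b>2</b></root>".getBytes(StandardCharsets.UTF_8);
        // Compare the printed offsets with the known tag positions in the sample.
        System.out.println(startElementOffsets(xml));
    }
}
```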
# 5

Read the file character by character and, when you find a tag, record its position using getFilePointer(). What is there not to understand?

And what XMLStreamReader? I thought you were using RandomAccessFile.

Maybe you need to define for us, a little better, exactly what you are trying to do, and how you are trying to do it.

masijade.a at 2007-7-29 15:58:05
# 6

OK, here is a more detailed explanation:

I am dealing with very big XML files and since I don't need all the information in them, I would like to have random access to the XML elements, to pick the ones I am interested in (without loading/searching the whole document every time).

So, I want to create an index of the XML file with XPath expressions (from the root to the element of interest) as keys and byte positions as values. Therefore I need to parse the XML once and, whenever I find a start/end tag, record its byte position.

Once I have the index I want to use RandomAccessFile to retrieve an XML element, from its start to its end position as provided by the index.

The retrieval with RandomAccessFile works fine; it is the generation of the index that I have a problem with.

I could, as you suggested, query for the file encoding, read one byte or two according to the encoding, convert it into a character and process the whole file character by character, writing my own XML parser...

But since there are cleverer people out there than I am, and since XML parsing is quite a common task, I was trying not to re-invent the wheel and to find an existing parser instead (which is most probably more efficient and accurate than anything I could write myself).

I have come across XMLStreamReader, which can give me access to START_ELEMENT and END_ELEMENT XMLEvents and their byte positions. Unfortunately, this does not seem to work, since the XMLEvent positions change when the XML layout changes (e.g. formatted vs. unformatted).

Therefore (unless someone can tell me how to get the right positions from XMLStreamReader) I am looking for another solution, preferably one where I don't have to write my own parser (because I am not good at it).

However, if there is no such parser out there, or if someone has a better solution for retrieving XML chunks from a file given their XPath without loading/searching the whole file, I would be very happy to try alternatives.

Hope that helps.

Thanks

Flossya at 2007-7-29 15:58:05
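
The hand-rolled route mentioned above does not necessarily need any character decoding to find tag boundaries. A minimal sketch, under loud assumptions: the encoding is ASCII-compatible (UTF-8 or ISO-8859-x, not UTF-16), so the bytes for '<' and '>' never occur inside a multi-byte character; there are no comments, CDATA sections, or attribute values containing '<' or '>'; and the keys are plain /root/child paths rather than real XPath. TagOffsetIndexer is a hypothetical name, and this is not a replacement for a real parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class TagOffsetIndexer {

    /**
     * Maps a simple path such as /root/b to {start, end}: the byte offset of the '<'
     * of the start tag and the offset just past the '>' of the end tag.
     * Repeated siblings with the same path overwrite each other in this sketch.
     */
    static Map<String, long[]> index(byte[] xml) {
        Map<String, long[]> index = new LinkedHashMap<>();
        Deque<String> names = new ArrayDeque<>(); // currently open element names
        Deque<Long> starts = new ArrayDeque<>();  // start offsets of those elements
        int i = 0;
        while (i < xml.length) {
            if (xml[i] != '<') {
                i++;
                continue;
            }
            int close = i;
            while (close < xml.length && xml[close] != '>') close++;
            if (close == xml.length) break; // unterminated tag: give up
            String tag = new String(xml, i + 1, close - i - 1, StandardCharsets.UTF_8);
            if (tag.startsWith("?") || tag.startsWith("!")) {
                // XML declaration, processing instruction, DOCTYPE: skipped naively
            } else if (tag.startsWith("/")) {
                if (names.isEmpty()) break; // stray end tag: give up
                String path = pathOf(names);
                names.pop();
                index.put(path, new long[] { starts.pop(), close + 1 });
            } else {
                String name = tag.split("[\\s/]", 2)[0];
                names.push(name);
                if (tag.endsWith("/")) { // self-closing element such as <br/>
                    index.put(pathOf(names), new long[] { i, close + 1 });
                    names.pop();
                } else {
                    starts.push((long) i);
                }
            }
            i = close + 1;
        }
        return index;
    }

    /** Joins the open element names from the outermost to the innermost. */
    private static String pathOf(Deque<String> names) {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = names.descendingIterator(); // oldest push first
        while (it.hasNext()) sb.append('/').append(it.next());
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] xml = "<root><a>1</a><b>2</b></root>".getBytes(StandardCharsets.UTF_8);
        index(xml).forEach((path, range) ->
                System.out.println(path + " -> [" + range[0] + ", " + range[1] + ")"));
    }
}
```

Since the scanner works on the raw bytes, the offsets it records are byte offsets by construction and can be passed to RandomAccessFile.seek directly. For large files the same loop could run over a buffered stream with a running position counter instead of an in-memory array.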
# 7

> I have come across XMLStreamReader, which can give me
> access to START_ELEMENT and END_ELEMENT XMLEvents and
> their byte position. Unfortunately, this does not
> seem to work, since the XMLEvent positions change
> when the XML layout changes (e.g. formatted vs.
> unformatted).
>
> Therefore (unless someone can tell me how to get the
> right positions from XMLStreamReader) I am looking
> for another solution, preferably one where I don't
> have to write my own parser (because I am not good at
> it).

Am I missing something here? If you change the file format AFTER you index it then surely you need to re-index it!

Why don't you parse the XML using SAX and place the content in a database?

sabre150a at 2007-7-29 15:58:05
# 8

No, I don't change the file, I am talking about two different files with the same content.

By changing formats I meant doing the same thing with an altered file:

1. XML file (1), formatted: build the index, take the start and stop positions of element X from the index and read String (1) from start to stop of file (1).

2. XML file (2), same content but unformatted (one long line): build the index, take the start and stop positions of the same element X and read String (2) from start to stop of file (2).

Then I compare Strings (1) and (2), and they are not the same. The position of String (2) is shifted relative to String (1).

As for storing in a DB:

This is meant to be part of a time- and resource-critical application, and I am not interested in most of the data from the XML file and particularly not in storing it. I think it would be too much overhead and would defeat the purpose of the whole thing.

Flossya at 2007-7-29 15:58:05
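
A small sketch of the two-file comparison may help separate two effects. The sample documents and the class name LayoutShiftDemo are made up. The same element legitimately sits at different byte offsets in the formatted and the unformatted file, so each file needs its own index; what should match is the extracted text, provided each offset really points at the '<' of the start tag.

```java
import java.nio.charset.StandardCharsets;

class LayoutShiftDemo {

    /** Byte offset of the first occurrence of needle in haystack, or -1 if absent. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Byte offset of the given tag text within the UTF-8 encoding of the document. */
    static int offsetOf(String document, String tag) {
        return indexOf(document.getBytes(StandardCharsets.UTF_8),
                       tag.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String compact = "<root><a>1</a><b>2</b></root>";
        String formatted = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>\n";
        // The whitespace added by formatting pushes <b> further into the file.
        System.out.println("compact:   " + offsetOf(compact, "<b>"));
        System.out.println("formatted: " + offsetOf(formatted, "<b>"));
    }
}
```

If a slice read from a recorded offset does not begin with '<', the offset itself is suspect (for example a character offset, or a position at the end of the event rather than its start), independent of the layout.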