How best to manipulate a log file that is not well-formed
I have a text file that looks something like this:
[21/06/07] System DEBUG * BA_LOG_OUTPUT
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text <?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
[21/06/07] System DEBUG * BA_LOG_OUTPUT
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text <?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tony</to>
<from>James</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
[21/06/07] System DEBUG * BA_LOG_OUTPUT
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text <?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Amy</to>
<from>Tobi</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
I am trying to read the file so that the code picks out the well-formed XML and passes it to a SAX parser. For example, it should pick out the following before handing it to the SAX parser:
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tony</to>
<from>James</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Amy</to>
<from>Tobi</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
There is a pattern in the file: the line
<?xml version="1.0" encoding="UTF-8"?>
always starts on the 4th line after the BA_LOG_OUTPUT line.
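// Assumes br is a BufferedReader field opened on the log file, and that
// java.util.regex.Pattern and java.util.regex.Matcher are imported.
public void readFile() {
    String line = null;
    Pattern p = Pattern.compile("BA_LOG_OUTPUT");
    Pattern p1 = Pattern.compile("\\["); // "[" must be escaped in a regex
    try {
        while ((line = br.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                // The XML declaration is on the 4th line after the marker line.
                for (int i = 0; i < 4 && line != null; i++) {
                    line = br.readLine();
                }
                System.out.println(line);
            }
        }
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
}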
In the code above, I can find where I want to start reading, but how do I make the code read the rest of the XML, save it on a stack (which would eventually be passed to the SAX parser), and stop when it finds the pattern [?
In other words: find the next pattern in the file (BA_LOG_OUTPUT), read the whole XML, which starts 4 lines after the pattern, save it on the stack, stop reading when the pattern [ is found, and so on.
It should continue to do the same throughout the file.
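For the hand-off to SAX, here is a minimal sketch, assuming each XML document has already been isolated as a String that begins with the <?xml declaration. The class name AnyTagSaxDemo and the method elementNames are made up for illustration. The handler only records whatever element names it encounters, so it does not need to know the tags in advance.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original code.
class AnyTagSaxDemo {

    /** Parses one XML document and returns its element names in document order. */
    static List<String> elementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // qName is the tag name exactly as written in the document.
                names.add(qName);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Parsing from a character stream, so the text is used as already decoded.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }
}
```

A real handler would probably also override characters() and endElement() to collect the text inside each element; that part is left out to keep the sketch short.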
[4470 byte] By [SDNJavaa] at [2007-11-27 8:39:43]

The solution may work, but the problem is a bit more complicated than that. The example I gave above is just an example. In the real log file, the tags are never the same, so you may have something that looks like this:
[21/06/07] System DEBUG * BA_LOG_OUTPUT
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text <?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
[21/06/07] System DEBUG * BA_LOG_OUTPUT
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text <?xml version="1.0" encoding="UTF-8"?>
<partone>
<todiff>Tony</todiff>
<from>James</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</partone>
[21/06/07] System DEBUG * BA_LOG_OUTPUT
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text
System DEBUG Random text Random text <?xml version="1.0" encoding="UTF-8"?>
<anothertag>
<anothername>Amy</anothername>
<from>Tobi</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</anothertag >
In this case, I cannot use the loop
while(!(line = in.readLine()).equals("</note>")) {
System.out.println(line);
}
How can I make this work?
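One possible way around the closing-tag problem is to ignore the tag names entirely and end each block at the next line that begins with [. The sketch below assumes that every log entry starts with such a line (as in the samples above) and that no line inside an XML block starts with [. The class name XmlBlockExtractor and the method extract are hypothetical, and a plain list stands in for the stack.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original code.
class XmlBlockExtractor {

    /** Collects every XML document found in the log, one string per document. */
    static List<String> extract(BufferedReader br) throws IOException {
        List<String> documents = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an XML block
        String line;
        while ((line = br.readLine()) != null) {
            if (current != null && line.startsWith("[")) {
                // A new log entry begins, so the current XML block is finished.
                documents.add(current.toString());
                current = null;
            }
            if (current == null) {
                int start = line.indexOf("<?xml");
                if (start >= 0) {
                    // Drop the log prefix in front of the XML declaration.
                    current = new StringBuilder(line.substring(start)).append('\n');
                }
            } else {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            documents.add(current.toString()); // the last block ends at end of file
        }
        return documents;
    }
}
```

Each returned string starts at the <?xml declaration, so it could be wrapped in a StringReader and handed to a SAX parser one document at a time. Searching for "<?xml" rather than counting four lines is a design choice of this sketch; it keeps working if the number of DEBUG lines varies.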
Hey SDN,
For this type of parsing work I suggest you use JavaCC. It will take you a little time to learn, but believe me, the flexibility you get is awesome...
Here's the link to the home site... https://javacc.dev.java.net/
And here is a good tutorial on it... http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-cooltools.html
In JavaCC you declare the pattern of the text file that contains your data in a file with a .jj extension. When you run the JavaCC tool on that .jj file, it generates the required Java code as normal Java source files. Then you compile the source files and run the parser.
For example,
TOKEN: {< START_OF_START_TAG: "<" >}
TOKEN: {< START_OF_END_TAG: "</" >}
TOKEN: {< END_OF_TAG: ">" >}
These lines say that a start tag starts with "<", an end tag starts with "</", and both end with ">". Now you can use the symbols START_OF_START_TAG, START_OF_END_TAG and END_OF_TAG to declare that anything between tags is data.
Or, in another way, which will not be as dynamic as the one above, you can say
TOKEN: {< TO_START: "<to>" >}
TOKEN: {< TO_END: "</to>" >}
Here you don't have to bother about the differences between <to>, <To>, <tO> or <TO>... JavaCC will treat all of them the same if you just declare the token to be case insensitive...
I am still learning it, so I will try to help you as much as I can. But it's a great tool for parsing text files... you just have to learn it...
regards
i_virus