How do you ignore text that is not in tags when parsing with SAX?

I want to parse log files with a SAX parser. Each log file starts with plain text that is not inside XML tags, then contains XML, and then ends with more plain text outside any tags.

How do I make my program ignore the plain text outside the XML tags and parse only the XML?

This is my code:

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;

public class LogParser extends DefaultHandler
{
    public static void main(String argv[])
    {
        if (argv.length != 1) {
            System.err.println("Usage: cmd filename");
            System.exit(1);
        }
        // Use an instance of ourselves as the SAX event handler
        DefaultHandler handler = new LogParser();
        // Use the default (non-validating) parser
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // Set up output stream
            out = new OutputStreamWriter(System.out, "UTF8");
            // Parse the input
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse(new File(argv[0]), handler);
        } catch (Throwable t) {
            t.printStackTrace();
        }
        System.exit(0);
    }

    static private Writer out;
    private String indentString = ""; // Amount to indent
    private int indentLevel = 0;

    //===========================================================
    // SAX DocumentHandler methods
    //===========================================================

    public void startDocument()
    throws SAXException
    {
        nl();
        nl();
        emit("START DOCUMENT");
        nl();
        emit("<?xml version='1.0' encoding='UTF-8'?>");
    }

    public void endDocument()
    throws SAXException
    {
        nl(); emit("END DOCUMENT");
        try {
            nl();
            out.flush();
        } catch (IOException e) {
            throw new SAXException("I/O error", e);
        }
    }

    public void startElement(String namespaceURI,
                             String lName, // local name
                             String qName, // qualified name
                             Attributes attrs)
    throws SAXException
    {
        indentLevel++;
        nl(); emit("ELEMENT: ");
        String eName = lName; // element name
        if ("".equals(eName)) eName = qName; // namespaceAware = false
        emit("<" + eName);
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                String aName = attrs.getLocalName(i); // Attr name
                if ("".equals(aName)) aName = attrs.getQName(i);
                nl();
                emit("ATTR: ");
                emit(aName);
                emit("\t\"");
                emit(attrs.getValue(i));
                emit("\"");
            }
            // Kept inside the null check so a null attrs cannot cause a NullPointerException
            if (attrs.getLength() > 0) nl();
        }
        emit(">");
    }

    public void endElement(String namespaceURI,
                           String sName, // simple name
                           String qName  // qualified name
                          )
    throws SAXException
    {
        nl();
        emit("END_ELM: ");
        // The simple name is empty when the parser is not namespace-aware
        String eName = "".equals(sName) ? qName : sName;
        emit("</" + eName + ">");
        indentLevel--;
    }

    public void characters(char buf[], int offset, int len)
    throws SAXException
    {
        nl(); emit("CHARS:");
        String s = new String(buf, offset, len);
        if (!s.trim().equals("")) emit(s);
    }

    //===========================================================
    // Utility Methods ...
    //===========================================================

    // Wrap I/O exceptions in SAX exceptions, to
    // suit handler signature requirements
    private void emit(String s)
    throws SAXException
    {
        try {
            out.write(s);
            out.flush();
        } catch (IOException e) {
            throw new SAXException("I/O error", e);
        }
    }

    // Start a new line
    // and indent the next line appropriately
    private void nl()
    throws SAXException
    {
        String lineEnd = System.getProperty("line.separator");
        try {
            out.write(lineEnd);
            for (int i = 0; i < indentLevel; i++) out.write(indentString);
        } catch (IOException e) {
            throw new SAXException("I/O error", e);
        }
    }
}

When I run the code with my log file I get the following error message:

org.xml.sax.SAXParseException: Content is not allowed in prolog.

at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)

at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(Unknown Source)

at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)

at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)

at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(Unknown Source)

[9482 byte] By [SDNJavaa] at [2007-11-27 7:52:27]
# 1
You write your own code that skips over non-XML content before passing the XML content to the parser. Or you work on fixing the process that produces that log file so that it produces a well-formed XML document.
DrClapa at 2007-7-12 19:33:41 > top of Java-index,Enterprise & Remote Computing,Enterprise Technologies...
# 2
org.xml.sax.SAXParseException: Content is not allowed in prolog

Does the prolog have content? Remove empty lines and spaces before the XML declaration.
dvohra09a at 2007-7-12 19:33:41
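To make the cause of the error concrete, here is a minimal sketch that feeds the same small document to the JDK's default SAX parser three ways: on its own, with text in front of it, and with text after it. The class name `PrologDemo` and the helper `parseError` are made up for this illustration, and the exact wording of the messages depends on the parser implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original program.
class PrologDemo {

    // Returns null if the text parses as XML, otherwise the parser's error message.
    static String parseError(String text) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Well-formedness errors such as stray text outside the root element land here
            return e.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><note><to>Tove</to></note>";
        System.out.println("clean:         " + parseError(xml));
        System.out.println("leading text:  " + parseError("Some random text here\n" + xml));
        System.out.println("trailing text: " + parseError(xml + "\nmore random text"));
    }
}
```

Only the first call should come back without an error. Anything other than the document itself, whether before or after it, makes the input ill-formed, so the non-XML text has to be removed before the parser sees it.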
# 3

> You write your own code that skips over non-XML

> content before passing the XML content to the parser.

> Or you work on fixing the process that produces that

> log file so that it produces a well-formed XML

> document.

Could you show an example of code that skips over non-XML content? Bear in mind that the file contains a lot of non-XML content, and it can appear both before and after the XML content.

Can anybody show example code for how to do this?

SDNJavaa at 2007-7-12 19:33:41
# 4
It's going to be difficult. It's hard to pick out an XML document that's embedded in random text, so the code would have to depend on what you expected to see before the XML. That's why I suggested fixing the process rather than trying to deal with badly-structured data.
DrClapa at 2007-7-12 19:33:41
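As a rough illustration of code that depends on what you expect to see, the sketch below assumes two things that the thread does not guarantee: the XML always begins with an `<?xml` declaration, and the name of the root element is known in advance. The class `XmlSlice` and the method `extractXml` are hypothetical names chosen for the example.

```java
// Hypothetical helper; the class and method names are illustrative only.
class XmlSlice {

    /**
     * Returns the text from the first "<?xml" through the first
     * closing tag of rootName that follows it, or null if either
     * marker is missing.
     */
    static String extractXml(String log, String rootName) {
        int start = log.indexOf("<?xml");
        if (start < 0) return null;
        String closeTag = "</" + rootName + ">";
        int end = log.indexOf(closeTag, start);
        if (end < 0) return null;
        return log.substring(start, end + closeTag.length());
    }

    public static void main(String[] args) {
        String log = "Some random text here\n"
                + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<note><to>Tove</to></note>\n"
                + "And here some random text }[##~~\n";
        System.out.println(extractXml(log, "note"));
    }
}
```

The extracted string can then be handed to the parser through `new InputSource(new StringReader(xml))` instead of the `File`. This breaks as soon as the assumptions do, for example if the root element contains a nested element of the same name or the surrounding text happens to contain `<?xml`.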
# 5

The system that produces the log files is out of my control. I can only work with what it produces.

The log files are different every time, so I don't know where the XML document will start and finish.

Say, for example, I have a log file that looks like this:

Some random text here

Again, some random text with different character []()***

And yet again )_==8765 some &^?random text-**

<?xml version="1.0" encoding="UTF-8"?>

<note>

<to>Tove</to>

<from>Jani</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note>

And here some random text with different characters }[##~~

How can I pick out the XML document embedded in the badly-structured data, i.e. just this part?

<?xml version="1.0" encoding="UTF-8"?>

<note>

<to>Tove</to>

<from>Jani</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note>

SDNJavaa at 2007-7-12 19:33:41
# 6
Any idea anybody?
SDNJavaa at 2007-7-12 19:33:41
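One possible approach for a log like the sample above, offered as a sketch rather than a robust solution: scan the text for an `<?xml ...?>` declaration followed by a start tag, take that tag's name as the root element, cut through the first matching closing tag, and pass each cut-out piece to the SAX parser. It assumes every embedded document has an XML declaration, that the root element is not self-closing and does not contain a nested element with the same name, that the file is UTF-8, and that it fits in memory. `EmbeddedXmlScanner`, `findDocuments` and `elementNames` are names invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; a sketch under the assumptions stated above.
class EmbeddedXmlScanner {

    // XML declaration, optional whitespace, then the root start tag (group 1 = root name)
    private static final Pattern DOC_START =
            Pattern.compile("<\\?xml[^>]*\\?>\\s*<([A-Za-z_][\\w.:-]*)");

    // Cuts out every "<?xml ...?> <root> ... </root>" span found in the log text.
    static List<String> findDocuments(String log) {
        List<String> docs = new ArrayList<>();
        Matcher m = DOC_START.matcher(log);
        int from = 0;
        while (m.find(from)) {
            String closeTag = "</" + m.group(1) + ">";
            int end = log.indexOf(closeTag, m.end());
            if (end < 0) {
                from = m.end();          // no closing tag: skip this declaration
                continue;
            }
            end += closeTag.length();
            docs.add(log.substring(m.start(), end));
            from = end;                  // keep scanning after this document
        }
        return docs;
    }

    // Parses one extracted document with SAX and records the element names it sees.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    names.add(qName);
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Extracted text was not well-formed XML", e);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumes the log file is UTF-8 and small enough to read in one go
        String log = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String doc : findDocuments(log)) {
            System.out.println(elementNames(doc));
        }
    }
}
```

In the original program, the anonymous handler would be replaced by the `LogParser` instance. The regular expression only locates where a document probably starts and ends. The SAX parser still does the real parsing, so a bad cut shows up as a parse error rather than as silently wrong output.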