Encoding problems with DOM

First of all, to explain the error: I am writing an application that receives an XML message as a String. I need to parse it either to extract information from a CDATA section or to extract an error message from <Error> tags if they exist. I will have no use for the XML message after this step; I will just pass the extracted error message or data on to another method. I tried using the following test XML message:

String sResponse ="<?xml version=\"1.0\" encoding=\"utf-16\"?>" +

"<Response xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" "+

"xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">" +

"<Errors xmlns=\"\">\n<Error>\n<Number>10009</Number>\n<Severity>Error</Severity>\n<Message>Virus detected.</Message>\n" +

"<Details><string>Virus detected in the message content; request terminated.</string>\n" +

"<string>Virus Name = HTML_TEST_VIRUS</string>\n<string>Offset = 0</string>\n"+

"</Details>\n</Error>\n</Errors>\n</Response>";

I then use the following code to begin parsing the string (this is the only way I could figure out how to get an XML string to the parse method, so if someone knows a better way I'm open to suggestions, but this is not my main question):

ByteArrayInputStream baInput =new ByteArrayInputStream(sResponse.getBytes());

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

DocumentBuilder builder = factory.newDocumentBuilder();

Document document = builder.parse(baInput);

This code with this test message produces the following exception at the builder.parse line:

[Fatal Error] :1:40: Content is not allowed in prolog.

org.xml.sax.SAXParseException: Content is not allowed in prolog.

at ...

After some Google searching, it looks like this error is often thrown when there are encoding problems. If I change the message in sResponse to use "utf-8" OR if I remove the prolog entirely (remove "<?xml ... ?>") it works.

The problem is, I can't guarantee whether the client will send me a message with utf-8 or utf-16, and some of the sample messages I have from the client don't even have a prolog. So I thought the easiest solution is to simply remove the prolog before parsing it. Is there a way I can remove the prolog from the XML message using DOM? Or will I have to use basic string parsing and hope the XML is well-formed? OR will removing the prolog break everything and if so, is there a better way to handle this error?

Thanks for reading all my ramblings, I hope that made sense!

By [geek_chicka] at [2007-11-26 19:05:59]
# 1

Don't mess about converting strings to bytes unless you use the correct encoding. I could show you what you are doing wrong, but it's much more straightforward to pass the string directly to the parser so it doesn't have to be converted to bytes and back to chars again:

Document document = builder.parse(new InputSource(new StringReader(sResponse)));
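For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the StringReader approach. The class name ParseXmlString and the helper firstErrorMessage are made up for illustration, and the test message is a shortened version of the one in the question. Because the parser is given characters instead of bytes, the encoding named in the declaration should not come into play.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
public class ParseXmlString {

    // Parses XML that is already held in a String. The parser receives
    // characters, so no byte-to-char decoding is performed on its side.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Hypothetical helper: returns the text of the first <Message> element,
    // or null if the document contains none.
    public static String firstErrorMessage(Document doc) {
        NodeList messages = doc.getElementsByTagName("Message");
        return messages.getLength() == 0 ? null : messages.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the test message from the question.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
                + "<Response><Errors><Error>"
                + "<Message>Virus detected.</Message>"
                + "</Error></Errors></Response>";
        System.out.println(firstErrorMessage(parse(xml)));
    }
}
```

The same parse method can be used whether or not the message has a prolog, so there should be no need to strip the declaration by hand.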

DrClapa at 2007-7-9 20:56:46 > top of Java-index,Enterprise & Remote Computing,Enterprise Technologies...
# 2

Thanks for the suggestion! However, that makes me wonder whether I am correctly converting between bytes and Strings in other parts of my code.

I originally get the message from an MQ queue in the form of either a TextMessage or a BytesMessage. In either case, I store the message in a String variable. If it's originally a BytesMessage, I use BytesMessage.readBytes() to store the bytes in a byte array, then create a string with "new String(bytesArray)".

This string gets passed through various methods, since there are several layers that we have to parse through and process (the XML in my original post is only one step of the process). There are a few times when we have to send the message to a method that takes a byte array parameter (such as some methods for doing encryption), and I always just use the String.getBytes() method to do the conversion. Is there a problem with this since we don't know exactly what's in the String -- is there a better way to convert between Strings and byte arrays? Thanks in advance!

geek_chicka at 2007-7-9 20:56:46
# 3

> Thanks for the suggestion! However, that makes me wonder whether I am correctly converting between bytes and Strings in other parts of my code.

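Well, no, you aren't. In your original example, your XML document claimed it was encoded in UTF-16. But when you converted the String to bytes, you used the version of getBytes() that uses the system's default encoding, which is not UTF-16 or anything close to it. The parser believes the declaration and tries to decode everything after it as UTF-16, which would explain why the error is reported at column 40: the declaration itself is 39 characters long.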
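To make the mismatch concrete, here is a small sketch that encodes the same declared-as-UTF-16 document two ways. The class name and the parses helper are invented for this example, and ISO-8859-1 is used only as a stand-in for "some single-byte default encoding"; your platform default may be something else.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this illustration.
public class DeclaredVersusActualEncoding {

    public static final String XML = "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
            + "<Response><Message>Virus detected.</Message></Response>";

    // Hypothetical helper: true if the parser accepts the bytes,
    // false if it rejects them or cannot decode them.
    public static boolean parses(byte[] bytes) throws Exception {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // What the no-argument getBytes() would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Stand-in for a single-byte default encoding (an assumption).
        byte[] singleByte = XML.getBytes(StandardCharsets.ISO_8859_1);
        // Bytes that actually match the declaration.
        byte[] utf16 = XML.getBytes(StandardCharsets.UTF_16);

        System.out.println(singleByte.length + " bytes vs " + utf16.length + " bytes");
        System.out.println("UTF-16 bytes parse: " + parses(utf16));
        // Expected to be rejected, as with the exception in the original post.
        System.out.println("single-byte bytes parse: " + parses(singleByte));
    }
}
```

The point of the sketch is only that the bytes and the declaration have to agree; passing an explicit charset to getBytes is one way to guarantee that.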

Your best bet is to never convert the bytes to String data unless there's a reason to do so. And it sounds like you don't have any particular reason to do that conversion.
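As a rough sketch of what "leave the bytes alone" could look like: the parser can be given the raw payload directly and left to work out the encoding from the byte order mark or the declaration. The class and method names below are invented for illustration, and the byte arrays stand in for whatever BytesMessage.readBytes() delivered. The second helper shows why pushing arbitrary binary data (encrypted content, for example) through a String is risky even with an explicit charset.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical class name, used only for this illustration.
public class KeepBytesAsBytes {

    // Hands the raw payload to the parser, which detects the encoding
    // from the byte order mark and/or the XML declaration.
    public static Document parse(byte[] payload) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(payload));
    }

    // True if the bytes come back unchanged after a trip through a String.
    public static boolean survivesStringRoundTrip(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        return Arrays.equals(data, asText.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        String body = "<Response><Message>ok</Message></Response>";
        byte[] utf16 = ("<?xml version=\"1.0\" encoding=\"utf-16\"?>" + body)
                .getBytes(StandardCharsets.UTF_16);
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"utf-8\"?>" + body)
                .getBytes(StandardCharsets.UTF_8);
        byte[] noProlog = body.getBytes(StandardCharsets.UTF_8);

        // All three payloads go through the same code path.
        for (byte[] payload : new byte[][] {utf16, utf8, noProlog}) {
            System.out.println(parse(payload).getDocumentElement().getTagName());
        }

        // Bytes that are not valid UTF-8 are replaced during decoding,
        // so the round trip does not return the original array.
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};
        System.out.println(survivesStringRoundTrip(binary));
    }
}
```

If a String is genuinely needed at some step, naming the charset on both new String(bytes, charset) and getBytes(charset) at least keeps the two conversions consistent.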

DrClapa at 2007-7-9 20:56:46