UTF-8 encoded char array?

Hi,

I have to save a string with UTF-8 encoding, using the SAX handler method ContentHandler.characters(). This method takes a char array as a parameter. If I have a string and get the characters from it using:

str.toCharArray()

this won't return a UTF-8 encoded char array, will it? And how do I get a UTF-8 encoded char array?

thanks,

J

[370 byte] By [justinlawlera] at [2007-11-27 11:57:21]
# 1

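No, it won't. But that's because there is no such thing as a "UTF-8 encoded char array". A Java char is a 16-bit UTF-16 code unit, not a byte in some encoding; encodings such as UTF-8 only come into play when chars are converted to bytes. You can get a UTF-8 encoded byte array by using str.getBytes("UTF-8").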
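To make the difference concrete, here is a minimal sketch comparing the two calls. The class name CharsVersusBytes and the sample text are made up for illustration, and StandardCharsets assumes Java 7 or later (on older versions str.getBytes("UTF-8") does the same job).

```java
import java.nio.charset.StandardCharsets;

class CharsVersusBytes {
    // Returns the UTF-8 encoding of the given text as a byte array.
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String str = "caf\u00e9"; // "café"; the escape keeps this source file ASCII-only
        char[] chars = str.toCharArray(); // 4 UTF-16 code units, no byte encoding involved
        byte[] bytes = utf8Bytes(str);    // 5 bytes: U+00E9 becomes 0xC3 0xA9 in UTF-8
        System.out.println(chars.length + " chars, " + bytes.length + " bytes");
        // prints "4 chars, 5 bytes"
    }
}
```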

However I think your requirements are fouled up. If you're calling that characters() method, then just pass it a String. Whatever serializer is being driven by the ContentHandler will take care of the chars-to-bytes conversion. At least that's what I think based on the details you posted. Would you like to post the background that made you come up with that requirement?
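Strictly speaking, characters() is declared as characters(char[] ch, int start, int length), so "passing a String" means passing its char array plus an offset and a length. A small sketch of that, where emitText and roundTrip are hypothetical helper names and the recording handler only stands in for a real serializer:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class CharactersCall {
    // Hypothetical helper: forwards a String to characters(char[], int, int).
    // No encoding step happens here; the chars are handed over unchanged.
    static void emitText(ContentHandler handler, String text) throws SAXException {
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
    }

    // Runs emitText against a handler that only records what it receives.
    static String roundTrip(String text) throws SAXException {
        final StringBuilder seen = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        };
        emitText(recorder, text);
        return seen.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(roundTrip("caf\u00e9").equals("caf\u00e9"));
    }
}
```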

DrClapa at 2007-7-29 19:13:18 > top of Java-index,Enterprise & Remote Computing,Enterprise Technologies...
# 2

Thanks for that. Getting a little confused with my character encodings...

Isn't there only a method on ContentHandler that takes a char array, though, and not a String?

We're just writing out an XML file, and somewhere along the line it's getting messed up, because when we read it back in, it throws an exception:

org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.

at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

That's why I was thinking it was messing up when it was writing out. It's the exact same file, so it's weird that it writes out OK but throws an error when it's read back in.

thanks,

J

justinlawlera at 2007-7-29 19:13:18
# 3

> That's why I was thinking it was messing up when it was writing out.

Yes, that is a correct diagnosis of the problem. But you're looking in the wrong place to fix it. You need to look at a higher level. For example, it's quite possible that you are writing to a FileWriter, which uses your system's default encoding, without declaring that encoding in the output document's prolog. A parser that finds no encoding declaration assumes UTF-8, and that would cause the error message you posted.
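The mismatch can be reproduced in a few lines. This sketch assumes ISO-8859-1 as a stand-in for the "system default" (the thread doesn't say what the actual default was), and the class name, element name and sample text are made up. In ISO-8859-1 the é is the single byte 0xE9, which a UTF-8 decoder reads as the start of a 3-byte sequence, so the byte that follows is rejected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EncodingMismatch {
    // No encoding declaration in the prolog, so a parser has to assume UTF-8.
    static final String XML = "<?xml version=\"1.0\"?><word>caf\u00e9</word>";

    // Returns true if the bytes parse as XML, false on a parse or decoding error.
    static boolean parses(byte[] document) throws ParserConfigurationException {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(document), new DefaultHandler());
            return true;
        } catch (SAXException | IOException e) {
            // Depending on the parser, malformed UTF-8 may surface as either type.
            return false;
        }
    }

    public static void main(String[] args) throws ParserConfigurationException {
        // Bytes really are UTF-8: matches what the parser assumes.
        System.out.println(parses(XML.getBytes(StandardCharsets.UTF_8)));
        // Bytes are ISO-8859-1 but nothing in the prolog says so.
        System.out.println(parses(XML.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Declaring the real encoding in the prolog, or writing real UTF-8 bytes, both remove the mismatch.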

So have a look at the way you write the output document. Use OutputStream in preference to Writer here so that your serializer (whatever's writing the document) has a chance of writing with the correct encoding. Post the relevant code if you still need help.
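Since the original code wasn't posted, here is only a sketch of one way to set this up, assuming the JAXP TransformerHandler is used as the serializer (the thread doesn't say which serializer is actually in use). The class name Utf8Serializer and the element name "word" are made up; in real code the ByteArrayOutputStream would be a FileOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class Utf8Serializer {
    // Writes <word>text</word> to the stream. The serializer, not the caller,
    // converts chars to bytes, so the prolog and the bytes stay in agreement.
    static void write(OutputStream out, String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out)); // an OutputStream, not a Writer

        char[] chars = text.toCharArray();
        handler.startDocument();
        handler.startElement("", "word", "word", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "word", "word");
        handler.endDocument();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(out, "caf\u00e9");
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

With a Writer the serializer can't control how chars become bytes, which is why the OutputStream form is the safer choice here.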

DrClapa at 2007-7-29 19:13:18