Effective Unicode Handling

Hello all,

I am reading a data file that returns a string -> \u6210 . I want to convert this string into its UTF-8 encoding.

"\u6210".getBytes("UTF-8");

This is the normal way to do it, I believe.

But when I try this with the string from the input data file, the '\u' Unicode escape sequence is read as ordinary characters and not as an escape sequence. Is there any way to force the compiler to treat '\u' as an escape sequence?
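To make the difference concrete, here is a minimal sketch comparing a compiled literal with the text that arrives from a file. The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later (on older versions, getBytes("UTF-8") as in the line above does the same job).

```java
import java.nio.charset.StandardCharsets;

class EscapeVersusText {
    // Hex dump of the UTF-8 encoding of s, one two-digit pair per byte.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In source code, javac translates the escape before the program runs,
        // so this string holds a single character.
        String compiled = "\u6210";
        // Doubling the backslash reproduces what arrives from a data file:
        // six separate characters (backslash, u, 6, 2, 1, 0).
        String fromFile = "\\u6210";

        System.out.println(compiled.length() + " -> " + utf8Hex(compiled)); // 1 -> E6 88 90
        System.out.println(fromFile.length() + " -> " + utf8Hex(fromFile)); // 6 -> 5C 75 36 32 31 30
    }
}
```

The second line of output is the "wrong" encoding described above: getBytes is faithfully encoding six ASCII characters, because no escape processing happens at run time.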

Thanks for all the help

[493 byte] By [irwin74] at [2007-9-26 1:30:19]
# 1
Hi,

Can you give more specific details about what you are describing in this line?

>>But when i try this from the input datafile<<

Regards,

Joe
joefk at 2007-6-29 1:28:13 > top of Java-index,Desktop,I18N...
# 2
I used a JAXP parser to read in an XML document, which returns a string: \u6210

A sample of my data file is as follows:

<Firstname>\u6210</Firstname>

Thanks
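For reference, a small sketch of this situation using the standard JAXP DOM API. The class and helper names are hypothetical and the XML is inlined as a string rather than read from the real data file. The second call is an assumption about what might help: it uses an XML numeric character reference, which the parser itself decodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlTextDemo {
    // Parses the XML and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The doubled backslash puts a literal backslash into the XML text,
        // matching the sample data file.
        String escaped = rootText("<Firstname>\\u6210</Firstname>");
        System.out.println(escaped.length()); // 6

        // XML has its own notation for a character by code point.
        String reference = rootText("<Firstname>&#x6210;</Firstname>");
        System.out.println(reference.length()); // 1
    }
}
```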
irwin74 at 2007-6-29 1:28:13
# 3
I still don't understand how the compiler comes into play (in your original message). Are you loading .java or .properties files into the java reader over the Internet?

Regards,

Joe
joefk at 2007-6-29 1:28:13
# 4
That is, is your program receiving the \u6210 string at compile time or at run time?
joefk at 2007-6-29 1:28:13
# 5

Well, my data file has the value \u6210 hardcoded as XML content.

What I am trying to do here is read this XML data file, which returns the string '\u6210' (without the quotes), and convert this string into the encoding I want.

I get back a String whose value is \u6210. When I call getBytes on this string with the encoding, the compiler treats '\u' as ordinary string characters too, and not as a Unicode escape sequence.

As a result, I get the wrong encoding.

Thanks

irwin74 at 2007-6-29 1:28:13
# 6

Ok, I'm only slightly familiar with JAXB, but here's what I think is going on:

Interpretation of a six-character escape sequence (e.g. \u6210) as a single Unicode character is built into the java reader in two places. If such a sequence appears in a *.java file, javac compiles it as a single Unicode character. If it appears in a *.properties file, it is interpreted as a single Unicode character when the program invokes ResourceBundle.getBundle at run time.
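The properties case can be checked with a short sketch. It calls java.util.Properties.load directly instead of going through ResourceBundle.getBundle, on the assumption that the same decoding applies; load(Reader) needs Java 6 or later. The class name and the "firstname" key are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class PropertiesEscapeDemo {
    // Loads .properties-formatted text and returns the value stored under key.
    static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read properties text", e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The doubled backslash means the properties text itself contains
        // the six characters backslash, u, 6, 2, 1, 0.
        String value = loadValue("firstname=\\u6210", "firstname");
        System.out.println(value.length()); // 1
    }
}
```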

However, to the best of my knowledge, this capability (converting a six-character escape sequence into a single Unicode character) is *not* exposed as a general-purpose string conversion anywhere in the Java 2 API. Therefore, when your program reads the XML data file using the unmarshal method, it will treat your sequence as a six-character string. If you want to store Unicode characters above 255 in your XML file, there would be two ways to do it: 1) does the unmarshal method have an optional encoding argument? (I don't have its documentation); 2) you could write your own method to decode the six-character escape sequence into a single Unicode character.
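Option 2 might look something like the sketch below. It is one possible approach, not a library method: the class and method names are hypothetical, it only recognizes a backslash followed by u and exactly four hex digits, and unlike javac it does not treat a doubled backslash specially.

```java
import java.nio.charset.StandardCharsets;

class UnicodeEscapeDecoder {
    // Replaces each backslash-u followed by exactly four hex digits with the
    // UTF-16 code unit it names; all other text is copied through unchanged.
    static String decode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            if (isEscapeAt(input, i)) {
                out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if a complete six-character escape starts at index i.
    static boolean isEscapeAt(String s, int i) {
        if (i + 6 > s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') {
            return false;
        }
        for (int j = i + 2; j < i + 6; j++) {
            if (Character.digit(s.charAt(j), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslash stands in for the six characters read from the file.
        String decoded = decode("\\u6210");
        byte[] utf8 = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(decoded.length() + " char, " + utf8.length + " bytes"); // 1 char, 3 bytes
    }
}
```

Calling decode on the string returned by the parser, and only then calling getBytes, should give the UTF-8 bytes the original poster was expecting.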

Regards,

Joe

P.S. post back if I'm not being clear.

joefk at 2007-6-29 1:28:13