UTF-8->Unicode conversion fails in ByteToCharConverter
My test code converts a Unicode string to UTF-8 and back. The Unicode->UTF-8 conversion seems to work, but UTF-8->Unicode fails. I copied the code from the String constructor that takes an encoding to see whether ByteToCharConverter was throwing an exception, and it was throwing a MalformedInputException. It chokes when it sees \u306e, and all I get are the characters up to that point.
Can anyone verify whether my conversion is correct? If it is, is this a ByteToCharConverter bug?
TEST CODE:
String rstField = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";
System.out.println("UNICODE: " + rstField);
rstField = new String(rstField.getBytes("UTF-8"));
System.out.println("UTF-8: " + rstField);
rstField = new String(rstField.getBytes(), "UTF-8");
System.out.println("UNICODE: " + rstField);
OUTPUT:
UNICODE: Fran ais 55 ? 1? 2? 3? 4? 5?
UTF-8: Fran ?ais 55? 1? 2 ? 3 ? 4 ?a 5 ? Y
UNICODE: Fran ais 55 ? 1? 2
rflahert at 2007-9-26 2:30:16

Hi Robert,
There's no bug. You are misusing the getBytes method. Here is what your code is doing:
rstField = new String(rstField.getBytes("UTF-8"));
This encodes rstField (a Unicode String) into a byte array as UTF-8, then decodes that byte array back into a String using the platform default encoding (presumably Cp1252). This is highly unlikely to produce anything useful: every UTF-8 byte is treated as a separate character, and any byte value that has no character in the default encoding is lost.
rstField = new String(rstField.getBytes(), "UTF-8");
This encodes rstField into a byte array using the platform default encoding (presumably Cp1252), then decodes that byte array back into a String as UTF-8. It can only undo the previous statement if every byte survived the first step unchanged.
regards,
Joe
joefk at 2007-6-29 9:48:51 >
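
To make the two statements above concrete, here is a minimal sketch that names the intermediate charset explicitly instead of relying on the platform default. The class and method names (RoundTripDemo, throughString, survives) are made up for illustration, windows-1252 stands in for the "presumably Cp1252" default, and the code assumes a JDK with java.nio.charset.StandardCharsets (Java 7 or later), not the JDK 1.3 from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Assumption: the platform default encoding is windows-1252 (Cp1252).
    static final Charset ASSUMED_DEFAULT = Charset.forName("windows-1252");

    /** Encodes text as UTF-8, decodes those bytes with the given charset, then encodes again. */
    static byte[] throughString(String text, Charset intermediate) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(utf8, intermediate); // one char per UTF-8 byte
        return misdecoded.getBytes(intermediate);
    }

    /** True if the UTF-8 bytes come back unchanged after the detour through a String. */
    static boolean survives(String text, Charset intermediate) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(utf8, throughString(text, intermediate));
    }

    public static void main(String[] args) {
        String withNo = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";
        String withoutNo = "Fran\u00e7ais 55 \uffe5 1\u6b21 3\u30da 4\u30fc 5\u30b8";
        System.out.println("with \\u306e survives:    " + survives(withNo, ASSUMED_DEFAULT));
        System.out.println("without \\u306e survives: " + survives(withoutNo, ASSUMED_DEFAULT));
    }
}
```

If the detour changes even one byte, the final decode is being asked to read something that is no longer the UTF-8 the first step produced, which would be consistent with the MalformedInputException reported in the question.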

The purpose of the code was to simulate what I was receiving as a request parameter from IE from a UTF-8 encoded form. Some people have changed the default encoding on the server, but I would prefer a programmatic solution. The string returned is exactly what comes from the first line of code:
rstField = new String(rstField.getBytes("UTF-8"));
Each Japanese character is encoded as 3 UTF-8 bytes, which end up in the String as 3 separate 16-bit characters. I simply want to combine them back into a single Unicode character with a UTF-8 conversion.
I modified my code to dump out the bytes. You'll see that a single byte is different when reversed. If I change my input string to:
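
One hedged way to do that recombination is sketched below. It is not the code from this thread: it swaps in ISO-8859-1 as the intermediate encoding, because that charset assigns a character to every byte value 0x00 to 0xFF and so carries raw bytes through a String without loss. It only applies under the assumption that the server really decoded the form bytes as ISO-8859-1; the helper name latin1ToUtf8 is hypothetical, and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class ParameterRecoding {
    /**
     * Hypothetical helper: takes a String in which each char holds one raw byte
     * (because UTF-8 data was decoded as ISO-8859-1) and re-reads those bytes as UTF-8.
     */
    static String latin1ToUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1); // chars back to the same bytes
        return new String(raw, StandardCharsets.UTF_8);                // bytes read as UTF-8
    }

    public static void main(String[] args) {
        String original = "2\u306e 3\u30da";
        // Simulate a server that decodes UTF-8 form data as ISO-8859-1 (an assumption).
        String asReceived = new String(original.getBytes(StandardCharsets.UTF_8),
                                       StandardCharsets.ISO_8859_1);
        System.out.println("chars as received: " + asReceived.length());
        System.out.println("recovered equals original: " + latin1ToUtf8(asReceived).equals(original));
    }
}
```

If the server decoded with some other charset, the same two-step idea applies only when that charset also maps every byte to a distinct character.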
String rstField = "Fran\u00e7ais 55 \uffe5 1\u6b21 3\u30da 4\u30fc 5\u30b8";
removing the "2\u306e ", the conversion reverses correctly, so I think my code is valid. Try out the code below and you'll see that the 25th byte is changed when reversed. Then remove the "2\u306e " and you'll see that it works. It's either ByteToCharConverter or CharToByteConverter.
String rstField;
byte[] raby;
rstField = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";
System.out.println("UNICODE: " + rstField);
System.out.print(" ");
raby = rstField.getBytes("UTF-8");
for (int iIndex = 0; iIndex < raby.length; iIndex++)
{
    System.out.print("" + raby[iIndex] + " ");
}
System.out.println("");
rstField = new String(rstField.getBytes("UTF-8"));
System.out.println(" UTF-8: " + rstField);
System.out.print(" ");
raby = rstField.getBytes();
for (int iIndex = 0; iIndex < raby.length; iIndex++)
{
    System.out.print("" + raby[iIndex] + " ");
}
System.out.println("");
rstField = new String(rstField.getBytes(), "UTF-8");
System.out.println("UNICODE: " + rstField);
FYI, I'm using JDK 1.3.0_01 for Windows.
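
The byte dump above prints signed decimals, which makes the changed byte hard to spot. A small sketch along these lines prints unsigned hex and reports the first position that differs. The names ByteDiff, firstDifference and toHex are illustrative, windows-1252 is assumed as the platform default, and the code targets Java 7 or later, not JDK 1.3.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteDiff {
    /** Returns the 0-based index of the first differing byte, or -1 if the arrays are identical. */
    static int firstDifference(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    /** Formats bytes as unsigned two-digit hex values separated by spaces. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset assumedDefault = Charset.forName("windows-1252"); // assumption, see above
        String original = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";
        byte[] before = original.getBytes(StandardCharsets.UTF_8);
        byte[] after = new String(before, assumedDefault).getBytes(assumedDefault);
        System.out.println("before: " + toHex(before));
        System.out.println("after:  " + toHex(after));
        System.out.println("first difference at index " + firstDifference(before, after));
    }
}
```

\u306e encodes to the UTF-8 bytes E3 81 AE, and the 81 is the 25th byte of this string (index 24). Under the windows-1252 assumption, 0x81 is one of the byte values with no assigned character, which would explain why only this character breaks the round trip.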
Hi Robert,
Try the following code on your string. This is the correct way to test the UTF8 transformation:
// requires: import java.io.UnsupportedEncodingException;
String string1 = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";
String string2 = null;
byte[] bytes = null;
try {
    bytes = string1.getBytes("UTF8");    // encode String -> byte array
    string2 = new String(bytes, "UTF8"); // decode byte array -> String
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
// compare from string1 so a failed decode (string2 == null) cannot throw
if (string1.equals(string2)) {
    System.out.println("the strings are the same");
}
Regards,
Joe
joefk at 2007-6-29 9:48:51 >
