UTF-8->Unicode conversion fails in ByteToCharConverter

My test code converts a Unicode string to UTF-8 and back. The Unicode->UTF-8 conversion seems to work, but UTF-8->Unicode fails. I copied the code from the String constructor that takes an encoding to see whether ByteToCharConverter was throwing an exception, and it was: a MalformedInputException. It chokes on \u306e, and all I get back are the characters up to that point.

Can anyone verify whether my conversion is correct? If it is, is this a ByteToCharConverter bug?

TEST CODE:

String rstField = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";

System.out.println("UNICODE: " + rstField);

rstField = new String(rstField.getBytes("UTF-8"));

System.out.println("UTF-8: " + rstField);

rstField = new String(rstField.getBytes(), "UTF-8");

System.out.println("UNICODE: " + rstField);

OUTPUT:

UNICODE: Fran ais 55 ? 1? 2? 3? 4? 5?

UTF-8: Fran ?ais 55? 1? 2 ? 3 ? 4 ?a 5 ? Y

UNICODE: Fran ais 55 ? 1? 2

[998 byte] By [rflahert] at [2007-9-26 2:30:16]
# 1

Hi Robert,

There's no bug. You are misusing the getBytes method. Here is what your code is doing:

rstField = new String(rstField.getBytes("UTF-8"));

This encodes rstField (a Unicode String) into a byte array using UTF-8, and then decodes that byte array back into a Unicode String using the platform default encoding (presumably Cp1252). Because the two encodings differ, this is highly unlikely to produce anything useful.

rstField = new String(rstField.getBytes(), "UTF-8");

This encodes rstField (a Unicode String) into a byte array using the platform default encoding (presumably Cp1252), and then decodes that byte array back into a Unicode String using UTF-8.
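A minimal sketch of that mismatch, assuming Java 7 or later (for StandardCharsets) and assuming windows-1252 stands in for the platform default. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // Encode with one charset, then decode the resulting bytes with another.
    static String transcode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String original = "2\u306e";
        // Assumption: windows-1252 (Cp1252) plays the role of the platform default.
        Charset cp1252 = Charset.forName("windows-1252");

        // Same pattern as the first line of the test code:
        // UTF-8 bytes are read back as Cp1252 characters.
        String garbled = transcode(original, StandardCharsets.UTF_8, cp1252);
        System.out.println(garbled.length());          // 4: one char per UTF-8 byte
        System.out.println(garbled.equals(original));  // false

        // Using the same charset on both sides round-trips cleanly.
        String same = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(original));     // true
    }
}
```

The point of the sketch is only that the charset used to decode must match the one used to encode. It does not depend on what the real default encoding is.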

regards,

Joe

joefk at 2007-6-29 9:48:51
# 2

The purpose of the code was to simulate what I receive as a request parameter from IE when a UTF-8 encoded form is submitted. Some people have changed the default encoding on the server, but I would prefer a programmatic solution. The string I receive is exactly what the first line of code produces:

rstField = new String(rstField.getBytes("UTF-8"));

Each Japanese character is encoded as 3 UTF-8 bytes, and each of those bytes shows up in the String as a separate double-byte character. I simply want to combine them back into a single Unicode character by decoding them as UTF-8.
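One way this kind of recombination is commonly done, sketched under a loud assumption: the server decoded the UTF-8 bytes as ISO-8859-1, which maps every byte value to exactly one character and back. If the server used some other encoding (such as Cp1252), the same trick may lose bytes. The helper name recoverUtf8 is hypothetical, and StandardCharsets needs Java 7 or later; on older JDKs the string names "ISO-8859-1" and "UTF-8" with getBytes(String) and new String(byte[], String) do the same job.

```java
import java.nio.charset.StandardCharsets;

public class ParamRecovery {
    // Hypothetical helper. Assumes the input was produced by decoding
    // UTF-8 bytes as ISO-8859-1, so each char holds exactly one original byte.
    static String recoverUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1); // chars -> original bytes
        return new String(raw, StandardCharsets.UTF_8);                // bytes -> real text
    }

    public static void main(String[] args) {
        String original = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";

        // Simulate the request parameter: UTF-8 bytes misread as ISO-8859-1.
        String asReceived = new String(original.getBytes(StandardCharsets.UTF_8),
                                       StandardCharsets.ISO_8859_1);

        System.out.println(recoverUtf8(asReceived).equals(original)); // true
    }
}
```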

I modified my code to dump out the bytes. You'll see that a single byte differs after the round trip. If I change my input string to:

String rstField = "Fran\u00e7ais 55 \uffe5 1\u6b21 3\u30da 4\u30fc 5\u30b8";

removing the "2\u306e ", the conversion reverses correctly, so I think my code is valid. Try the code below and you'll see that the 25th byte changes on the round trip. Then remove the "2\u306e " and you'll see that it works. The problem is in either ByteToCharConverter or CharToByteConverter.

String rstField;

byte[] raby;

rstField = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";

System.out.println("UNICODE: " + rstField);

System.out.print(" ");

raby = rstField.getBytes("UTF-8");

for (int iIndex = 0; iIndex < raby.length; iIndex++)

{

System.out.print("" + raby[iIndex] + " ");

}

System.out.println("");

rstField = new String(rstField.getBytes("UTF-8"));

System.out.println(" UTF-8: " + rstField);

System.out.print(" ");

raby = rstField.getBytes();

for (int iIndex = 0; iIndex < raby.length; iIndex++)

{

System.out.print("" + raby[iIndex] + " ");

}

System.out.println("");

rstField = new String(rstField.getBytes(), "UTF-8");

System.out.println("UNICODE: " + rstField);
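The 25th byte can be pinned down with a hex dump. The sketch below (class and method names are illustrative, Java 7 or later assumed) shows that byte 25 of the UTF-8 encoding is 0x81, the middle byte of \u306e (E3 81 AE). As far as I know, 0x81 is one of the few byte values that Cp1252 leaves unassigned, which would explain why it does not survive a trip through the default encoding. Treat that as a likely explanation rather than a confirmed diagnosis.

```java
import java.nio.charset.StandardCharsets;

public class ByteDump {
    // Format a byte array as space-separated, two-digit uppercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(toHex(utf8));
        // Byte 25 counting from 1 is index 24 counting from 0.
        System.out.println(String.format("%02X", utf8[24] & 0xFF)); // 81
    }
}
```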

FYI, I'm using JDK 1.3.0_01 for Windows.

rflahert at 2007-6-29 9:48:51
# 3

Hi Robert,

Try the following code on your string. This is the correct way to test the UTF8 transformation:

// needs: import java.io.UnsupportedEncodingException;

String string1 = "Fran\u00e7ais 55 \uffe5 1\u6b21 2\u306e 3\u30da 4\u30fc 5\u30b8";
String string2 = null;
byte[] bytes = null;

try {
    bytes = string1.getBytes("UTF8");      // encode String -> byte array
    string2 = new String(bytes, "UTF8");   // decode byte array -> String
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

// string1 is never null, so compare from its side; string2 stays null if decoding failed
if (string1.equals(string2)) {
    System.out.println("the strings are the same");
}
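For anyone who wants to see the MalformedInputException from the original post directly, here is a hedged sketch using the java.nio.charset API (Java 7 or later assumed for StandardCharsets; the class name and the damaged byte sequence are made up for illustration). A decoder configured with CodingErrorAction.REPORT throws instead of silently substituting a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Decode UTF-8 strictly: malformed input raises an exception
    // rather than being replaced with U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] good = "2\u306e".getBytes(StandardCharsets.UTF_8);
        // Hypothetical damaged input: the middle byte of \u306e (0x81)
        // has been replaced by '?' (0x3F).
        byte[] bad = {0x32, (byte) 0xE3, 0x3F, (byte) 0xAE};

        try {
            System.out.println(decodeStrict(good).equals("2\u306e")); // true
            decodeStrict(bad); // throws, since E3 must be followed by a continuation byte
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```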

Regards,

Joe

joefk at 2007-6-29 9:48:51