Converting UTF-8 to Unicode

I'm struggling to convert Arabic characters encoded in UTF-8 to Unicode characters and display them in a JTextArea.

The conversion is performed as follows:

String str = "<arabic text in UTF-8>";

// getBytes() without an argument uses the platform default encoding
ByteArrayInputStream bai = new ByteArrayInputStream(str.getBytes());
InputStreamReader isr = new InputStreamReader(bai, "UTF-8");

// No encoding is given here, so the writer uses the platform default encoding
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream("fil.txt"));

int c;
while ((c = isr.read()) != -1) {
    writer.write((char) c);
}
writer.close();

After running this and opening the file "fil.txt" in Microsoft Word, I get Arabic text, but it is not in Unicode. When I select Unicode as the encoding in MS Word, it just shows strange (non-Arabic) characters. MS Word can show the Arabic text using CP1256 or Arabic ISO (8859-6), but not when Unicode is selected.

The problem is that I need the text in Unicode. I thought that going from a ByteArrayInputStream through an InputStreamReader would convert the byte array to characters using Unicode, but it seems it doesn't. Maybe I have made a simple rookie mistake.

Thanks in advance,

Andreas

[1265 byte] By [andreas_selenwalla] at [2007-10-3 2:46:31]
# 1

I think your problem starts with your definition of a String that does not contain Unicode characters. A Java string is in Unicode by definition, so specifying UTF-8 code units will mess things up.

I also see that you don't specify an encoding when you create the byte array. getBytes() without an argument uses the platform default encoding, so you would need to pass one explicitly if you ever have to do this (starting with a Unicode string).

Here are a couple of pages that should help:

http://java.sun.com/docs/books/tutorial/i18n/text/stream.html

http://java.sun.com/docs/books/tutorial/i18n/text/string.html
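
The point about naming the encoding explicitly can be sketched as follows. This is only an illustration: the class and method names are made up for the example, it assumes Java 7 or later for StandardCharsets, and the Arabic sample text is written with \u escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the original post.
final class ExplicitEncodingDemo {

    // Encodes text as UTF-8 regardless of the platform default encoding.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a Java String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four Arabic letters (seen, lam, alef, meem), written as escapes.
        String text = "\u0633\u0644\u0627\u0645";
        byte[] utf8 = toUtf8(text);

        // Each of these letters takes two bytes in UTF-8.
        System.out.println(utf8.length);                  // prints 8
        System.out.println(fromUtf8(utf8).equals(text));  // prints true
    }
}
```

With both directions naming UTF-8, the round trip gives the same result on every machine, which is not guaranteed when getBytes() is called without an argument.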

one_danea at 2007-7-14 20:35:15 > top of Java-index,Desktop,I18N...
# 3

OK, I made a small mistake in the explanation. The string 'str' is actually a byte array in which each byte is one UTF-8 code unit. What I have to work with is a string such as "0xb60xc30xb60x2e", and I wrote a method that converts this string to a byte array. So in the code from my posting, you can replace str.getBytes() with just a byte array: the array passed to ByteArrayInputStream(...) contains UTF-8 encoded bytes.

I do get some kind of conversion, but the result is in ISO 8859-6 and not Unicode.
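
A conversion method of the kind described here might look like the following sketch. The original method was not posted, so the name parseHexBytes and the assumption that every byte is written as exactly "0x" followed by two hex digits are illustrative only.

```java
// Hypothetical helper, not the code from the original post.
final class HexStringBytes {

    // Turns a string such as "0xb60xc30xb60x2e" into the bytes B6 C3 B6 2E.
    // Assumes each byte is spelled as "0x" plus exactly two hex digits.
    static byte[] parseHexBytes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        byte[] out = new byte[hex.length() / 4];
        for (int i = 0; i < out.length; i++) {
            int start = i * 4;
            if (!hex.startsWith("0x", start)) {
                throw new IllegalArgumentException("missing 0x at index " + start);
            }
            // Parse the two hex digits; the cast keeps only the low 8 bits.
            out[i] = (byte) Integer.parseInt(hex.substring(start + 2, start + 4), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = parseHexBytes("0xb60xc30xb60x2e");
        System.out.println(bytes.length); // prints 4
    }
}
```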

andreas_selenwalla at 2007-7-14 20:35:15
# 4

I don't understand why you are doing what you are doing, then. If you have a byte array in UTF-8 encoding, why don't you just create the correct String object from that array, using the encoding parameter?

Something like

String str = new String(utf8Bytes, "UTF8");

And when you say:

>> So the byte array that is passed to the ByteArrayInputStream(...) is a byte array containing UTF-8 encoded bytes.

I get really confused.

getBytes() is a method you call on a String object, not on a byte array.

I may be missing something obvious about your explanation, but it still seems to me that you are not using byte arrays and String objects correctly.
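
Putting the suggestion together, a minimal sketch could decode the UTF-8 bytes into a String and then name the output encoding explicitly when writing the file. It assumes that "Unicode" in MS Word's encoding dialog means UTF-16, which the thread does not confirm, and that Java 7 or later is available; the class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class, not taken from the original post.
final class Utf8ToUtf16 {

    // Decodes UTF-8 bytes to a String, then re-encodes the text as UTF-16.
    // Java's "UTF-16" encoder writes a big-endian byte-order mark (FE FF) first.
    static byte[] utf8ToUtf16(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16);
    }

    // Writes the re-encoded bytes to a file, so no default encoding is involved.
    static void writeAsUtf16(byte[] utf8Bytes, Path target) throws IOException {
        Files.write(target, utf8ToUtf16(utf8Bytes));
    }

    public static void main(String[] args) throws IOException {
        // D8 B3 is the UTF-8 encoding of the Arabic letter seen (U+0633).
        byte[] utf8Bytes = {(byte) 0xD8, (byte) 0xB3};
        writeAsUtf16(utf8Bytes, Paths.get("fil.txt"));
    }
}
```

The same effect can be had with the original stream classes by passing an encoding name as the second argument to the OutputStreamWriter constructor, which otherwise falls back to the platform default.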

one_danea at 2007-7-14 20:35:16
# 5

In my JSP page, after the user enters Chinese characters and then selects another drop-down, the page is refreshed.

After the refresh I cannot get the Chinese text back. All of it is changed to something like "?eA....".

Can anyone help me solve this problem?

Thanks

MNeelaa at 2007-7-14 20:35:16