Character conversion
Hello,
I'm working on a conversion problem between the Windows and Unicode representations of characters.
I would like to get, for instance for the euro character, the Windows encoding value (128) from its Unicode value (8364), and vice versa.
€ 8364 128 (unicode / windows)
‚ 8218 130
ƒ 402 131
„ 8222 132
… 8230 133
† 8224 134
...
I've found on the internet a way to get 128 from 8364 :
String s ="€";
byte b[] = s.getBytes();
int code = (int) (b[0] & 0xff);
By the way, if someone could explain to me how it works... ;)
Now I'm looking for a way to do the opposite: get 8364 from 128...
Thanks a lot in advance.
Bye!
[872 byte] By [kinski9a] at [2007-11-27 9:33:17]

There are, in fact, a number of character encodings used by Windows, depending on the locale settings of the machine. To find out which one your system uses, open a cmd window and type chcp (change code page). Called without arguments, it prints the current code page setting. For example, mine says:
Active code page: 437
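Java normally knows these encodings under names such as Cp437. Note that chcp reports the console (OEM) code page, which can differ from the ANSI code page (windows-1252 on Western European systems) that Java normally takes as its default.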
So, if that's the same as your machine, try
byte[] b = s.getBytes("cp437");
Whenever you convert between characters/Strings and bytes in the Java API (especially when reading or writing files and other streams), a character encoding is used. By default it is the one that the JVM thinks your system uses.
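As a rough sketch of what the getBytes snippet from the first post is doing, assuming the default encoding on that machine was windows-1252: the String holds the Unicode value 8364, getBytes encodes it into the single byte 0x80, and & 0xff turns Java's signed byte (-128) into the unsigned value 128. The class and method names below are made up for illustration, and the charset is named explicitly so the result does not depend on the machine.

```java
import java.nio.charset.Charset;

final class EncodeDemo {
    // Illustrative helper: Unicode char -> unsigned windows-1252 byte value.
    static int toWindows1252(char c) {
        byte[] b = String.valueOf(c).getBytes(Charset.forName("windows-1252"));
        // Java bytes are signed (-128..127); masking with 0xff gives 0..255.
        return b[0] & 0xff;
    }

    public static void main(String[] args) {
        System.out.println((int) '\u20AC');          // the char itself: 8364
        System.out.println(toWindows1252('\u20AC')); // its windows-1252 byte: 128
    }
}
```

A character that the target charset cannot represent is replaced by the charset's substitute byte (usually '?') when encoded this way, so this only round-trips for characters that windows-1252 actually contains.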
Java Strings are represented internally as UTF-16. While a String is in memory, that is its only possible encoding. The conversion to another encoding happens when you write the String out somewhere else. As long as the text stays in a Java String variable, there is no conversion you can apply to it.
I assume that you're trying to convert the Unicode chars to Windows-1252, right? You can use this code to write them to a file, for example.
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("outfilename"), "windows-1252"));
bw.write("€‚ƒ„…†");
bw.close();
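A variant of the snippet above, as a sketch rather than a drop-in replacement: it uses try-with-resources (Java 7 or later) so the writer is closed even if write() fails, and it writes into a ByteArrayOutputStream instead of a file so the resulting bytes can be inspected. The class and method names are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

final class WriteCp1252Demo {
    // Encodes text as windows-1252 onto any OutputStream
    // (a FileOutputStream would work the same way).
    static void writeCp1252(OutputStream out, String text) throws IOException {
        try (Writer w = new BufferedWriter(
                new OutputStreamWriter(out, Charset.forName("windows-1252")))) {
            w.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the file
        writeCp1252(sink, "\u20AC\u201A\u0192"); // euro, low single quote, florin
        for (byte b : sink.toByteArray()) {
            System.out.println(b & 0xff); // prints 128, 130, 131 on separate lines
        }
    }
}
```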
Thank you for your answers.
hunter90000, you're right, the "windows" charset is windows-1252. But what I'm looking for is a way to get 8364 from 128. If someone could help me understand the code below, I think I could find it:
String s = "€";
System.out.println((int) s.charAt(0)); // --> 128
byte b[] = s.getBytes();
int code = (int) (b[0] & 0xff);
System.out.println(code); // --> 8364
System.out.println((char) code); // --> €
The reason why I want to get 8364:
I'm building an XML file to send data to a web browser via an Ajax function. This data comes from an Oracle database, in which the euro character has the value 128.
The only way I've found to display the euro character correctly in the browser is to encode it as &#8364; in the XML file, even if the charset of this file and of the JSP is 'ISO-8859-15'...
The problem is not limited to the euro character; it affects all the characters in the following list:
(8364 . 128)
(8218 . 130)
(402 . 131)
(8222 . 132)
(8230 . 133)
(8224 . 134)
(8225 . 135)
(710 . 136)
(8240 . 137)
(352 . 138)
(8249 . 139)
(338 . 140)
(381 . 142)
(8216 . 145)
(8217 . 146)
(8220 . 147)
(8221 . 148)
(8226 . 149)
(8211 . 150)
(8212 . 151)
(732 . 152)
(8482 . 153)
(353 . 154)
(8250 . 155)
(339 . 156)
(382 . 158)
(376 . 159)
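A table like the one above does not have to be typed in by hand. As a sketch (the class name is made up), it can be produced by decoding each byte value from 128 to 159 as windows-1252. Five byte values in that range (129, 141, 143, 144, 157) have no character assigned; depending on the decoder they come back as a replacement character or as a control character with the same number, so the loop skips both cases.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class Cp1252Table {
    // Maps windows-1252 byte value (128..159) -> Unicode code point.
    static Map<Integer, Integer> build() {
        Charset cp1252 = Charset.forName("windows-1252");
        Map<Integer, Integer> table = new LinkedHashMap<>();
        for (int b = 128; b <= 159; b++) {
            int cp = new String(new byte[] {(byte) b}, cp1252).codePointAt(0);
            // Skip unassigned bytes: they decode to U+FFFD or to themselves.
            if (cp == 0xFFFD || cp == b) {
                continue;
            }
            table.put(b, cp);
        }
        return table;
    }

    public static void main(String[] args) {
        // Same "(unicode . windows)" layout as the list above.
        for (Map.Entry<Integer, Integer> e : build().entrySet()) {
            System.out.println("(" + e.getValue() + " . " + e.getKey() + ")");
        }
    }
}
```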
So you need to take the data from the db, which is encoded in windows-1252, and convert it to Unicode (U+20AC for the euro, specifically).
In what form is the data coming from the db: a byte array, or chars? If it's a byte array, you can use this String constructor:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#String(byte[],%20java.lang.String)
with "Windows-1252" as the encoding. That will create a java String (UTF-16, although you don't need to worry about that).
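For the direction asked about in the first post (getting 8364 from 128), a minimal sketch along these lines should do. It uses the String(byte[], Charset) overload (Java 6 or later) instead of the String(byte[], String) constructor linked above, only to avoid the checked UnsupportedEncodingException. The byte values standing in for the database content are invented for the example.

```java
import java.nio.charset.Charset;

final class DecodeDemo {
    // Interprets raw bytes as windows-1252 and returns an ordinary Java String.
    static String fromWindows1252(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Hypothetical bytes as they might come from the db: "10 " followed by byte 128.
        byte[] fromDb = {0x31, 0x30, 0x20, (byte) 0x80};
        String s = fromWindows1252(fromDb);
        System.out.println((int) s.charAt(3)); // 8364
    }
}
```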
Then you can write it out to the XML as UTF-8 (assuming that's the encoding of your XML doc) using the method I gave before. I can't say off the top of my head whether you need to encode the character as &#8364; manually or whether the JSP can encode it for you.
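If the numeric form does turn out to be needed, one possible sketch is to replace every non-ASCII character with a decimal character reference before the text goes into the XML. The class and method names are hypothetical, and this deliberately does not escape XML markup characters such as & or <, which would still need their usual handling.

```java
final class XmlNumericEscape {
    // Replaces each character outside ASCII with a decimal reference such as &#8364;.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 128) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars so surrogate pairs are handled as one code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("10 \u20AC")); // 10 &#8364;
    }
}
```

The output is pure ASCII, so it reads the same whatever charset the XML file or the JSP declares.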
And come to think of it, why are you using ISO-8859-15 for your JSP anyway? UTF-8 is much more flexible and doesn't have the limitations that the ISO-8859 family does. If every step of your program uses UTF-8, you have that much less conversion to do between charsets.
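To see the limitation concretely, here is a small sketch (class name made up) that asks each charset whether it can represent a given character at all. The euro exists in ISO-8859-15 but not in ISO-8859-1, while a character such as the ellipsis (8230 in the list above) exists in windows-1252 but in neither ISO charset. UTF-8 can encode all of them.

```java
import java.nio.charset.Charset;

final class CanEncodeDemo {
    // True if the named charset has a byte representation for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char euro = '\u20AC';
        char ellipsis = '\u2026';
        System.out.println(canEncode("ISO-8859-1", euro));      // false
        System.out.println(canEncode("ISO-8859-15", euro));     // true
        System.out.println(canEncode("ISO-8859-15", ellipsis)); // false
        System.out.println(canEncode("windows-1252", ellipsis)); // true
        System.out.println(canEncode("UTF-8", ellipsis));       // true
    }
}
```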
And since you don't have a choice :), read this while you're at it:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html