How to encode Text in ANSI?

I am trying to encode/decode text that is written/read by a Visual Basic application. My code works for most special characters, but not for all, e.g. the euro sign (€). In Visual Basic, the ASC() function returns 128 as the code, which is correct according to the ANSI table at http://www.torsten-horn.de/techdocs/ascii.htm

But my Java code returns -128 (and other, different numbers for other special characters):

final CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
final String s = new String(plainText.getBytes(encoding), encoding);
System.out.println("canEncode: " + encoder.canEncode(s));
final ByteBuffer bb = encoder.encode(CharBuffer.wrap(s));
final byte[] text = bb.array();
for (int i = 0; i < text.length; i++) {
    final StringBuilder sb = new StringBuilder();
    sb.append(new String(new byte[] {text[i]})); // character
    sb.append(" (" + (int) text[i] + ") : "); // code
    System.out.println(sb.toString());
}

The output is:

VB: € (128)

Java: € (-128)

BTW: using Unicode/AscW() is not an option, as it doesn't work with other Microsoft components (database, GUI controls, ...).

[1722 byte] By [MartinHilpert] at [2007-11-27 8:08:00]
# 1
What's the encoding you use?
quitte at 2007-7-12 19:50:48 > top of Java-index,Java Essentials,Java Programming...
# 2
Ah, wait. As far as I can see it's OK; -128 is the value of the bit pattern 10000000 when it is read as a signed (two's complement) byte, and the same bits read as unsigned are 128. Positive 128 does not fit into a Java byte. Try this instead:

sb.append(" (" + (0x00ff & text[i]) + ") : ");
quitte at 2007-7-12 19:50:48
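
To make the masking suggestion above concrete, here is a minimal sketch. It assumes the "encoding" variable from the question stands for windows-1252; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

class SignedByteDemo {
    // Java bytes are signed (-128..127); masking widens to an int in 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        // U+20AC is the euro sign; windows-1252 encodes it as the single byte 0x80.
        byte[] encoded = "\u20AC".getBytes(Charset.forName("windows-1252"));
        byte b = encoded[0];
        System.out.println("signed:   " + b);             // prints -128
        System.out.println("unsigned: " + toUnsigned(b)); // prints 128
    }
}
```

The byte in the array is the same in both cases; only the way it is printed differs.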
# 3

> I try to encode/decode text that was written/read by

> a Visual Basic application. My code works for most

Just use the standard string encoding/decoding with CP1252 character set.

http://en.wikipedia.org/wiki/Cp1252

> the ANSI table at http://www.torsten-horn.de/techdocs/ascii.htm

Personally, I wouldn't rely too much on an "ANSI table." An "ASCII table," maybe, although not for any code point values above 127. An "ANSI control sequences table," maybe, but that has nothing to do with character encoding.

So let's look at what you're actually doing ...

> final CharsetEncoder encoder = Charset.forName(encoding).newEncoder();

> final String s = new

> String(plainText.getBytes(encoding), encoding);

So here you get an encoder for some unspecified charset, take a string (that's in Unicode), encode that string into bytes, and then immediately decode those bytes back into characters. And the point of this little snippet is?

> final StringBuilder sb = new StringBuilder();

> sb.append(new String(new byte[] {text[i]}));

And here you take an array of arbitrary bytes, and use the platform's default charset to decode each one into a string.

Martin, I know that you get all pissy when people tell you that you don't have a clue, but really.

> BTW: using Unicode/AscW() is no option as it

> doesn't work with other Microsoft components

> (Database, GUI controls, ...).

Well, that's a problem, as all Java strings are in Unicode.

Start here: http://www.joelonsoftware.com/articles/Unicode.html

And then once you understand what Unicode really is, take a closer look at the Wikipedia article above (it's going to tell you most of what you need to know about Windows encoding), and finally, spend some time trying to understand what the Charset API does.

Captain.Obvious at 2007-7-12 19:50:48
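
A minimal sketch of what "standard string encoding/decoding with CP1252" could look like. It assumes the JDK in use ships the windows-1252 charset (Java's name for CP1252); the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;

class Cp1252RoundTrip {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // String (Unicode) -> bytes in the Windows code page.
    static byte[] encode(String text) {
        return text.getBytes(CP1252);
    }

    // Bytes in the Windows code page -> String (Unicode).
    static String decode(byte[] bytes) {
        return new String(bytes, CP1252);
    }

    public static void main(String[] args) {
        String original = "5 \u20AC"; // "5 " followed by the euro sign
        byte[] bytes = encode(original);
        System.out.println("byte count: " + bytes.length);               // 3
        System.out.println("euro byte:  " + (bytes[2] & 0xFF));          // 128
        System.out.println("round trip: " + decode(bytes).equals(original)); // true
    }
}
```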
# 4
Thanks, but this only works in the StringBuilder, which is only for debugging. When I try to use this in the working (final) code:

text[i] = (byte) (text[i] & 0x00ff);

it doesn't work anymore.
MartinHilpert at 2007-7-12 19:50:48
# 5

> Martin, I know that you get all pissy when people

> tell you that you don't have a clue, but really.

You are one of those people thinking that you know everything and just tell others that their code is stupid but don't show any working examples. So, your post is useless.

MartinHilpert at 2007-7-12 19:50:48
# 6

> > text[i] = (byte) (text[i] & 0x00ff);

>

Sure, get rid of the cast in this case, as you're casting a value to a byte which is out of a byte's range. Performing bitwise AND only makes sense when you're trying to convert a binary byte to an int/long.

quitte at 2007-7-12 19:50:48
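
A small sketch of why the cast undoes the mask: narrowing the masked int back to a byte reproduces the original signed byte, so the assignment in the quoted line changes nothing. The names here are illustrative only.

```java
class CastBackDemo {
    // Mask to 0..255, then narrow back to byte: the result equals the input.
    static byte maskAndCast(byte b) {
        return (byte) (b & 0x00ff);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        int widened = b & 0x00ff;       // 128: useful for display or arithmetic
        byte narrowed = maskAndCast(b); // -128 again: same bits as before
        System.out.println(widened + " / " + narrowed + " / " + (narrowed == b));
    }
}
```

The stored byte was never wrong; the mask only matters at the point where the value is widened to an int.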
# 7

> You are one of those people thinking that you know

> everything and just tell others that their code is

> stupid but don't show any working examples. So, your

> post is useless.

So telling you that Windows programs use CP1252 is useless?

Giving you links to articles that can actually help you is useless?

And if you're going to post code, some of us assume that you actually want critique. Of course, if you talk about writing an encoder and then don't actually post the code for that encoder, it's hard to critique the encoder.

OK, let's try it this way: yes Martin, your code is perfect. You are perfect. We are not worthy.

Captain.Obvious at 2007-7-12 19:50:48
# 8

> You are one of those people thinking that you know

> everything and just tell others that their code is

> stupid but don't show any working examples. So, your

> post is useless.

You are one of those people that posts a question but refuses to take any advice. So, your threads are useless.

~

yawmark at 2007-7-12 19:50:48
# 9
Woah, I didn't expect the Spanish inquisition :)
quitte at 2007-7-12 19:50:48
# 10
Nobody expects the Spanish inquisition!!!
-Kayaman- at 2007-7-12 19:50:48
# 11

> > > > text[i] = (byte) (text[i] & 0x00ff);

> >

>

> Sure, get rid of the cast in this case, as you're

> casting a value to a byte which is out of a byte's

> range. Performing bitwise AND only makes sense when

> you're trying to convert a binary byte to an int/long.

Yes, but I have to convert those bytes back to bytes/characters to have an (encoded) string as the result.

MartinHilpert at 2007-7-12 19:50:48
# 12
> Yes, but I have to convert those bytes back to bytes/characters to have an (encoded) string as the result.

new String(text, "Cp1252");
quitte at 2007-7-12 19:50:48
# 13
> I try to encode/decode text that was written/read by a Visual Basic application.

You know, there are classes called InputStreamReader and OutputStreamWriter which can handle decoding/encoding of Strings.
thomas.behr at 2007-7-12 19:50:48
# 14
> You know, there are classes called InputStreamReader and OutputStreamWriter which can handle decoding/encoding of Strings.

Even a simple FileReader/FileWriter should do the job, as long as Windows-1252 is the platform's default encoding for the respective locale.
quitte at 2007-7-12 19:50:48
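
A sketch of the stream-based approach with the charset named explicitly, so that it does not depend on the platform default. In-memory streams stand in for a file or socket here, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class StreamCharsetDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode text to bytes through an OutputStreamWriter.
    static byte[] write(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, CP1252)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode bytes to text through an InputStreamReader.
    static String read(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), CP1252)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = write("5 \u20AC");
        System.out.println(bytes.length + " bytes, read back equal: "
                + read(bytes).equals("5 \u20AC"));
    }
}
```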
# 15

>

> Yes, but I have to convert those bytes back to

> bytes/characters to have an (encoded) string as the

> result.

Java draws a strong line between characters and bytes. If it's characters, it's Unicode; when it's bytes (or in a file), it's in some specified encoding, for example one of the extended ASCII encodings. (ASCII does not define meanings for bytes 128-255.)

For windows it's "code pages". The meaning of bytes 128-255 on Windows depends on the locale settings of the operating system.

Whenever you convert between characters/Strings and byte arrays or streams, an encoding/decoding operation is always implied, if not explicitly given. You should never need to muck about with the integer values of characters or bytes.

If you stick to this scheme things generally work out fine. You just have to be sure you're using the right encoding for your input/output.

The encoding you'll effectively get by clipping Unicode to 8 bits is ISO-8859-1. This is probably not the encoding you want. Some of the characters in your Windows encoding will decode to character values above 255, which shouldn't bother you a bit.

malcolmmc at 2007-7-21 22:30:23
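
The difference between ISO-8859-1 and the Windows code page can be seen on the single byte 0x80. This is a minimal sketch assuming windows-1252 is the code page in question; the class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1VsCp1252 {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Code point that a single byte decodes to under the given charset.
    static int decodeToCodePoint(byte b, Charset charset) {
        return new String(new byte[] {b}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        // ISO-8859-1 maps every byte to the code point with the same value.
        System.out.printf("ISO-8859-1:   U+%04X%n",
                decodeToCodePoint(b, StandardCharsets.ISO_8859_1)); // U+0080
        // windows-1252 maps 0x80 to the euro sign, a code point above 255.
        System.out.printf("windows-1252: U+%04X%n",
                decodeToCodePoint(b, CP1252)); // U+20AC
        // The euro sign cannot be encoded in ISO-8859-1 at all.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode('\u20AC')); // false
    }
}
```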
# 16

Well, the exact problem is that Visual Basic writes encrypted text into an MS SQL Server database, and my Java app needs to read this encrypted text (via JDBC from the database) and decrypt it. So I tried to migrate the encryption/decryption code to Java but ran into this ANSI encoding problem, which produced wrong characters for the euro sign and some others.

So, no text file or such. With the hint of quitte (... & 0x00ff) I was able to encrypt the text to the same codes as the Visual Basic test app. But decrypting with the equivalent code doesn't work. The encrypted text somehow gets changed. I return the result in my encryption code via

result = new String(encryptedBytes, encoding);

and the decryption method gets this string and does a

final byte[] text = cryptedText.getBytes(encoding);

and the text bytes have different values than the encryptedBytes byte array from the encryption method.

MartinHilpert at 2007-7-21 22:30:23
# 17
> Well, the exact problem is, that Visual Basic writes crypted text into a MS SQL Server Database. And my Java app needs to read this encrypted text (via JDBC from the database) and decrypts it.

Read it back as bytes and not not not characters.
sabre150 at 2007-7-21 22:30:23
# 18
> result = new String(encryptedBytes, encoding);

I don't know what encryption algorithm you use, but I assume the result is binary (contains bytes in ANSI's control code areas) and should IMHO not be converted back to a String.
quitte at 2007-7-21 22:30:23
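
A sketch of why arbitrary bytes do not survive a trip through a String. It assumes windows-1252 as the encoding and relies on the behaviour of the JDKs I am aware of, where a few byte values (0x81 and 0x8D among them) have no character assigned: they decode to the replacement character and are encoded back as '?'. The byte values and names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class BinaryRoundTripDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode bytes to a String and encode them again, as in the question.
    static byte[] throughString(byte[] data) {
        return new String(data, CP1252).getBytes(CP1252);
    }

    public static void main(String[] args) {
        // Hypothetical "encrypted" bytes; 0x81 and 0x8D are unassigned in windows-1252.
        byte[] cipherBytes = {(byte) 0x80, (byte) 0x81, (byte) 0x8D, 0x41};
        byte[] back = throughString(cipherBytes);
        System.out.println("before: " + Arrays.toString(cipherBytes));
        System.out.println("after:  " + Arrays.toString(back));
        System.out.println("identical: " + Arrays.equals(cipherBytes, back));
    }
}
```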
# 19

If the data is really encrypted it's basically no longer character data until you decrypt it (unless it's been "armoured" using Base64 or hex format). So treat it as bytes.

You'll need to decrypt it before interpreting it as character data.

OTOH it doesn't really sound like it is, from what you're doing, rather it's merely encoded.

What you do sometimes get from database drivers is character data wrongly converted to strings through an incorrect encoding. This tends to depend on what kind of encoding you requested when you created the database. In that case you need to correct the encoding. Your best bet is probably to read the data as bytes as suggested. If you do have a String that was wrongly converted, you can correct it by changing it to bytes using one encoding, then back to a String with the other.

For example, if the characters are really encoded by "encoding" but the database driver has converted them by simply clipping to 8 bits:

new String(rs.getString(1).getBytes("ISO-8859-1"), encoding);

It's a nasty kludge, but it works.

malcolmmc at 2007-7-21 22:30:23
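
The re-decoding trick can be tried without a database. In this sketch a string literal stands in for what rs.getString(1) might return after the driver has clipped to 8 bits, and windows-1252 is assumed as the intended encoding; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RedecodeDemo {
    // Undo an ISO-8859-1 style decode, then decode with the intended charset.
    static String redecode(String wronglyDecoded, Charset intended) {
        byte[] rawBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, intended);
    }

    public static void main(String[] args) {
        // Assumed driver output: byte 0x80 turned into the control character U+0080.
        String fromDriver = "5 \u0080";
        String fixed = redecode(fromDriver, Charset.forName("windows-1252"));
        System.out.println(fixed.equals("5 \u20AC")); // true: U+0080 became the euro sign
    }
}
```

This only recovers the text if every character in the wrongly decoded string is in the range 0..255, which is the clipping case described above.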
# 20

Okay, I got the problem. As the Visual Basic code doesn't have this Java-Unicode-String problem, the VB functions can work with (VB) strings.

I finally got it to work by only converting the plain text once via

final byte[] text = plainText.getBytes(encoding);

and continue working with byte arrays. The result of encrypt() is a byte[] and decrypt(byte[], encoding) requires a byte array and the encoding. So I have to read the bytes from the database and hope that they will get read correctly ...

Thank you very much, quitte, sabre150 & malcolmmc!

MartinHilpert at 2007-7-21 22:30:23
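
A sketch of the byte-array shape described in the last post: encode the plain text once, keep everything encrypted as byte[], and decode only after decrypting. The XOR step is a placeholder and not the algorithm used in the thread (which was never posted), and it is not secure; the Base64 line shows the "armoured" form mentioned earlier for cases where the data must travel as text. All names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Base64;

class ByteArrayCryptoSketch {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Placeholder transformation: XOR every byte with one key byte.
    static byte[] xor(byte[] input, byte key) {
        byte[] out = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = (byte) (input[i] ^ key);
        }
        return out;
    }

    // Text is encoded exactly once; the result stays a byte array.
    static byte[] encrypt(String plainText, Charset encoding, byte key) {
        return xor(plainText.getBytes(encoding), key);
    }

    // Bytes are decoded to text only after decryption.
    static String decrypt(byte[] cipherBytes, Charset encoding, byte key) {
        return new String(xor(cipherBytes, key), encoding);
    }

    public static void main(String[] args) {
        byte key = 0x5A;
        byte[] cipherBytes = encrypt("5 \u20AC", CP1252, key);
        // If the encrypted data has to be stored or sent as text, armour it.
        String armoured = Base64.getEncoder().encodeToString(cipherBytes);
        byte[] restored = Base64.getDecoder().decode(armoured);
        System.out.println(decrypt(restored, CP1252, key).equals("5 \u20AC")); // true
    }
}
```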