Cross-platform String encoding problem.

Given the following code, which converts "Hello, World!" from the default encoding to Cp1047 (a flavor of EBCDIC) and back to the default encoding:

import java.io.UnsupportedEncodingException;

public class EbcdicTest
{
    public EbcdicTest()
    {
        String hello = "Hello, World!";
        try
        {
            String ebcdic = new String(hello.getBytes("Cp1047"));
            System.out.println(ebcdic);
            String uni = new String(ebcdic.getBytes(), "Cp1047");
            System.out.println(uni);
        }
        catch (UnsupportedEncodingException e)
        {
            System.err.println("Cp1047 not supported :(");
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args)
    {
        new EbcdicTest();
    }
}

When I run this on Win32, the output is:

╚圄酐k@蝶拄鋃

Hello, World!

However, running on RHEL (Intel hw), I get:

?k@?Z

, !

Now, I expect the default console encoding explains why the Cp1047 encoded String renders differently. But I'm at a loss why it is different when it's converted back to "Hello, World!".

Any ideas?

[1924 byte] By [bckrispia] at [2007-10-3 5:21:38]
# 1

Unless I've got the parens mixed up, this is invalid:

String ebcdic = new String(hello.getBytes("Cp1047"));

It will perform the conversion from the "hello" string to a byte array, using the specified encoding. Then it will try to take those bytes and create a string from them using the platform-specific encoding. I don't know EBCDIC, but would guess that some EBCDIC character is a bozo sequence in the default charset of your Linux box.

Instead, you should be doing the conversion like this:

byte[] ebcdic = hello.getBytes("Cp1047");

String uni = new String(ebcdic,"Cp1047");

How you print the EBCDIC bytes is up to you.
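
To make the byte-array approach above concrete, here is a minimal sketch. The class name EbcdicRoundTrip and its helper methods are made up for illustration, and it assumes the JRE ships the Cp1047 charset (it lives in the extended charsets, so a stripped-down runtime may not have it). The bytes are printed as hex, which is one way to look at them without involving the console encoding at all.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original post.
class EbcdicRoundTrip {
    // Assumption: Cp1047 is available; Charset.forName throws if it is not.
    static final Charset EBCDIC = Charset.forName("Cp1047");

    // Unicode text -> EBCDIC bytes. The result stays a byte[], never a String.
    static byte[] encode(String text) {
        return text.getBytes(EBCDIC);
    }

    // EBCDIC bytes -> Unicode text, decoded with the same charset.
    static String decode(byte[] bytes) {
        return new String(bytes, EBCDIC);
    }

    // Render bytes as space-separated hex so no console charset is involved.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] ebcdic = encode("Hello, World!");
        System.out.println(toHex(ebcdic));   // the raw EBCDIC bytes
        System.out.println(decode(ebcdic));  // back to the original text
    }
}
```

Because the platform default charset is never consulted, this should behave the same on Windows and Linux.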

kdgregorya at 2007-7-14 23:28:36 > top of Java-index,Java Essentials,Java Programming...
# 2

I may be totally wrong but shouldn't it be

String uni = new String(ebcdic.getBytes("Cp1047"),"Cp1047");

Because ebcdic is a String that used Cp1047 encoding (in theory). So when you get the bytes, don't you need to get them in that encoding, and not the system default?

Uni is supposedly going to be a String with Cp1047 but it isn't...

I *think* you output with the system default encoding and then tell it it's something else. And that's wrong. I think.

cotton.ma at 2007-7-14 23:28:36
# 3
> Unless I've got the parens mixed up, this is

<scratches head/>

Your example could be

String uni = new String(hello.getBytes("Cp1047"),"Cp1047");

Just the same right?
cotton.ma at 2007-7-14 23:28:36
# 4

Hmm, interesting. That seems to work, thank you. As well, I found this in the bug database: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4886772

Unfortunately, this doesn't fix the root problem I have (this program was just a prototype test). The problem is that a system I talk to has a bug whereby a stream of EBCDIC-encoded data is built into a Java String without specifying the correct encoding to apply, i.e., the data already exists as a String object, and not as an array of bytes. I'm trying to massage this String into an encoding that I can grok.

Any ideas if this is possible?

bckrispia at 2007-7-14 23:28:36
# 5

> Unfortunately, this doesn't fix the root problem I
> have (this program was just a prototype test). The
> problem is that a system I talk to has a bug whereby
> a stream of EBCDIC-encoded data is built into a Java
> String without specifying the correct encoding to
> apply, i.e., the data already exists as a String
> object, and not as an array of bytes. I'm trying to
> massage this String into an encoding that I can
> grok.
>
> Any ideas if this is possible?

Mightn't it already be too late?

cotton.ma at 2007-7-14 23:28:36
# 6

> Hmm, interesting. That seems to work, thank you. As
> well, I found this in the bug database:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4886772
>
> Unfortunately, this doesn't fix the root problem I
> have (this program was just a prototype test). The
> problem is that a system I talk to has a bug whereby
> a stream of EBCDIC-encoded data is built into a Java
> String without specifying the correct encoding to
> apply, i.e., the data already exists as a String
> object, and not as an array of bytes. I'm trying to
> massage this String into an encoding that I can
> grok.

Strings always contain Unicode - always.

So if you have a Java string, it has Unicode - nothing else.

If someone started with some character set and incorrectly encoded it into a Java string, then the original characters do not exist in the string. They always get mapped to Unicode.

Emphasizing that again - the mapping ALWAYS occurs.

So if there was no mapping for a byte, then the value in the string itself will be a replacement character (U+FFFD when decoding, which often prints as '?'). It would not be a display problem on your end.

So you would not be starting with anything that, given no other constraint, could be mapped back to the original encoding.

Now if you can apply another constraint, like that it is a known/fixed set of strings, then you could create an algorithm that maps it.
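
A small sketch of the point above. It takes the Cp1047 bytes for "Hello, World!" (hard-coded here so that no EBCDIC charset is needed), decodes them with the wrong charset, and checks whether the original bytes can still be pulled back out of the resulting String. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original post.
class LossyDecodeDemo {
    // "Hello, World!" encoded as Cp1047, written out by hand.
    static final byte[] EBCDIC_HELLO = {
        (byte) 0xC8, (byte) 0x85, (byte) 0x93, (byte) 0x93, (byte) 0x96,
        0x6B, 0x40,
        (byte) 0xE6, (byte) 0x96, (byte) 0x99, (byte) 0x93, (byte) 0x84, 0x5A
    };

    // Decode the raw bytes with the wrong charset, encode them back with the
    // same charset, and report whether the original bytes came through intact.
    static boolean survives(byte[] raw, Charset wrongCharset) {
        String garbled = new String(raw, wrongCharset);
        byte[] back = garbled.getBytes(wrongCharset);
        return Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:      " + survives(EBCDIC_HELLO, StandardCharsets.UTF_8));
        System.out.println("US-ASCII:   " + survives(EBCDIC_HELLO, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1: " + survives(EBCDIC_HELLO, StandardCharsets.ISO_8859_1));
    }
}
```

With UTF-8 or US-ASCII some bytes have no valid mapping and are replaced during decoding, so the information is gone from the String itself. ISO-8859-1 is the odd one out, since it maps every byte value to a character.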

jschella at 2007-7-14 23:28:36
# 7
^ Ok, that makes sense. Much appreciated.
bckrispia at 2007-7-14 23:28:36
# 8

> String uni = new String(hello.getBytes("Cp1047"),"Cp1047");
>
> Just the same right?

Exactly. My point was that it was incorrect to perform the conversion as shown.

To bckrispi: while I agree with jschell that some encoding conversion must have occurred to produce the string, there's a big question of exactly what encoding was applied. If the encoding was ISO-8859-1, then the byte values were unchanged when they were converted to Unicode. You can extract those bytes, and apply the correct encoding to get the original text back.

If, however, it was any other encoding, you're stuck.

So, assuming that str is the input string, where EBCDIC bytes got dumped, try this (forgive any typos) and see if you get back what you expect:

byte[] bb = str.getBytes("iso-8859-1");

String ss = new String(bb, "Cp1047");
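
To try the two lines above without the buggy system at hand, here is a rough sketch that first reproduces the suspected bug and then undoes it. It rests on two assumptions: that the upstream system decoded the EBCDIC bytes as ISO-8859-1 (which is exactly the thing that needs to be confirmed), and that the JRE has the Cp1047 charset. The class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class Latin1Recovery {
    // Assumption: Cp1047 is available; Charset.forName throws if it is not.
    static final Charset EBCDIC = Charset.forName("Cp1047");

    // Simulates the upstream bug: EBCDIC bytes wrongly decoded as ISO-8859-1.
    static String simulateBug(String original) {
        byte[] ebcdicBytes = original.getBytes(EBCDIC);
        return new String(ebcdicBytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes it: pull the untouched byte values back out, then decode properly.
    static String recover(String garbled) {
        byte[] bb = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bb, EBCDIC);
    }

    public static void main(String[] args) {
        String garbled = simulateBug("Hello, World!");
        System.out.println(recover(garbled));
    }
}
```

If recover() gives back garbage on the real data, that suggests the upstream system used some other charset when it built the String.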

Edit: as I think about this some more, you should be able to recover from any encoding that has a 1:1 correspondence to Unicode. For example, ISO-8859-9 translates byte D0 into U+011E; this is reversible (although I'm not sure that it translates the non-printing characters). On the other hand, UTF-8 is not reversible (because the input could contain invalid UTF-8 byte sequences).

My suggestion remains finding out the original incorrect encoding, then trying to determine if it's reversible for all values.
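
One way to run that check, as a sketch: decode each of the 256 byte values on its own with the candidate charset, encode the result back, and see whether the same single byte comes out. The class name is hypothetical. Note that this only settles the question for single-byte charsets; for multi-byte charsets, passing it would be a necessary condition but not proof.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original post.
class ReversibilityCheck {
    // True if every single byte value survives decode-then-encode unchanged.
    static boolean isReversible(Charset cs) {
        for (int i = 0; i < 256; i++) {
            byte[] in = { (byte) i };
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length != 1 || out[0] != in[0]) {
                return false;  // this byte was replaced or expanded
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Pass charset names on the command line, e.g. ISO-8859-9 windows-1252
        String[] names = args.length > 0 ? args : new String[] { "ISO-8859-1", "UTF-8" };
        for (String name : names) {
            System.out.println(name + ": " + isReversible(Charset.forName(name)));
        }
    }
}
```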

kdgregorya at 2007-7-14 23:28:36