UNIX character set problems.

I don't believe the following problem exists on Windows...

I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default character set is EUC-JP.

I'm seeing two strange things when using Japanese character sets...

1) If I write a program that does

System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP

the resulting output looks NOTHING like the characters I typed in. Apparently the character set being used to read the literal is different from the default. I'm sitting in front of the Japanese machine using the Japanese keyboard.

2) If I create an XML document using the built-in DOM which contains elements with values in Japanese, I get strangeness when I transform that into an XML document. If I do not set the character set in the transformer, the document will say it's in UTF-8 (the XML header will). However, the actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities (they know nothing of XML, just character sets), and when I read the document while telling uconv it is UTF-8, it claims it is invalid UTF-8. However, if I tell it the document is EUC-JP, it says it's good.

Also, when I change the transformer to use EUC-JP it creates the same document bit-for-bit (other than changing the XML header to say EUC-JP). Other character sets (UTC, etc) result in a different document.

So, my conclusion is that by default the XML DOM says it's UTF-8 in the header, but ALWAYS uses the platform default unless you specify something else (UTC for example).

Has anyone else seen this?

Here is my transformer...

Document new_document = documentBuilder.parse("japan2.xml");
System.out.println("I just read japan2.xml");

DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);

Properties p = transformer.getOutputProperties();

// try explicit EUC
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");

// try default (EUC)
//p.setProperty(OutputKeys.ENCODING,
//    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

// try UTF explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");

transformer.setOutputProperties(p);

// print the output properties the transformer will actually use
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);

transformer.transform(new_source, new_result);
String new_text_doc = new_writer.toString();
System.out.println("XML doc is " + new_text_doc);

Resulting document...

XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq

confirmed="true"

invokeId="2"><AlertList><Alert><Name>ja_alert-とちつなのに</Name><AffectedObjects

type="Obj"><Obj><Name>ja_mo-あえいおう</Name></Obj></AffectedObjects><Properties><Property><Name>Severity</Name><Value>major</Value></Property><Property><Name>Manager</Name><Value>NetExpert</Value></Property></Properties></Alert></AlertList><AttrList><Attr

name="TOD"><Int32>1112980583</Int32></Attr><Attr

name="DMPAlarmObject"><Str>ja_mo-あえいおう</Str></Attr><Attr

name="CLASS"><Str>NetExpert</Str></Attr><Attr

name="MANAGER"><Str>NetExpert</Str></Attr><Attr

name="DMPAlarmName"><Str>ja_alert-とちつなのに</Str></Attr><Attr

name="ARCHIVE_LENGTH"><Int32>0</Int32></Attr><Attr

name="DMPAlarmSeverity"><Str>major</Str></Attr><Attr

name="MsgType"><Str>Alarm</Str></Attr><Attr

name="MGR_PORT_KEY"><Int32>93</Int32></Attr><Attr

name="ARCHIVE_OFFSET"><Int32>0</Int32></Attr></AttrList></GenAlertsReq>

When I try to read it using IBM's ICU character set tool uconv, I get the following...

=> uconv -f UTF-8 ~/test/xml/japan.xml

Conversion to Unicode from codepage failed at input byte position 116.

Bytes: a4 Error: Illegal character found

<?xml version="1.0" encoding="UTF-8"?>

<GenAlertsReq confirmed="true"

invokeId="1"><AlertList><Alert><Name>ja_alert-

However, when I tell it the document is EUC-JP it works...

=> uconv -f EUC-JP ~/test/xml/japan.xml

<?xml version="1.0" encoding="UTF-8"?>

<GenAlertsReq confirmed="true" invokeId=......

So, the document appears to be EUC-JP even though the Java DOM says it's UTF-8.
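For anyone without uconv handy, roughly the same check can be sketched in Java with a strict CharsetDecoder. This is only a sketch under assumptions: the class name EncodingCheck and the method isValid are made up for illustration, and the sample bytes 0xA4 0xC8 (the EUC-JP encoding of one hiragana character) stand in for the real file contents.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original program.
public class EncodingCheck {

    // Returns true only if the whole byte array decodes cleanly in the
    // named charset; malformed or unmappable input is reported as an
    // error instead of being silently replaced.
    static boolean isValid(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ASCII "ja_" followed by the two bytes 0xA4 0xC8 (assumed sample data).
        byte[] sample = { 0x6A, 0x61, 0x5F, (byte) 0xA4, (byte) 0xC8 };

        System.out.println("valid UTF-8:  " + isValid(sample, "UTF-8"));

        // EUC-JP is an extended charset, so check that this runtime has it.
        if (Charset.isSupported("EUC-JP")) {
            System.out.println("valid EUC-JP: " + isValid(sample, "EUC-JP"));
        }
    }
}
```

A lone 0xA4 byte is a UTF-8 continuation byte with no lead byte before it, which fits the "Bytes: a4" error that uconv reports.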

-Robert

[4999 byte] By [RobertGary1a] at [2007-10-1 11:04:43]
# 1

Hello Robert.

1) Make sure the font used by Java for displaying the console output has the symbols you want to print. If the font has the symbols, then try telling the compiler (javac) the character encoding of your Java sources with the -encoding option. Try compiling with

javac -encoding ISO-8859-1 MySource.java

replacing ISO-8859-1 with the encoding of your sources. Another thing you may want to try is displaying the Japanese characters in System.out.println() by specifying their Unicode values with escape sequences of the form \uxxxx, where xxxx is exactly four hexadecimal digits. For example, try

System.out.println("Hello \u1234 world !");
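To make the escape-sequence idea concrete, here is a minimal sketch. It assumes Java 6 or later, and the class name EscapeDemo and the helper encode are invented for this example. The text is written entirely with Unicode escapes, so the source file is plain ASCII and compiles the same under any javac -encoding setting. The output goes through a PrintStream with an explicitly named encoding rather than the platform default.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;

// Hypothetical demo class; names are illustrative only.
public class EscapeDemo {

    // Five hiragana characters written as Unicode escapes, so the
    // compiler never has to guess the encoding of this source file.
    static final String TEXT = "\u3042\u3048\u3044\u304a\u3046";

    // Encode the text with an explicitly named charset instead of
    // relying on the platform default.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) throws Exception {
        // Pass the terminal's encoding as the first argument,
        // for example EUC-JP; UTF-8 is only this sketch's fallback.
        String charsetName = args.length > 0 ? args[0] : "UTF-8";

        // Wrap System.out so the bytes written match the chosen encoding.
        PrintStream out = new PrintStream(System.out, true, charsetName);
        out.println(TEXT);

        System.err.println(charsetName + " byte count: "
                + encode(TEXT, charsetName).length);
    }
}
```

If the characters still look wrong when the encoding passed on the command line matches the terminal, the font is the more likely culprit.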

2) Regarding the second problem:

Personally, I have not worked with the XML APIs, but from my experience working with text files in various encodings, it is very important to specify the encoding whenever you convert between bytes and Strings. There is a constructor in the String class, String(byte[] bytes, String charsetName), that lets you specify the encoding of the bytes being decoded into the String instance. If specifying EUC-JP to the transformer produces a correct XML document, leave it that way.
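One possible explanation, offered as an assumption rather than something confirmed in this thread: the posted code transforms into a StringWriter, which holds characters rather than bytes. In that case the transformer can only write the encoding name into the header, and the real bytes are produced later by whatever prints the String (System.out uses the platform default). A sketch of the alternative is to hand the transformer a byte stream, so it does the encoding itself. The class and method names below are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; the names are illustrative, not from the thread.
public class EncodedXmlDemo {

    // Build a tiny document containing Japanese text, written with
    // Unicode escapes so this source file stays pure ASCII.
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element name = doc.createElement("Name");
        name.setTextContent("ja_mo-\u3042\u3048\u3044\u304a\u3046");
        doc.appendChild(name);
        return doc;
    }

    // Serialize to bytes. Because the result wraps an OutputStream, the
    // transformer performs the character-to-byte conversion itself, so
    // the bytes and the declared encoding come from the same setting.
    static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(bytes));
        return bytes.toByteArray();
    }

    // Parse the bytes back and return the text of the root element.
    static String rootText(byte[] xml) throws Exception {
        Document parsed = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        return parsed.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = serialize(sampleDocument(), "UTF-8");
        boolean same = "ja_mo-\u3042\u3048\u3044\u304a\u3046".equals(rootText(utf8));
        System.out.println("round trip preserved the text: " + same);
    }
}
```

To save the result, write the byte array straight to a FileOutputStream rather than going through System.out.println, which would encode the text again with the platform default.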

Theodor-Ba at 2007-7-10 3:33:37