What is wrong with my UTF-16 writer?

I'm using

Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0)

Classic VM (build 1.3.0, J2RE 1.3.0 IBM build cx130-20001025 (JIT enabled: jitc))

and found the following behaviour quite strange. Take the following code example:

import java.io.*;

public class UTFTest {

    public static void main(String args[]) {
        try {
            // Create the file and write the first line.
            PrintWriter out = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream("test", false), "UTF-16"));
            out.println("Test");
            out.close();

            // Reopen the file in append mode and write the second line.
            out = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream("test", true), "UTF-16"));
            out.println("Test2");
            out.close();

            // Read the file back and print every line.
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("test"), "UTF-16"));
            String line = in.readLine();
            while (line != null) {
                System.out.println(line);
                line = in.readLine();
            }
            in.close();
        } catch (IOException e) {
            System.out.println(e.toString());
        }
    }
}

that writes a string to a text file, closes the file, reopens it to append another string, closes it again, and then prints its contents.

By doing so, I get the following output:

tb@abc:~ > javac UTFTest.java

tb@abc:~ > java UTFTest

Test

?Test2

tb@abc:~ >

Now the simple question is: where does the question mark '?' in front of "Test2" come from, and what can I do about it? I need to append text to a UTF-16 encoded file in my application; keeping the file open is not an option (although it would work in this example).

[3003 byte] By [tbraun] at [2007-9-26 3:13:22]
# 1

The actual bytes that are written to the file (on my system) are:

FE FF 00 54 00 65 00 73 00 74 00 0D 00 0A FE FF 00 54 00 65 00 73 00 74 00 32 00 0D 00 0A

(for the hex illiterate: some junk, "Test" as unicode, a windows style line break, the same junk again, "Test2" as unicode and another line break.)

The question is where the "FE FF" comes from, what it means, and why it isn't printed the first time.
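For anyone who wants to reproduce the byte dump without a hex editor, here is a minimal sketch. It encodes the two strings in memory instead of writing a file, on the assumption that String.getBytes with the "UTF-16" charset behaves like a freshly opened OutputStreamWriter with the same encoding. The class name BomDump and its helper methods are made up for this example, and StandardCharsets needs a newer JDK than the 1.3 used in the question.

```java
import java.nio.charset.StandardCharsets;

public class BomDump {

    // Encodes text as "UTF-16". Each call starts a fresh encoder,
    // just as each newly opened writer does.
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("Test")));
        System.out.println(hex(encode("Test2")));
    }
}
```

Both lines of output start with FE FF, so every separately opened "UTF-16" writer contributes its own copy of those two bytes.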

jsalonen at 2007-6-29 11:22:38 > top of Java-index,Archived Forums,Java Programming...
# 2

Thank you for trying it out and for your feedback! So at least it is not only IBM's JDK under Linux that is affected.

My first idea was that some kind of header (FE FF) is written at the beginning of a newly created UTF-16 file, and that it is mistakenly written again when data is appended to the file. When the file is reopened and read, the second FE FF is taken as data instead of as a header.

So can anybody think of a simple workaround?

tbraun at 2007-6-29 11:22:39
# 3

The Unicode names list says:

FEFF ZERO WIDTH NO-BREAK SPACE
= BYTE ORDER MARK (BOM)
* may be used to detect byte order by contrast with FFFE which is not a character
x (<not a character> - FFFE)
x (zero width space - 200B)

So the real problem seems to be that InputStreamReader (or its character converter) doesn't recognize FEFF as a byte order mark in the middle of the stream (or maybe it is designed *not* to change the character encoding on the fly?).
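To check what the decoder does with the second FE FF, here is a small in-memory sketch. It assumes that decoding a byte array with the "UTF-16" charset behaves like InputStreamReader on the file; the class name BomDecode and the method decodeAppended are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class BomDecode {

    // Simulates the appended file: two chunks, each encoded by a fresh
    // "UTF-16" encoder, concatenated and then decoded in one pass.
    public static String decodeAppended() {
        byte[] first = "Test\n".getBytes(StandardCharsets.UTF_16);
        byte[] second = "Test2\n".getBytes(StandardCharsets.UTF_16);
        byte[] all = new byte[first.length + second.length];
        System.arraycopy(first, 0, all, 0, first.length);
        System.arraycopy(second, 0, all, first.length, second.length);
        return new String(all, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String text = decodeAppended();
        // Report every U+FEFF that survived decoding as an ordinary character.
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFEFF') {
                System.out.println("U+FEFF found at index " + i);
            }
        }
    }
}
```

The leading mark is consumed by the decoder, but the one in the middle comes back as the character U+FEFF right in front of "Test2". A console that cannot display that character would plausibly show it as '?'.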

jsalonen at 2007-6-29 11:22:39
# 4

my workaround

int no = in.read();
while (no > -1) {
    if (Integer.toHexString(no).equalsIgnoreCase("FEff")) {
        // Skip the byte order mark (U+FEFF): print nothing for it
    } else {
        System.out.print((char) no);
    }
    no = in.read();
}
in.close();

hope it helps

insoft at 2007-6-29 11:22:39
# 5

You'll find the bug report 4391896 interesting. Maybe mixing different encodings in one file in general is not such a good idea. (although you are using utf16 all the time...)

Maybe there's another workaround: reading the file the way it was written. Read one line with one InputStreamReader and the other line with another InputStreamReader... but then you can't use BufferedReader, because it fills its buffer from the underlying stream. Or make a subclass of InputStreamReader that doesn't pass FEFF or FFFE through to the character stream...

> if( Integer.toHexString(no).equalsIgnoreCase("FEff") )

Isn't it better to compare ints? if (no == 0xfeff) ...
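Here is a sketch of the read loop with the int comparison. The class name BomSkipper and the method readWithoutBom are hypothetical, and the StringReader in main only stands in for the InputStreamReader on the real file. Note the assumption built into this approach: it drops every U+FEFF, including any that was meant as a real zero width no-break space.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class BomSkipper {

    // Reads all characters from the reader and drops every U+FEFF.
    public static String readWithoutBom(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int c = in.read();
            while (c > -1) {
                if (c != 0xFEFF) { // plain int comparison, no hex strings needed
                    sb.append((char) c);
                }
                c = in.read();
            }
            in.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the decoded file contents with a stray mark in the middle.
        Reader in = new StringReader("Test\n\uFEFFTest2\n");
        System.out.print(readWithoutBom(in));
    }
}
```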

jsalonen at 2007-6-29 11:22:39
# 6

No, wait a little...

http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc

This is what they tell about the encoding UTF-16: "Sixteen-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark"

You get no problems if you use the encodings "UTF-16BE" or "UTF-16LE", because they won't emit the BOM.

OutputStream outStream = new FileOutputStream("test", false);
PrintWriter out = new PrintWriter(new OutputStreamWriter(outStream, "UTF-16BE"));

(reading has to be done with the same encoding, of course)
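Put together, the append-and-read-back cycle from the original question could look roughly like this with "UTF-16BE". This is only a sketch: the class name Utf16BeAppend and its helper methods are made up, and it writes to a temporary file rather than to "test".

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class Utf16BeAppend {

    // Appends one line using UTF-16BE, which writes no byte order mark.
    static void appendLine(File file, String line) throws IOException {
        PrintWriter out = new PrintWriter(new OutputStreamWriter(
                new FileOutputStream(file, true), "UTF-16BE"));
        out.println(line);
        out.close();
    }

    // Reads all lines back with the same encoding that was used for writing.
    static List<String> readLines(File file) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), "UTF-16BE"));
        List<String> lines = new ArrayList<String>();
        String line = in.readLine();
        while (line != null) {
            lines.add(line);
            line = in.readLine();
        }
        in.close();
        return lines;
    }

    // Writes two lines with two separately opened writers, then reads them back.
    public static List<String> roundTrip() {
        try {
            File file = File.createTempFile("utf16be", ".txt");
            file.deleteOnExit();
            appendLine(file, "Test");
            appendLine(file, "Test2");
            return readLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

One trade-off to keep in mind: a file written this way carries no BOM at all, so any other program that reads it has to be told the byte order explicitly.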

jsalonen at 2007-6-29 11:22:39
# 7
wow, you guys really were a great help! thanks a lot!! now it works as I expected!
tbraun at 2007-6-29 11:22:39