What is wrong with my UTF-16 writer?
I'm using
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0)
Classic VM (build 1.3.0, J2RE 1.3.0 IBM build cx130-20001025 (JIT enabled: jitc))
and found the following behaviour quite strange. Take the following code example:
import java.io.*;

public class UTFTest {
    public static void main(String args[]) {
        try {
            // First open: create/truncate the file and write one line.
            PrintWriter out = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream("test", false), "UTF-16"));
            out.println("Test");
            out.close();

            // Second open: append another line with a new writer.
            out = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream("test", true), "UTF-16"));
            out.println("Test2");
            out.close();

            // Read the whole file back and print it line by line.
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream("test"), "UTF-16"));
            String line = in.readLine();
            while (line != null) {
                System.out.println(line);
                line = in.readLine();
            }
            in.close();
        }
        catch (IOException e) { System.out.println(e.toString()); }
    }
}
that writes a string to a text file, closes the file, reopens it to append another string, closes it again, and then prints the file's contents.
By doing so, I get the following output:
tb@abc:~ > javac UTFTest.java
tb@abc:~ > java UTFTest
Test
?Test2
tb@abc:~ >
Now the simple question is: Where does the question mark '?' in front of "Test2" come from? What can I do about it? I need to append text to a UTF-16 encoded file in my application, and keeping the file open is not an option (although it would work in this example).
[3003 byte] By [tbraun] at [2007-9-26 3:13:22]

The actual bytes that are written to the file (on my system) are:
FE FF 00 54 00 65 00 73 00 74 00 0D 00 0A FE FF 00 54 00 65 00 73 00 74 00 32 00 0D 00 0A
(for the hex illiterate: some junk, "Test" as Unicode, a Windows-style line break, the same junk again, "Test2" as Unicode and another line break.)
The question is where "FE FF" comes from, what it means, and why it isn't printed the first time.
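To reproduce that byte dump without touching the disk, here is a minimal sketch. The class BomDump and its methods are hypothetical names, and it assumes a fresh OutputStreamWriter per string (which is what reopening the file amounts to) and an explicit "\r\n" instead of the platform line separator.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper class, not part of the original program.
final class BomDump {

    // Encodes text with a brand-new writer, as each reopen of the file does.
    static byte[] encode(String text, String charsetName) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Writer out = new OutputStreamWriter(bytes, charsetName);
            out.write(text);
            out.close();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as space-separated two-digit upper-case hex values.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each call starts its own encoder, so each chunk begins with FE FF.
        System.out.println(hex(encode("Test\r\n", "UTF-16")));
        System.out.println(hex(encode("Test2\r\n", "UTF-16")));
    }
}
```

Both printed lines should begin with FE FF, which suggests the mark is written once per writer rather than once per file.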
The Unicode names list says:
FEFF ZERO WIDTH NO-BREAK SPACE
= BYTE ORDER MARK (BOM)
* may be used to detect byte order by contrast with FFFE which is not a character
x (<not a character> - FFFE)
x (zero width space - 200B)
So, the real problem seems to be with InputStreamReader (or its char converter), which doesn't recognize a FEFF in the middle of the stream (or maybe it is designed *not* to change the character encoding on the fly?).
You'll find the bug report 4391896 interesting. Maybe mixing different encodings in one file is not such a good idea in general (although you are using UTF-16 all the time...).
Maybe there's another workaround: reading the file the way it was written. Read one line with one InputStreamReader and the other line with another InputStreamReader... and you can't use BufferedReader, because it fills its buffer from the underlying stream. Or make a subclass of InputStreamReader that filters FEFF or FFFE out of the stream...
> if( Integer.toHexString(no).equalsIgnoreCase("FEff") )
Isn't it better to do the comparison with ints? if (no == 0xfeff) ...
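A rough sketch of the int comparison, for illustration only: BomFilter and readSkippingBom are made-up names, and the loop works on any Reader, so a StringReader stands in for the decoded file here. Note that it drops every U+FEFF, including any that were meant as a real zero width no-break space.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class, not from the original thread.
final class BomFilter {
    static final int BOM = 0xFEFF;

    // Reads all characters from the reader, skipping every U+FEFF.
    static String readSkippingBom(Reader in) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (c == BOM) {
                    continue; // plain int compare, no hex-string conversion
                }
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulates the decoded file: the second BOM survives as a character.
        String decoded = "Test\n\uFEFFTest2\n";
        System.out.print(readSkippingBom(new StringReader(decoded)));
    }
}
```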
No, wait a little...
http://java.sun.com/j2se/1.3/docs/api/java/lang/package-summary.html#charenc
This is what they tell about the encoding UTF-16: "Sixteen-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark"
You get no problems if you use the encodings "UTF-16BE" or "UTF-16LE", because they won't emit the BOM.
OutputStream outStream = new FileOutputStream("test", false);
PrintWriter out = new PrintWriter(new OutputStreamWriter(outStream, "UTF-16BE"));
(reading has to be done with the same encoding, of course)
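Put together, the append-and-read cycle with "UTF-16BE" on both sides could look like the sketch below. The class Utf16BeAppend, its method names, and the temp file are assumptions made for the example; the resulting file carries no BOM at all, so any other program reading it has to be told the byte order.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not from the original thread.
final class Utf16BeAppend {
    static final String ENCODING = "UTF-16BE";

    // Opens the file in append mode, writes one line, and closes it again.
    static void appendLine(File file, String line) {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file, true), ENCODING)) {
            out.write(line);
            out.write("\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads all lines back with the same encoding that wrote them.
    static List<String> readLines(File file) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), ENCODING))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Appends twice with separate writers, then reads the file back.
    static List<String> demo() {
        try {
            File file = File.createTempFile("utf16be", ".txt");
            file.deleteOnExit();
            appendLine(file, "Test");
            appendLine(file, "Test2");
            return readLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```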