UTF-8 file encoding problem

I am trying to read and write XML files in UTF-8 encoding.

If I first do it in ANSI my program works fine.

Next I save the file in UTF-8 encoding with Notepad.

If I read it in ANSI I have some garbage characters as expected.

But when I read it in UTF-8:

new BufferedReader(new InputStreamReader(new FileInputStream(file),"UTF8")).read(buf);

and echo the contents to my screen, it starts with a ? (question mark).

If I write the same contents in UTF-8 back to a file:

new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File(OUTPUT_DIR + files[i])),"UTF8")).write(data);

Notepad knows it is a UTF-8 file, but shows a square character (unknown char) at the start.

Wordpad turns that into 2 garbage characters, which are not the normal garbage you get from UTF-8.

Also, if I add a \n (new line), Notepad shows those as unknown characters too (only the ones I added), while Wordpad reads the newlines correctly.

Why do I still get garbage when I read a UTF-8 file as a UTF-8 file?

Why do I have 2 different newline characters?

I tried it with multiple files; all behave the same.

[1457 byte] By [mivz@spugium.neta] at [2007-10-3 5:26:33]
# 1

The extra characters at the beginning of the file may be a "byte order mark" (BOM), but it's hard to say without any technical details. Try to get a hex dump of the file.
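If no hex editor is at hand, a few lines of Java can produce the dump. The following is only a minimal sketch and not part of the original post; the class name HexHead and the 16-byte limit are arbitrary choices made for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: prints the first bytes of a file in hex so a BOM is visible.
final class HexHead {

    /** Formats up to max bytes of data as space-separated upper-case hex pairs. */
    static String toHex(byte[] data, int max) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length && i < max; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[16];
        int filled = 0;
        try (InputStream in = new FileInputStream(args[0])) {
            int n;
            // Keep reading until the buffer is full or the file ends.
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) != -1) {
                filled += n;
            }
        }
        System.out.println(toHex(head, filled));
    }
}
```

Run it with the file name as the only argument and look at the first two or three bytes of the output.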

> Also if I add a \n (new line), Notepad translates those also to

> unknown characters (only the ones I added),

This is because Notepad is stupid; it only recognizes DOS line breaks (which are "\r\n") and displays Unix line breaks ("\n") as unknown characters.
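If the output has to open cleanly in Notepad, one option is to normalize the line breaks to "\r\n" before writing. This is a small sketch under that assumption; the class and method names (LineEndings, toCrLf) are made up for illustration.

```java
// Hypothetical helper: rewrites every line break in a string as \r\n.
final class LineEndings {

    /** Converts each \r\n, lone \r, or lone \n to the DOS/Windows form \r\n. */
    static String toCrLf(String text) {
        // "\r\n" is listed first so an existing CRLF pair is matched as one unit.
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }

    public static void main(String[] args) {
        String fixed = toCrLf("TEST\nTEST\n");
        // Show the control characters visibly instead of printing real line breaks.
        System.out.println(fixed.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

The prefix from the question could then be written as toCrLf("TEST\nTEST\n") rather than with bare "\n" characters.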

jsalonena at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 2

Yes, that is the BOM, at least in the first instance. But when I read the file as UTF-8 it should not put the BOM in the String I read, right? If I just read and write it straight through, nothing goes wrong; the problem appears when I edit the data.

But maybe I am reading it wrong, because it is the BOM:

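// Sized in bytes, which is an upper bound on the number of chars decoded.
char[] buf = new char[(int)file.length()];

Reader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file),"UTF8"));

// read() returns how many chars were actually read, or -1 for an empty file.
int count = reader.read(buf);

reader.close();

// Use only the chars that were read, so no trailing \0 padding ends up in data.
data = new String(buf, 0, Math.max(count, 0));

Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File(OUTPUT_DIR + files[i])), "UTF8"));

writer.write("TEST\nTEST\n"+data);

writer.close();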

Notice the "TEST\nTEST\n" I add at the beginning of the file.

Results:

Notepad: TEST [square](new line) TEST [square](new line) [square](BOM) rest-XML

Ultraedit:

T

TEST

[space] rest-XML

wordpad:

TEST

TEST

锘?rest-XML

HEX:

00000000h: FF FE 54 00 0A 00 54 00 45 00 53 00 54 00 0A 00 ; ...T.E.S.T...

00000010h: FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 ; ..<.?.x.m.l. .v.

So if I understand correctly, the BOM stays at the start of the String, even when I read the file in UTF-8...

Am I doing something wrong with read(char[])?

Or maybe there is a bug in read(char[]) when reading UTF-8 files?

BufferedReader.readLine() also does the same and reads the BOM into the String, when using UTF-8 encoding.

mivz@spugium.neta at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 3

That file is UTF-16, not UTF-8.

UTF-8 files can have a BOM, but it's not FFFE or FEFF, and it doesn't actually affect the byte-order of the file.

Java Readers don't deal with BOM in UTF-8 files, IIRC. But they get even more confused when you present them with a UTF-16 stream containing a UTF-16 BOM and then ask them to treat it as UTF-8! :-)
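The byte patterns discussed here can be told apart by looking at the first two or three bytes: EF BB BF for the UTF-8 BOM, FF FE for UTF-16 little-endian, FE FF for UTF-16 big-endian. Below is a rough sketch of such a check; BomSniffer and detect are hypothetical names, and the check is deliberately incomplete (a UTF-32 little-endian BOM, FF FE 00 00, would be reported as UTF-16LE).

```java
// Hypothetical helper: guesses an encoding from a leading byte order mark.
final class BomSniffer {

    /** Names the encoding implied by a leading BOM, or returns "none". */
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return "none";
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // First bytes of the hex dump posted in this thread.
        byte[] fromThread = {(byte) 0xFF, (byte) 0xFE, 0x54, 0x00};
        // First bytes of a file that really is UTF-8 with a BOM.
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x74};
        System.out.println(detect(fromThread));
        System.out.println(detect(utf8WithBom));
    }
}
```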

http://www.unicode.org/faq/utf_bom.html#BOM

dannyyatesa at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 4
You are right: MS Notepad... I open or create a file, Save As... with the UTF-8 option selected. Open it in a hex editor: UTF-16 BOM?!!! They finally add encoding options... and they do it wrong.
mivz@spugium.neta at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 5
Whoops... All my UTF-8 files get automatically converted to UTF-16... Not only Notepad's, but also files from Eclipse, downloaded ones, and UltraEdit files. I think it's in the OS. I will report more later.
mivz@spugium.neta at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 6

There seems to be a bug in the core API when reading UTF-8 files.

I have a simple UTF-8 file that only contains 't' and a new line. The hex code of it is shown below (VIM HEX mode):

0000000: efbb bf74 0d0a  ...t..

We can see that the first three bytes "EF BB BF" are the signature of UTF-8. "74" is the letter 't'.

Then I use the following code to print out the content:

import java.io.*;

public class FTest
{
    public static final String ENCODING = "UTF-8";

    public static void main( String[] args ) throws Exception
    {
        // Open the file named on the command line, decoding its bytes as UTF-8.
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(args[0]), ENCODING));
        int lineCount = 0;
        while (true)
        {
            String line = reader.readLine();
            if (line == null) break;
            // Print each line, trimmed, between angle brackets.
            System.out.println( "line " + ++lineCount + " is <" + line.trim() + ">");
        }
        reader.close();
    }
}

And the output is :

line 1 is <?t>

There is an unexpected "?" before "t".

It mistakenly shows the last byte of the UTF-8 BOM as "?". This is probably a bug somewhere in the Java IO API.

The test environment is JDK1.4.2_12-b03 and 1.5.0_07-b03 on Windows XP SP2.

You can use Windows Notepad to create a new file and save it in UTF-8 format. There is no problem if I run the same code and file on Linux.

Thanks.
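One possible reading of this output is that the three BOM bytes decode to the single character U+FEFF rather than a stray byte, and that the console prints "?" because it cannot display that character. The sketch below, which is an illustration and not code from the post, decodes the same six bytes and lists the code points that survive trim(). It uses StandardCharsets, which is newer than the JDKs mentioned above; on those versions the charset name "UTF-8" would be passed as a string instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: shows which characters the UTF-8 decoder produces for a BOM.
final class BomDecodeDemo {

    /** Decodes raw bytes as UTF-8 with the standard Java decoder. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EF BB BF 74 0D 0A: UTF-8 BOM, the letter t, then CR LF.
        byte[] file = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x74, 0x0D, 0x0A};
        // trim() removes the trailing CR LF but not the leading U+FEFF,
        // because it only strips characters up to U+0020.
        String line = decode(file).trim();
        for (int i = 0; i < line.length(); i++) {
            System.out.printf("U+%04X%n", (int) line.charAt(i));
        }
    }
}
```

If the first code point listed is U+FEFF, the extra character is the decoded BOM, and how it is displayed depends on the console's encoding, which may explain the difference between Windows and Linux.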

ybzhenga at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 7
Yes, thank you. I knew I was not crazy. Did you report the bug, or should I? If so, what is its number so I can track it?
mivz@spugium.neta at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 8
I already said that the Java libraries don't deal with UTF-8 BOMs. This could be considered a bug, but, to be fair, I've never actually encountered a UTF-8 file with a BOM in the wild!
dannyyatesa at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 9

Notepad, Eclipse and UltraEdit have support for it.

Java has support for UTF-8 source files.

XML files: <?xml version='1.0' encoding='utf-8'?> should be able to have a BOM right?

If you use Windows you never will, because it will be automatically converted to UTF-16, making all UTF-8 files twice as big as needed (for European languages). I am still looking for a solution to this problem.

If we talk about an encoding that is defined as a standard, I think we should support the standard and not only what 'I' encounter. Why else would we define a standard? So that you know what to expect in all cases covered by the standard. Right?

Now it behaves unexpectedly and not according to what is documented elsewhere.
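For XML specifically, one way to sidestep the question is to hand the parser the raw bytes instead of a Reader, so that it does its own encoding detection. The sketch below assumes the XML parser bundled with a current JDK, which as far as I know accepts a leading UTF-8 BOM; the class and method names (XmlBomDemo, rootName, withUtf8Bom) are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo: parses XML bytes that start with a UTF-8 BOM.
final class XmlBomDemo {

    /** Parses raw XML bytes and returns the name of the root element. */
    static String rootName(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getNodeName();
    }

    /** Prepends the UTF-8 BOM (EF BB BF) to the UTF-8 encoding of text. */
    static byte[] withUtf8Bom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = withUtf8Bom("<?xml version='1.0' encoding='utf-8'?><root/>");
        System.out.println(rootName(xml));
    }
}
```

With a file on disk, the same idea applies by passing a FileInputStream or File to parse() instead of wrapping the stream in an InputStreamReader first.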

mivz@spugium.neta at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 10
> Now it behaves unexpectedly and not according to what

> is documented elsewhere.

Welcome to the world of MS!
sabre150a at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 11
And Sun?
mivz@spugium.neta at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 12
> And Sun?

Yes, but to a much lesser extent.
sabre150a at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 13

Hi,

This bug has nothing to do with MS. It's SUN's problem. Please don't blame MS blindly.

The UTF-8 file conforms to the standard, with the first 3 bytes being "EF BB BF". The bug is that SUN's Java API (JDK 1.4 and 1.5, maybe older versions too) can't read a UTF-8 file correctly on Windows (though it can on Linux). It treats the last byte of the BOM ('BF') as part of the body content, so readLine always returns a first line starting with "?".

I'm not familiar with SUN's bug tracking system. Could somebody report it for us? Thanks!

ybzhenga at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...
# 14

The UTF-8 BOM bug has already been reported and the resolution was "will not be fixed." http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
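Since the library will not strip the BOM, application code has to do it. A common workaround is to peek at the first decoded character and drop it if it is U+FEFF. The following is one possible sketch of that idea, not an official API; BomSkipper and openSkippingBom are names chosen for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: opens a UTF-8 reader and skips a leading BOM if present.
final class BomSkipper {

    /** Returns a UTF-8 reader positioned after the BOM, if the stream has one. */
    static BufferedReader openSkippingBom(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);
        if (reader.read() != 0xFEFF) {
            // First character was real content (or the stream was empty): rewind.
            reader.reset();
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // Same bytes as the test file described earlier: EF BB BF 74 0D 0A.
        byte[] file = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x74, 0x0D, 0x0A};
        BufferedReader reader = openSkippingBom(new ByteArrayInputStream(file));
        System.out.println("line 1 is <" + reader.readLine() + ">");
        reader.close();
    }
}
```

For a real file, a FileInputStream would take the place of the ByteArrayInputStream.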

The OP seems to be having problems having his UTF-8 files converted to some flavour of UTF-16 on Windows (as weird as it sounds). That has nothing to do with the byte order marker.

jsalonena at 2007-7-14 23:33:48 > top of Java-index,Core,Core APIs...