0x3F character on the first line when reading Unicode files

Hello,

My application must be capable of reading Unicode files (UTF-8, UTF-16BE and UTF-16LE).

To read a Unicode file I'm using a BufferedReader, something like this:

BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("infilename"), "UTF8"));
String str = in.readLine();
etc...

The problem is that when I read a Unicode file, the first character of the first line is always 0x3F.

Can you tell me where the problem is and what I am missing?

Thanks,

Adrian

[711 byte] By [adyca] at [2007-10-3 4:22:28]
# 1
Your files must contain some garbage before the actual text, possibly a BOM (byte order mark). Try examining them in a hex viewer to find out whether this is true. To read UTF-16 files that have a BOM, use the "UTF-16" encoding; it detects the byte order from the mark.
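If no hex viewer is at hand, a few lines of Java can do the same check. This is only a rough sketch: the class name BomSniffer and its detect helper are made up for illustration, and it looks only for the three BOMs relevant to this thread (it does not try to recognize UTF-32).

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class BomSniffer {
    /** Names the BOM found at the start of head, or "none" if there is none. */
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "none";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4];
        int n;
        try (InputStream in = new FileInputStream(args[0])) {
            n = in.read(buf); // a single read is enough for a quick peek at a local file
        }
        byte[] head = Arrays.copyOf(buf, Math.max(n, 0));
        StringBuilder hex = new StringBuilder();
        for (byte b : head) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println("first bytes: " + hex.toString().trim());
        System.out.println("BOM: " + detect(head));
    }
}
```

Run it with the file name as the only argument; a file that starts with EF BB BF, FE FF or FF FE has a BOM.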
jsalonena at 2007-7-14 22:24:45 > top of Java-index,Core,Core APIs...
# 2
Hi again, thanks for your quick response. All my files have a BOM, but normally the reader should be able to skip the 2 bytes (for UTF-16 BE/LE) or 3 bytes (for UTF-8) of the BOM, and the BOM does not contain a 0x3F character. Adrian
adyca at 2007-7-14 22:24:45
# 3
How do you detect the U+003F character? If you use UTF-16BE or UTF-16LE instead of UTF-16, the BOM is not interpreted; it is passed through as a normal character. This means that the stream would start with either \ufeff or \ufffd, both of which would be printed as '?' (U+003F).
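A small in-memory experiment makes the difference visible. This is a sketch only; the class name Utf16BomDemo and the sample bytes are invented for the demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    // Big-endian BOM (FE FF) followed by the letter "A" (00 41).
    static final byte[] DATA = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};

    static String decode(Charset cs) {
        return new String(DATA, cs);
    }

    public static void main(String[] args) {
        Charset[] charsets = {StandardCharsets.UTF_16BE, StandardCharsets.UTF_16};
        for (Charset cs : charsets) {
            String s = decode(cs);
            // Print the length and the first code unit, not the glyph,
            // so console encoding problems cannot hide the result.
            System.out.printf("%s: length=%d first=U+%04X%n",
                    cs.name(), s.length(), (int) s.charAt(0));
        }
    }
}
```

With UTF-16BE the decoded string has two characters and starts with U+FEFF; with UTF-16 the BOM is consumed and only "A" (U+0041) remains.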
jsalonena at 2007-7-14 22:24:45 > top of Java-index,Core,Core APIs...
# 4

I can see that in the debugger:

BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("infilename"), "UTFXX"));
String str = in.readLine();
byte[] bytes = str.getBytes();

The same character is present in bytes[0] whether I'm reading a UTF-8, UTF-16BE or UTF-16LE file.

adyca at 2007-7-14 22:24:45
# 5

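What is the default encoding? I suspect that the 0x3F byte (note: not character) is produced when the byte order mark is converted to bytes: str.getBytes() without an argument uses the platform default encoding, and an encoding that cannot represent U+FEFF substitutes '?' for it. The solution is getting rid of the BOM.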
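To illustrate the suspicion, here is a minimal sketch. The default encoding on the original machine is not known, so ISO-8859-1 is used as a stand-in assumption; any encoding without a mapping for U+FEFF should behave the same way. The class name BomToQuestionMark is made up.

```java
import java.nio.charset.StandardCharsets;

class BomToQuestionMark {
    static byte[] encodeLatin1(String s) {
        // U+FEFF has no ISO-8859-1 mapping, so the encoder substitutes '?' (0x3F).
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // What readLine() returns when the decoder leaves the BOM in place.
        String line = "\uFEFFHello";
        byte[] bytes = encodeLatin1(line);
        System.out.printf("first char: U+%04X%n", (int) line.charAt(0)); // U+FEFF
        System.out.printf("first byte: 0x%02X%n", bytes[0]);             // 0x3F
    }
}
```

In the debugger it is therefore more telling to look at (int) str.charAt(0) than at str.getBytes()[0].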

To do that, you can use UTF-16 in Java instead of UTF-16LE and UTF-16BE.

With UTF-8 it seems that the only way to get rid of the BOM is to discard the first character (i.e. call reader.read() once before calling readLine()).
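A slightly more careful variant of that idea, as a sketch: peek at the first character and discard it only if it really is U+FEFF, so files without a BOM are read unchanged. Utf8BomSkip, open and firstLine are hypothetical names, not part of any library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8BomSkip {
    /** Opens a UTF-8 reader and drops a leading U+FEFF if one is present. */
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);               // remember the start of the stream
        int first = reader.read();    // peek at the first character
        if (first != -1 && first != 0xFEFF) {
            reader.reset();           // not a BOM: put the character back
        }
        return reader;
    }

    /** Convenience for experiments: first line of in-memory data. */
    static String firstLine(byte[] data) {
        try (BufferedReader r = open(new ByteArrayInputStream(data))) {
            return r.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i', '\n'};
        byte[] withoutBom = {'h', 'i', '\n'};
        System.out.println(firstLine(withBom));
        System.out.println(firstLine(withoutBom));
    }
}
```

For a real file the call would look like Utf8BomSkip.open(new FileInputStream("infilename")).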

jsalonena at 2007-7-14 22:24:45
# 6
Hi again, I already did that, but for me it's not a clean solution :) From my point of view the problem is in the FileInputStream class; it should be aware of the presence of the BOM. Adrian
adyca at 2007-7-14 22:24:45
# 7
But it's not, and you can't change it. If you use UTF-16 (neither LE nor BE), the BOM is interpreted and removed from the stream. If you use UTF-16LE, UTF-16BE, or UTF-8, the BOM is not needed and is left in the stream.
jsalonena at 2007-7-14 22:24:45
# 8
Indeed, when I use UTF-16 it's OK; I have no BOM in the resulting stream. But the problem is with UTF-8. The BOM for UTF-8 is 0xEF 0xBB 0xBF, but in the stream I have a 0x3F byte; I wonder if it's a bug in the InputStreamReader class. Thanks, Adrian
adyca at 2007-7-14 22:24:45
# 9
I found this: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 It's a bug, and it will not be fixed for compatibility reasons.
adyca at 2007-7-14 22:24:45
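Since the decoder itself will not change, one possible workaround is to handle the BOM at the byte level before the reader sees it. The following is a sketch under assumptions, not the library's own mechanism: BomAwareReader, open and firstLine are invented names, and only the three encodings from this thread are recognized (a UTF-32LE file would be mistaken for UTF-16LE).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomAwareReader {
    /** Picks the charset from the BOM, consumes the BOM, and falls back if none is found. */
    static BufferedReader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {            // read up to 3 bytes, tolerating short reads
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        Charset cs = fallback;
        int bomLength = 0;
        if (n >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
            cs = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        // Push back everything that was read beyond the BOM.
        in.unread(head, bomLength, n - bomLength);
        return new BufferedReader(new InputStreamReader(in, cs));
    }

    private static int u(byte b) {
        return b & 0xFF; // unsigned value of a signed Java byte
    }

    /** Convenience for experiments: first line of in-memory data. */
    static String firstLine(byte[] data, Charset fallback) {
        try (BufferedReader r = open(new ByteArrayInputStream(data), fallback)) {
            return r.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as BomAwareReader.open(new FileInputStream("infilename"), StandardCharsets.UTF_8) would then replace the InputStreamReader construction from the first post, with the fallback charset used only for files that have no BOM.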