0x3F character on the first line when reading Unicode files
Hello,
My application must be capable of reading Unicode files (UTF-8, UTF-16BE and UTF-16LE).
To read the Unicode file I'm using a BufferedReader object, something like this:
BufferedReader in =new BufferedReader(
new InputStreamReader(new FileInputStream("infilename"),"UTF8"));
String str = in.readLine();
etc...
The problem is that when I read a Unicode file, the first character of the first line is always 0x3F.
Can you tell me where the problem is and what I am missing?
Thanks,
Adrian
[711 byte] By [adyca] at [2007-10-3 4:22:28]

Hi again, thanks for your quick response.
All my files have a BOM, but normally the reader is able to skip the 2 bytes (for UTF-16BE and UTF-16LE) or 3 bytes (for UTF-8) of the BOM, and the BOM does not contain a 0x3F character.
Adrian
adyca at 2007-7-14 22:24:45 >

I can see that in the debugger:
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream("infilename"), "UTFXX"));
String str = in.readLine();
byte[] bytes = str.getBytes();
The same character is present in bytes[0] whether I'm reading a UTF-8, UTF-16BE or UTF-16LE file.
adyca at 2007-7-14 22:24:45 >

What is the default encoding? I suspect that the 0x3F byte (note: a byte, not a character) is produced when str.getBytes() converts the byte order mark, which the reader decoded as the character U+FEFF, back to bytes: the default encoding cannot represent U+FEFF, so it substitutes '?' (0x3F). The solution is to get rid of the BOM.
To do that, you can use UTF-16 in Java instead of UTF-16LE and UTF-16BE.
With UTF-8 it seems that the only way to get rid of the BOM is to discard the first character if it is U+FEFF (i.e. call reader.read() once before calling readLine).
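Here is a minimal sketch of that "discard the first character" idea. It only drops the character when it really is U+FEFF, so files without a BOM are read unchanged. The class name BomSkippingReader and the helper open() are made up for illustration, and the byte array stands in for a real FileInputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class BomSkippingReader {
    /** Hypothetical helper: opens a reader and drops a leading U+FEFF if the decoder left one in the stream. */
    public static BufferedReader open(InputStream in, String charsetName) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charsetName));
        reader.mark(1);              // remember the start so we can rewind by one character
        int first = reader.read();
        if (first != 0xFEFF) {
            reader.reset();          // not a BOM (or an empty stream): put the character back
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 BOM (EF BB BF) followed by "Hi" and a newline; stands in for a file on disk.
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i', '\n'};
        try (BufferedReader r = open(new ByteArrayInputStream(utf8WithBom), "UTF-8")) {
            System.out.println(r.readLine()); // prints "Hi" with no leading U+FEFF
        }
    }
}
```

The same helper should also work for UTF-16LE and UTF-16BE, since those decoders hand the BOM through as U+FEFF as well.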
But it's not, and you can't change it.
If you use UTF-16 (neither LE nor BE), the BOM is interpreted and removed from the stream.
If you use UTF-16LE, UTF-16BE, or UTF-8, the BOM is not needed and is left in the stream.
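To see this for yourself, here is a small sketch that decodes a BOM followed by "A" with each charset and prints the first decoded character. It then encodes U+FEFF to ISO-8859-1, which is only an assumed stand-in for the poster's platform default encoding (the thread never says what it actually was). The class name BomDecodingDemo and the helper firstChar() are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomDecodingDemo {
    /** Decodes the bytes with the named charset and returns the first decoded character. */
    public static char firstChar(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName)).charAt(0);
    }

    public static void main(String[] args) {
        // "A" preceded by a BOM, once in big-endian UTF-16 and once in UTF-8.
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};

        // UTF-16 consumes the BOM; UTF-16BE and UTF-8 pass it through as U+FEFF.
        System.out.printf("UTF-16:   U+%04X%n", (int) firstChar(utf16be, "UTF-16"));   // U+0041
        System.out.printf("UTF-16BE: U+%04X%n", (int) firstChar(utf16be, "UTF-16BE")); // U+FEFF
        System.out.printf("UTF-8:    U+%04X%n", (int) firstChar(utf8, "UTF-8"));       // U+FEFF

        // ISO-8859-1 (assumed default) cannot represent U+FEFF, so it is replaced by '?'.
        byte[] encoded = "\uFEFF".getBytes(StandardCharsets.ISO_8859_1);
        System.out.printf("U+FEFF as ISO-8859-1: 0x%02X%n", encoded[0]);               // 0x3F
    }
}
```

So the 0x3F is not in the file at all: it appears only when the leftover U+FEFF is converted back to bytes in an encoding that has no such character.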