Failing to read Polish characters from unicode
I am experiencing a problem reading Polish characters from a csv file. The file contains comma delimited unicode text and displays the data correctly in notepad.
However, the LATIN CAPITAL LETTER L WITH STROKE and LATIN CAPITAL LETTER Z WITH ACUTE characters (these special characters do not display in this forum) are read as "?", which is 3F in Codepage 1250. It should display 8F for LATIN CAPITAL LETTER Z WITH ACUTE and A3 for LATIN CAPITAL LETTER L WITH STROKE.
Does anyone have experience reading data in this format? Any help would be greatly appreciated.
Most likely you're reading the file using the wrong encoding. Find out what encoding the file is in, then do this:
Reader r = new InputStreamReader(new FileInputStream(filename), encodingname);
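To make that concrete, here is a minimal sketch of reading with an explicit encoding. The class name ReadWithEncoding, the helper readFirstLine, and the temporary sample file are made up for illustration, and UTF-16 is only an assumption about what the real file contains.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithEncoding {
    // Hypothetical helper: reads the first line of a file, decoding its bytes
    // with the charset the caller names instead of the platform default.
    static String readFirstLine(Path file, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data: L WITH STROKE (U+0141), a comma, Z WITH ACUTE (U+0179).
        String sample = "\u0141,\u0179";
        Path file = Files.createTempFile("jobcode", ".csv");
        // Encoding with UTF-16 writes a byte order mark followed by the text.
        Files.write(file, sample.getBytes(StandardCharsets.UTF_16));

        // Decoding with the same charset consumes the byte order mark again.
        String line = readFirstLine(file, StandardCharsets.UTF_16);
        System.out.println(line.equals(sample)); // prints "true"
        Files.delete(file);
    }
}
```

If the comparison comes out false with your real file, the charset passed to the reader probably does not match the one the file was saved in.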
The other possibility is that your "display" is not working right. I'm going to have to guess on that one too, since you didn't say anything about it. If it's a GUI display and you're getting boxes then your font can't render the characters. If it's the console then forget it -- unless you can actually display the file accurately on the console outside Java.
Hi, thanks for your timely response. I believe I'm doing as you suggest but still having no joy; this is my implementation:
// read in using unicode charSet
String filename = "jobcode.csv";
FileInputStream fis=new FileInputStream(filename);
BufferedReader br=new BufferedReader(new InputStreamReader(fis, "UTF-16"));
String myLine = br.readLine();
// write out the bytes received
byte[] bs = myLine.getBytes();
StringBuffer s = new StringBuffer();
for(int i=0; i<bs.length; i++) {
s.append(Integer.toHexString((int)bs[i]) + ",");
}
System.out.println("string value: "+myLine);
System.out.println("hex value: "+s.toString());
Best regards
According to my copy of the Unicode characters, LATIN CAPITAL LETTER Z WITH ACUTE is U+0179. If you encode that using Codepage 1250, it's supposed to be converted to 8F, I suppose? You may or may not be using Codepage 1250: this line of code
byte[] bs = myLine.getBytes();
encodes it using your system's default charset. You can pass a charset name to the getBytes() method if you want to specify the charset it should use. So according to this document:
http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html
you could try
byte[] bs = myLine.getBytes("Cp1250");
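Here is a rough sketch of that idea, in case it helps to see the bytes side by side. The class name HexDump and the helper toHex are invented for this example. It also masks each byte with 0xFF before formatting, because a Java byte is signed and Integer.toHexString on a negative value prints eight hex digits (ffffff8f rather than 8f). The US-ASCII line is only there to show what happens when a charset has no mapping for a character.

```java
import java.nio.charset.Charset;

public class HexDump {
    // Hypothetical helper: encodes the text with the named charset and
    // returns the resulting bytes as comma-separated two-digit hex values.
    static String toHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            // Mask to 0..255 so negative bytes are not sign-extended.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // L WITH STROKE (U+0141) followed by Z WITH ACUTE (U+0179).
        String text = "\u0141\u0179";
        System.out.println(toHex(text, "Cp1250"));   // prints "A3,8F"
        // A charset that cannot represent these characters substitutes '?'.
        System.out.println(toHex(text, "US-ASCII")); // prints "3F,3F"
    }
}
```

Seeing 3F in the output is therefore a hint that the charset used for getBytes() cannot represent the character, not necessarily that the file was read incorrectly.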