New To Java - Need to read Unicode in a file

Hi,

I need to read Unicode from a file (on a Windows box) because my software is used in different countries and, naturally, with different keyboards. Not all of the users are computer literate but, like me, they are all lazy and want to put their username and password in a config file that my application reads. If their username or password contains Unicode characters, I have a problem reading it.

They are simple users, so I would like to advise them to open the config file in Windows Notepad, type in their username and password, and save the file as Unicode. Notepad has four ways to save a file: ANSI, Unicode, Unicode big endian, and UTF-8 (I've tried them all except ANSI, of course). Saving a file in a different format is as complicated as I would like it to get for them; some will have trouble even with this.

I read the file like so:

BufferedReader rdr =
    new BufferedReader(
        new InputStreamReader(new FileInputStream(file_name), "UTF-16"));
String line;
while ((line = rdr.readLine()) != null) {
    String[] pieces = line.split("[=:]");
    if (pieces.length == 2) {
        if (pieces[0].equals("PASSWORD")) {
            byte[] possibleUnicode = pieces[1].getBytes("various encodings");
            pieces[1] = new String(possibleUnicode, "various encodings");
        }
        propertyTable.setProperty(pieces[0], pieces[1]);
    }
}

Everything reads correctly except a username or password that contains a real multi-byte character. I have tried many variations of converting the string I get into a byte[] using string_in.getBytes("various encodings tried") and then back to a string, but nothing has worked.

I tried a regular FileReader wrapped in a BufferedReader and that didn't work. I tried a FileInputStream wrapped in a DataInputStream and that didn't work. I accomplished the most with what I described above: FileInputStream to InputStreamReader to BufferedReader.

Does anyone know how to read Unicode in a file on a Windows file system?

hopi

[2736 byte] By [hopia] at [2007-11-26 23:26:08]
# 1

You can't change the encoding of a String. If you read the file using the correct encoding, you shouldn't have to do anything else. If you read it with the wrong encoding, the resulting string will be corrupt, and there's no way to fix it. Specifically, the getBytes() methods and the String constructors have no place in your program; do yourself a favor and try to forget they even exist. Tell your users to save the file as UTF-8, and specify UTF-8 when you create the InputStreamReader, and you should be fine.
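One detail worth checking when the file comes from Notepad: its "Unicode", "Unicode big endian", and UTF-8 formats all start the file with a byte-order mark (BOM), and Java's UTF-8 decoder passes that mark through as the character U+FEFF instead of removing it. Below is a minimal sketch of decoding once, with the charset chosen from the BOM. The ConfigReader class and its method names are hypothetical, UTF-8 is assumed when there is no BOM (so an ANSI file with non-ASCII characters would still come out wrong), and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the original program.
class ConfigReader {

    /** Chooses a charset from a leading byte-order mark; assumes UTF-8 if there is none. */
    static Charset detectCharset(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;  // Notepad "UTF-8"
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;     // Notepad "Unicode"
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;     // Notepad "Unicode big endian"
        return StandardCharsets.UTF_8;                                          // assumption: no BOM means UTF-8
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Decodes the whole file once, drops the BOM character, and splits KEY=value lines. */
    static Properties parse(byte[] data) {
        String text = new String(data, detectCharset(data));
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);  // the decoded BOM would otherwise stick to the first key
        }
        Properties props = new Properties();
        for (String line : text.split("\r?\n")) {
            // Limit of 2 keeps any '=' or ':' inside the value (a change from the original split).
            String[] pieces = line.split("[=:]", 2);
            if (pieces.length == 2) {
                props.setProperty(pieces[0], pieces[1]);
            }
        }
        return props;
    }
}
```

A config file is small enough to read in one go, so a call could look like ConfigReader.parse(Files.readAllBytes(Paths.get(file_name))). No getBytes() or String re-encoding is involved after the single decode.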

uncle_alicea at 2007-7-10 14:33:56 > top of Java-index,Java Essentials,New To Java...
# 2

Thank you for responding but I still have no better output.

You can't change the encoding of a String. If you
read the file using the correct encoding, you
shouldn't have to do anything else. If you read it
with the wrong encoding, the resulting string
will be corrupt, and there's no way to fix it.

I have used the byte conversion technique before successfully when I loaded a set of properties from a URL openStream(). The properties load() method takes an InputStream and assumes ISO-8859-1 so I converted the bytes from ISO-8859-1 to UTF-8. Garbage characters were cleared up perfectly.
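For reference, the conversion described here can be sketched as follows. It relies on Properties.load(InputStream) decoding its input as ISO-8859-1, and on Java's ISO-8859-1 charset mapping every byte to the character with the same code value, so encoding the loaded string back to ISO-8859-1 recovers the original bytes. The class and method names are hypothetical, and the sketch assumes the source bytes really are UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper for illustration only.
class PropertiesRedecode {

    /** Recovers the original bytes of an ISO-8859-1-decoded value and decodes them as UTF-8. */
    static String redecode(String loadedValue) {
        byte[] originalBytes = loadedValue.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    /** Loads UTF-8 property data through Properties.load and repairs each value afterwards. */
    static Properties loadUtf8(byte[] utf8Data) {
        Properties raw = new Properties();
        try {
            raw.load(new ByteArrayInputStream(utf8Data));  // decodes as ISO-8859-1
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Properties fixed = new Properties();
        for (String key : raw.stringPropertyNames()) {
            fixed.setProperty(key, redecode(raw.getProperty(key)));
        }
        return fixed;
    }
}
```

This only makes sense when the wrong decoding was ISO-8859-1 and the string has not been altered since. It does not carry over to text read through an InputStreamReader with some other charset, where the first decode may already have replaced bytes it could not map.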

Specifically, the getBytes() methods and the String
constructors have no place in your program; do
yourself a favor and try to forget they even exist.
Tell your users to save the file as UTF-8, and
specify UTF-8 when you create the InputStreamReader,
and you should be fine.

I have repeated my steps and still cannot read the Unicode characters.

1) the file is saved by Notepad as UTF-8

2) the InputStreamReader is reading UTF-8

Results: All text is correct except for the Unicode characters.

I surely do appreciate the suggestion and follow up if you have one.

hopi

hopia at 2007-7-10 14:33:56
# 3

I have used the byte conversion technique before
successfully when I loaded a set of properties
from a URL openStream(). The properties load()
method takes an InputStream and assumes ISO-8859-1
so I converted the bytes from ISO-8859-1 to UTF-8.
Garbage characters were cleared up perfectly.

I think you just got lucky that time. For characters up to U+007F, the UTF-8 encoding is the same as ISO-8859-1 (and most other encodings, for that matter). Characters in the range U+0080 to U+00FF will be encoded with one byte in ISO-8859-1, and with two bytes in UTF-8. In most cases, each of the two bytes in the UTF-8 representation will have values that are valid in ISO-8859-1. The decoded characters will be incorrect (and there will be too many of them), but they effectively preserve the original byte values, making it possible for you to re-encode the characters and then decode them correctly. But there's a big gap in the middle where the UTF-8 bytes produce garbage when decoded as ISO-8859-1. Run the included program to see what I mean.

I don't know what's going wrong with your application, but I do know that changing the encoding retroactively is not the solution. I also think you're right to be wary of asking users to save files in a certain encoding. Considering how much trouble programmers have with this stuff, it's definitely too much to ask of users.

import java.awt.Font;
import javax.swing.*;

public class Test
{
    public static void main(String... args) throws Exception
    {
        JTextArea ta = new JTextArea();
        ta.setFont(new Font("monospaced", Font.PLAIN, 14));
        JFrame frame = new JFrame();
        frame.add(new JScrollPane(ta));
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

        // Build a string holding every character from U+00A0 to U+00FF.
        StringBuilder sb = new StringBuilder();
        for (int i = 0xA0; i <= 0xFF; i++)
        {
            sb.append((char)i);
        }
        String str1 = sb.toString();

        // Encode as UTF-8 (two bytes per character in this range),
        // then decode those bytes with the wrong charset.
        byte[] utfBytes = str1.getBytes("UTF-8");
        String str2 = new String(utfBytes, "ISO-8859-1");

        // One row per character: the character, its two UTF-8 bytes in
        // binary and hex, and the two characters the wrong decoding produced.
        for (int i = 0, j = 0; i < str1.length(); i++, j += 2)
        {
            char ch = str1.charAt(i);
            byte b1 = utfBytes[j];
            byte b2 = utfBytes[j+1];
            String s1 = Integer.toBinaryString(b1 & 0xFF);
            String s2 = Integer.toBinaryString(b2 & 0xFF);
            char ch1 = str2.charAt(j);
            char ch2 = str2.charAt(j+1);
            ta.append(String.format("%2c%10s%10s%3x%3x%3c%3c\n",
                                    ch, s1, s2, b1, b2, ch1, ch2));
        }

        frame.setSize(400, 700);
        frame.setLocationRelativeTo(null);
        frame.setVisible(true);
    }
}
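
For anyone who would rather not open a window, here is a console-sized sketch of the same experiment for a single character. The MisdecodeDemo class and its methods are hypothetical names; the only facts it leans on are that UTF-8 uses one byte up to U+007F and two bytes from U+0080 to U+07FF, and that Character.isISOControl reports the range U+0080 to U+009F as control characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class MisdecodeDemo {

    /** Encodes s as UTF-8 and decodes the resulting bytes as ISO-8859-1. */
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Lists the code points produced by mis-decoding one character, flagging control characters. */
    static String describe(char ch) {
        String wrong = misdecode(String.valueOf(ch));
        StringBuilder sb = new StringBuilder(String.format("U+%04X ->", (int) ch));
        for (char c : wrong.toCharArray()) {
            sb.append(String.format(" U+%04X%s", (int) c,
                    Character.isISOControl(c) ? " (control)" : ""));
        }
        return sb.toString();
    }
}
```

describe('A') gives "U+0041 -> U+0041", describe('\u00e9') gives "U+00E9 -> U+00C3 U+00A9", and describe('\u00c0') gives "U+00C0 -> U+00C3 U+0080 (control)". For U+00C0 through U+00DF the second UTF-8 byte falls between 0x80 and 0x9F, which ISO-8859-1 decodes to control characters with no visible glyph.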

uncle_alicea at 2007-7-10 14:33:56