JAVA ENCODING PROBLEM
Hi,
I have a problem detecting the encoding of a file's content at runtime.
To reproduce the problem, please follow these steps:
1. Create a text file with WordPad (Windows) containing just one or two lines,
for example:
JAVA PROBLEM
2. Save this file as test1.txt, selecting Text Document as the save type.
Then save the same data as test2.txt, this time selecting Unicode Text Document as the save type.
3. Now run the following program:
import java.io.*;

public class ReadFile
{
    public static void main(String[] args) throws Exception
    {
        // Alternative used in step 6: decode the file explicitly as UTF-16.
        //InputStreamReader aInputStreamReader = new InputStreamReader(new FileInputStream("test1.txt"), "UTF-16");

        // No charset argument here, so the platform default encoding is used.
        InputStreamReader aInputStreamReader = new InputStreamReader(new FileInputStream("test1.txt"));
        BufferedReader aReader = new BufferedReader(aInputStreamReader);
        String aStr;
        // Print the file line by line until end of file.
        while ((aStr = aReader.readLine()) != null)
            System.out.println(aStr);
        aReader.close();
    }
}
4. Run the program and look at its output. It will be:
JAVA PROBLEM
5. Now replace the file name with test2.txt, run the program again and look at its output. It will be something like:
JAVA PROBLEM
6. Now repeat the same test using the commented-out InputStreamReader (the first line of the main method).
You will see that reading test1.txt gives an exception, while test2.txt now prints properly as
JAVA PROBLEM. Similarly, if I use the UTF-8 encoding, the problem is reversed.
Is there any way I can detect which encoding the content of a file was saved in, or get the
file type, so that I can decide at runtime which constructor to use for InputStreamReader?
Or is there any other encoding I can choose that works for both?
I have tried both BufferedReader and InputStream, but I got the same problem.
kapil
[2024 byte] By [kapil_ji] at [2007-9-26 4:28:01]

Reading through, that looks pretty freaky.
I could really only put it down to the utterly horrendous excuse for Unicode support that most versions of Windows use. About the only one that even comes close to decent Unicode support is Windows 2000. I'd be guessing if I said you were running something akin to Win98, but I'd imagine you are.
Java also supports a couple of other encodings that may be worth a shot. They are single-byte encodings that cover only a small part of Unicode, and they are probably closer to what earlier Windows versions write by default.
Try "ISO-8859-1" and "US-ASCII" in place of "UTF-8" or "UTF-16". I can't say straight out whether it would work or not, but old Windows versions are likely stuck in something close to ISO-8859-1 (you'll often see it described by its alter ego "Latin-1"). AFAIK its 256 byte values map directly onto the first 256 Unicode code points. It is not a form of UTF-8, although both agree with US-ASCII on the first 128 characters.
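To see what swapping the encoding name actually does, here is a rough sketch that decodes the same bytes under several charset names. The class name DecodeComparison and the helper decode are made up for illustration, and the byte arrays only stand in for the contents of test1.txt and test2.txt (without any byte order mark), so treat it as an experiment rather than a fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original program.
class DecodeComparison {
    // Decode the given bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Stand-in for test1.txt: one byte per character.
        byte[] oneBytePerChar = "JAVA".getBytes(StandardCharsets.US_ASCII);
        // Stand-in for test2.txt: two bytes per character (little-endian, BOM omitted).
        byte[] twoBytesPerChar = "JAVA".getBytes(StandardCharsets.UTF_16LE);

        String[] names = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16LE"};
        for (String name : names) {
            // Report how many characters each decoding produces.
            System.out.println(name
                    + ": one-byte data -> " + decode(oneBytePerChar, name).length() + " chars"
                    + ", two-byte data -> " + decode(twoBytesPerChar, name).length() + " chars");
        }
    }
}
```

With the single-byte charsets, the two-byte data comes back with a NUL character after every letter, and with UTF-16LE the one-byte data is paired up into unrelated characters, so no one name in this list reads both kinds of file correctly.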
Let me know how it goes!
Martin Hughes
Hi Kapil,
1) Actually, the behavior you report is exactly what you would expect: presumably, when Microsoft says it is saving in Unicode, it means UTF-16. The default Microsoft text encoding is Cp1252, which is the same encoding that InputStreamReader normally uses on such a system if you construct it without an encoding argument (it falls back to the platform default).
2) In general, there is no way to know what encoding was used to store characters into a file. There are some particular cases where you can tell, for example when the file starts with a byte order mark (BOM), but nothing that works for every file.
3) In general, you must use exactly the encoding that stored the data in order to retrieve it correctly. The only exception I know about is that you can use UTF-8 decoding on a file that contains only characters from \u0000 to \u007f even if the file was encoded using ASCII or Cp1252. (This is one of the UTF-8 design features.)
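One of the particular cases where you can tell is a file that starts with a byte order mark (BOM), which Windows editors typically write when saving as Unicode. Below is a minimal sketch of that check, assuming only the UTF-8, UTF-16LE and UTF-16BE marks matter and that files without a mark should be read with a fallback charset you choose. The class name BomSniffer and the method open are invented for this example, and it is not a general encoding detector.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original program.
class BomSniffer {
    // Returns a reader whose charset is chosen from a leading BOM,
    // or the fallback charset when no known BOM is present.
    static BufferedReader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop up to 3 bytes.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        Charset cs = fallback;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            cs = StandardCharsets.UTF_8;      // EF BB BF
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;   // FF FE (UTF-32 is not considered here)
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;   // FE FF
            bomLength = 2;
        }
        // Push back everything that was read beyond the BOM itself.
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new BufferedReader(new InputStreamReader(pin, cs));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java BomSniffer test1.txt test2.txt
        for (String name : args) {
            try (BufferedReader reader =
                     open(new FileInputStream(name), Charset.defaultCharset())) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}
```

A file without a BOM still has to be read with whatever fallback you pass in, so this only narrows the problem; it does not remove the need to know the encoding in the general case.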
Regards,
Joe
joefk at 2007-6-29 17:38:10
