Cannot read foreign characters!!!
Requirement: Read a text file containing foreign characters and print it or store it in a string buffer.
Problem: It doesn't work; an exception is thrown whenever Reader.read() or BufferedReader.readLine()
is invoked.
How do I get around this?
Here is a sample of text from the file:
Angende utleggelse av kopierte lydfiler i CVS8 format via internettf fjernet p denne e-mail ndt tilbemerke
.........................................................................
Here is the sample program:
import java.io.*;
public class LoadUTF8 {
    private static String _FILE = null;
    public static void main(String args[]) {
        _FILE = args[0];
        try {
            System.out.println(getUTF8());
        }
        catch(Exception e) {e.printStackTrace();}
        try {
            loadUTF8Data();
        }
        catch(Exception e) {e.printStackTrace();}
    }
    public static String getUTF8() {
        StringBuffer buffer = new StringBuffer();
        try {
            FileInputStream fis = new FileInputStream(_FILE);
            InputStreamReader isr = new InputStreamReader(fis, "UTF8");
            Reader in = new BufferedReader(isr);
            int ch;
            while ((ch = in.read()) > -1) {
                buffer.append((char)ch);
            }
            in.close();
            return buffer.toString();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
    public static void loadUTF8Data() {
        try
        {
            InputStream in = new FileInputStream(_FILE);
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF8"));
            String line = null;
            StringBuffer buffer = new StringBuffer();
            while ( ( line = reader.readLine() ) != null ) // Read line until EOF
            {
                buffer.append( line + "\n" );
            }
            System.out.println("" + buffer.toString() + "");
        }
        catch(Exception e) {
            e.printStackTrace();
        }
    }
}
/*
Error Message:
sun.io.MalformedInputException
at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:110)
at java.io.InputStreamReader.convertInto(InputStreamReader.java:137)
at java.io.InputStreamReader.fill(InputStreamReader.java:186)
at java.io.InputStreamReader.read(InputStreamReader.java:249)
at java.io.BufferedReader.fill(BufferedReader.java:139)
at java.io.BufferedReader.read(BufferedReader.java:157)
at mtrack.cda.utils.LoadUTF8.getUTF8(LoadUTF8.java:25)
at mtrack.cda.utils.LoadUTF8.main(LoadUTF8.java:15)
NB: It works fine with normal text files that don't contain any foreign characters!
If either getUTF8() or loadUTF8Data() works, I will be very happy!!! :-)
*/
[2760 byte] By [pratimdas] at [2007-9-26 3:39:20]

Hi Pratim,
Your code is correct and can be adapted to read any foreign characters, as long as you create your InputStreamReader with the same encoding that was used to write the file you are trying to read. The MalformedInputException suggests that your file is probably not UTF-8 at all. When no encoding is specified, Java uses the platform default, which on Western European Windows systems is Cp1252. Cp1252 is a single-byte encoding: bytes 0x00 to 0x7F are the ASCII characters, and bytes 0x80 to 0xFF form an extended region. This extended region contains, among other characters, the extra characters used by languages like Spanish, German etc. To read such a file, change your code to:
InputStreamReader isr = new InputStreamReader(fis, "Cp1252");
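As a rough sketch of that change (not your exact program), the encoding name can be made a parameter so that you can try several encodings against the same file. The class name ReadWithCharset and the method readAll are hypothetical names chosen for illustration, and the try-with-resources syntax assumes Java 7 or later:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class ReadWithCharset {

    // Reads the whole file, decoding its bytes with the named charset.
    // readAll is a hypothetical helper, not part of the original program.
    public static String readAll(String path, String charsetName) throws IOException {
        StringBuilder buffer = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), Charset.forName(charsetName)))) {
            int ch;
            while ((ch = in.read()) != -1) {
                buffer.append((char) ch);
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java ReadWithCharset <file> [encoding]");
            return;
        }
        // The encoding defaults to Cp1252 when no second argument is given.
        String charsetName = args.length > 1 ? args[1] : "Cp1252";
        System.out.println(readAll(args[0], charsetName));
    }
}
```

Running it once with Cp1252 and once with UTF8 on the same file should show quickly which encoding the file was actually written in.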
However, if one wants to encode characters that Cp1252 does not cover (for example, Chinese, Hindi, Russian etc.) then one cannot use the Cp1252 encoding. Frequently, such characters are encoded using UTF-8 (8-bit Unicode Transformation Format), which Java also accepts under the name UTF8. UTF-8 is capable of encoding the entire Unicode character set. Java supports many different encodings. The point is that you must use the same encoding to read back the characters as was used to store them.
I suggest that you experiment with writing characters to files and reading them back into Java using different encodings. You already have the code to read the characters. To write the files, you can either use Microsoft Word or write code analogous to your code above, using an OutputStreamWriter instead. It will also help if you have some way of examining the bytes in your output files.
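A small sketch of such an experiment, done in memory rather than through files: it encodes the same two-character string in Cp1252 and in UTF-8, prints the bytes in hex, and checks whether each byte sequence is valid UTF-8. The class EncodingExperiment and its methods hex and decodesCleanly are hypothetical names for illustration only, and the strict decoder is just one way to detect a mismatch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingExperiment {

    // Formats bytes as space-separated uppercase hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns true if the bytes decode in the given charset without any
    // malformed or unmappable input; a strict decoder reports errors
    // instead of silently substituting replacement characters.
    public static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "p\u00e5"; // the letter p followed by a-with-ring
        byte[] asCp1252 = text.getBytes(Charset.forName("Cp1252"));
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("Cp1252 bytes: " + hex(asCp1252));
        System.out.println("UTF-8 bytes:  " + hex(asUtf8));
        System.out.println("Cp1252 bytes valid as UTF-8? "
                + decodesCleanly(asCp1252, StandardCharsets.UTF_8));
        System.out.println("UTF-8 bytes valid as UTF-8?  "
                + decodesCleanly(asUtf8, StandardCharsets.UTF_8));
    }
}
```

The Cp1252 form is two bytes (70 E5) and the UTF-8 form is three (70 C3 A5). The lone E5 byte is not valid UTF-8, which is the same kind of mismatch that produces a malformed-input error when a Cp1252 file is read as UTF8.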
Regards,
Joe
joefk at 2007-6-29 12:13:53

Joe,
I am about to bring a very serious problem to you; it seems you have done plenty of work in internationalization:
I am running a J2EE application on JRun 3.1. Basically, the JSP page has a text area that contains foreign characters (Norwegian, German etc.). It sends the data to a servlet, and the servlet gets it as a String using getParameter().
The problem is that when I call getParameter(paramName), the foreign characters are all replaced by '?' marks. I really don't know what to do, but this is what I tried:
1. I used
<%
response.setContentType("text/html;charset=ISO8859_1");
%> in the JSP page, but I have a debug message that prints out the character encoding as received by the servlet. It comes out as null, and it doesn't work.
2. I used
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO8859_1"> This didn't work either.
3. I also used request.setCharacterEncoding("ISO8859_1") in the servlet, but I had to recompile my code against Servlet API 2.3. The code deploys to JRun without a problem, but at runtime it fails with a 500 Internal Server Error (no such method), as JRun 3.1 supports only Servlet 2.2.
So, any other suggestions? What can I do using Servlet 2.2?
Regards
Pratim
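
One workaround that is sometimes used when request.setCharacterEncoding() is not available is to re-decode the parameter string by hand. The sketch below assumes the container decoded the request bytes as ISO-8859-1 and that the browser actually sent them in another encoding such as UTF-8; neither assumption has been checked against JRun 3.1. It cannot help if the characters have already been replaced by '?' before getParameter() returns. ParameterRedecoder and redecode are hypothetical names, and the main method only simulates the container so the example runs without the servlet API:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParameterRedecoder {

    // Recovers text that was decoded as ISO-8859-1 although the underlying
    // bytes were really in actualCharsetName. ISO-8859-1 maps every byte to
    // exactly one char, so getBytes() restores the original bytes unchanged.
    public static String redecode(String raw, String actualCharsetName) {
        if (raw == null) {
            return null;
        }
        byte[] originalBytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, Charset.forName(actualCharsetName));
    }

    public static void main(String[] args) {
        // Simulate a browser sending UTF-8 bytes and a container that
        // decodes them as ISO-8859-1 (an assumption for this sketch).
        byte[] sent = "p\u00e5".getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(sent, StandardCharsets.ISO_8859_1);

        String fixed = redecode(misdecoded, "UTF-8");
        System.out.println(fixed.equals("p\u00e5")); // prints true
    }
}
```

In a servlet this would be applied to the value returned by getParameter(paramName), with the second argument set to whatever encoding the page was really submitted in.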