Cannot read foreign characters!!!

Requirement: Read a text file containing foreign characters and print it or store it in a StringBuffer.

Problem: It doesn't work; an exception is thrown whenever Reader.read() or BufferedReader.readLine() is invoked.

How do I get around this?

Here is a sample text from the file...

Angende utleggelse av kopierte lydfiler i CVS8 format via internettf fjernet p denne e-mail ndt tilbemerke

.........................................................................

Here is the sample program:

import java.io.*;

public class LoadUTF8 {

    // Path of the file to read, taken from the first command-line argument.
    private static String _FILE = null;

    public static void main(String args[]) {
        _FILE = args[0];
        try {
            System.out.println(getUTF8());
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            loadUTF8Data();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Reads the file one character at a time, decoding it as UTF8.
    public static String getUTF8() {
        StringBuffer buffer = new StringBuffer();
        try {
            FileInputStream fis = new FileInputStream(_FILE);
            InputStreamReader isr = new InputStreamReader(fis, "UTF8");
            Reader in = new BufferedReader(isr);
            int ch;
            while ((ch = in.read()) > -1) {
                buffer.append((char) ch);
            }
            in.close();
            return buffer.toString();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }

    // Reads the file line by line, decoding it as UTF8, and prints the result.
    public static void loadUTF8Data() {
        try {
            InputStream in = new FileInputStream(_FILE);
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF8"));
            String line = null;
            StringBuffer buffer = new StringBuffer();
            while ((line = reader.readLine()) != null) { // Read lines until EOF
                buffer.append(line + "\n");
            }
            System.out.println("" + buffer.toString() + "");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

/*

Error Message:

sun.io.MalformedInputException
    at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:110)
    at java.io.InputStreamReader.convertInto(InputStreamReader.java:137)
    at java.io.InputStreamReader.fill(InputStreamReader.java:186)
    at java.io.InputStreamReader.read(InputStreamReader.java:249)
    at java.io.BufferedReader.fill(BufferedReader.java:139)
    at java.io.BufferedReader.read(BufferedReader.java:157)
    at mtrack.cda.utils.LoadUTF8.getUTF8(LoadUTF8.java:25)
    at mtrack.cda.utils.LoadUTF8.main(LoadUTF8.java:15)

NB: It works fine with normal text files that don't contain any foreign characters!

If either getUTF8() or loadUTF8Data() works, I will be very happy!!! :-)

*/

[2760 byte] By [pratimdas] at [2007-9-26 3:39:20]
# 1

Hi,

I didn't try out your actual code, but the most likely explanation for the exception is that the file you are reading is not encoded in UTF8. ASCII characters are encoded identically in UTF8 and Cp1252, but the single Cp1252 bytes that represent international characters (byte values 0x80 through 0xff) are generally not valid UTF8 sequences, so reading them as UTF8 produces exactly the exception you are seeing.
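The point about Cp1252 bytes not being valid UTF8 can be checked directly. The following is a minimal sketch, not code from the thread: it asks a strict UTF-8 decoder whether a byte sequence is well formed. The class name Utf8Check, the sample word "Angående", and the use of Java 7 or later are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
final class Utf8Check {

    /** Returns true if the bytes are well-formed UTF-8, false if the decoder rejects them. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed sample word containing one non-ASCII letter (U+00E5).
        String word = "Ang\u00e5ende";
        byte[] asCp1252 = word.getBytes(Charset.forName("windows-1252"));
        byte[] asUtf8 = word.getBytes(StandardCharsets.UTF_8);
        System.out.println("Cp1252 bytes valid as UTF-8: " + isValidUtf8(asCp1252));
        System.out.println("UTF-8 bytes valid as UTF-8:  " + isValidUtf8(asUtf8));
    }
}
```

In Cp1252 the letter is the single byte 0xE5, which UTF-8 treats as the start of a three-byte sequence; the ASCII byte that follows is not a continuation byte, so the strict decoder reports malformed input.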

Post back if you need more clarification.

Regards,

Joe

joefk at 2007-6-29 12:13:53 > top of Java-index,Desktop,I18N...
# 2
Thanks, I understand what you are saying! But could you recommend an alternative way to read text from a file containing foreign characters and store it in a StringBuffer?

Cheers

Pratim
pratimdas at 2007-6-29 12:13:53
# 3

Hi Pratim,

Your code is correct and can be adapted to read any foreign characters, as long as you create your InputStreamReader with the same encoding that was used to write the file you are trying to read. Java uses the platform's default encoding unless told otherwise, and on a Western-language Windows system that default is Cp1252. Cp1252 can be used to encode ASCII characters (\u0000->\u007f) plus an extended region (\u0080->\u00ff). This extended region contains, among other characters, the extra characters used by languages like Spanish, German etc. To read such a file, change your code to:

InputStreamReader isr = new InputStreamReader(fis, "Cp1252");

However, if one wants to encode characters beyond \u007f (for example, Chinese, Hindi, Russian etc.) then one cannot use Cp1252 encoding. Frequently, such characters are encoded using UTF8 (Unicode Transformation Format 8). UTF8 is capable of encoding the entire Unicode character set. Java supports many different encodings. The point is that you must use the same encoding to read the characters back as was used to store them.
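The rule that the reading encoding must match the writing encoding can be tried out with a small round trip. This is only a sketch under assumptions: the class EncodingRoundTrip, its method names, the temporary file, and the sample word are hypothetical, and it relies on Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, named here only for illustration.
final class EncodingRoundTrip {

    /** Writes the text to the file using the given encoding. */
    static void write(Path file, String text, Charset encoding) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(file), encoding)) {
            out.write(text);
        }
    }

    /** Reads the whole file into a string, decoding it with the given encoding. */
    static String read(Path file, Charset encoding) throws IOException {
        StringBuilder buffer = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), encoding))) {
            int ch;
            while ((ch = in.read()) != -1) {
                buffer.append((char) ch);
            }
        }
        return buffer.toString();
    }

    /** Writes text to a temporary file with one encoding and reads it back with another. */
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                write(file, text, writeAs);
                return read(file, readAs);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String text = "Ang\u00e5ende"; // assumed sample word
        // Same encoding on both sides: the text survives.
        System.out.println(text.equals(roundTrip(text, cp1252, cp1252)));
        // Written as Cp1252 but read as UTF-8: the text does not survive.
        System.out.println(text.equals(roundTrip(text, cp1252, StandardCharsets.UTF_8)));
    }
}
```

Note that on current JVMs an InputStreamReader substitutes a replacement character for malformed input instead of throwing the exception shown at the top of the thread, so a mismatch shows up as damaged text, not as a stack trace.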

I suggest that you experiment with writing characters to files and reading them back into Java using different encodings. You already have the code to read the characters. To write the files, you can either use Microsoft Word or write code analogous to your code above, using an OutputStreamWriter instead of an InputStreamReader. It will also help if you have some way of examining the bytes in your output files.
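One way to run that experiment without a hex editor is to push text through an OutputStreamWriter into memory and print the resulting bytes. The sketch below assumes Java 7 or later; the class HexDump and the method encodeToHex are hypothetical names, not part of any library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
final class HexDump {

    /** Encodes the text through an OutputStreamWriter and returns the bytes as hex pairs. */
    static String encodeToHex(String text, Charset encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, encoding)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String letter = "\u00e5"; // assumed sample character, U+00E5
        System.out.println("Cp1252: " + encodeToHex(letter, Charset.forName("windows-1252")));
        System.out.println("UTF-8:  " + encodeToHex(letter, StandardCharsets.UTF_8));
    }
}
```

For the assumed character U+00E5 this prints e5 for Cp1252 and c3 a5 for UTF-8, which shows that the same character is stored as different bytes depending on the encoding.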

Regards,

Joe

joefk at 2007-6-29 12:13:53
# 4
Error correction: in my previous message, the second paragraph should begin with the statement:

"However, if one wants to encode characters beyond \u00ff (for example, Chinese, Hindi, Russian etc.)"

Sorry,

Joe
joefk at 2007-6-29 12:13:53
# 5

Hi Joe,

I tried using the Cp1252 encoding, and this time it worked, but it converted some characters in its own way. Angende was converted into Angσende (maybe you won't be able to see the distinction, as the forum's database might have the same problem)!

But since you pointed out that UTF8 is the superset of all these, could you tell me why it does not work with UTF8 but does work with Cp1252?

Thanks in advance!

Pratim

pratimdas at 2007-6-29 12:13:53
# 6

Hi Pratim,

1)>>Angende was converted into Angσende<<

Most likely your string was correctly formed but your standard output (a DOS box?) is incapable of displaying the non-ASCII character.

2)>>But as you pointed out that, UTF8 is the superset of all these<<

Superset of what? Where did you get this idea from?
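Regarding point 1, a way to tell whether the string itself is correct, independent of what the console can display, is to print non-ASCII characters as numeric escapes. This is a minimal sketch; the class EscapeNonAscii, the sample bytes, and the guess that the console uses DOS code page 437 (IBM437) are assumptions, not facts stated in the thread.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named here only for illustration.
final class EscapeNonAscii {

    /** Returns the text with every non-ASCII char replaced by a backslash-u hex escape. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed file content: ASCII letters plus the single byte 0xE5.
        byte[] bytes = {'A', 'n', 'g', (byte) 0xE5, 'e', 'n', 'd', 'e'};

        // What the program holds after decoding the bytes as Cp1252.
        System.out.println(escape(new String(bytes, Charset.forName("windows-1252"))));

        // What a console would show if it interpreted the same bytes as code page 437.
        if (Charset.isSupported("IBM437")) {
            System.out.println(escape(new String(bytes, Charset.forName("IBM437"))));
        }
    }
}
```

With the assumed bytes, the first line printed is Ang\u00e5ende, so the decoded string holds the intended letter. Where the JVM supports IBM437, the second line is Ang\u03c3ende (U+03C3 is σ), which would be consistent with the σ reported above, though that is an inference and not something the thread confirms.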

Regards,

Joe

joefk at 2007-6-29 12:13:53
# 7

Joe,

I am about to bring a very serious problem to you, as it seems you have done plenty of work in internationalization:

Well, I am running a J2EE application with JRun 3.1. Basically, the JSP page has a text area that contains foreign characters (Norwegian, German etc). It sends the data to a servlet, and the servlet gets it as a String using getParameter().

The problem is, when I call getParameter(paramName), the foreign characters are all replaced by '?' marks. I really don't know what to do, but what I did was...

1. I used

<%

response.setContentType("text/html;charset=ISO8859_1");

%> in the JSP page, but I have a debug message that prints out the character encoding as received by the servlet. It comes out as null, and it doesn't work.

2. I used

<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO8859_1"> but this didn't work either.

3. Also, I used request.setCharacterEncoding("ISO8859_1") in the servlet, but I had to recompile my code using servlet API 2.3. When I deploy the code in JRun there is no problem, but at runtime it fails with 500 Internal Server Error - no such method, as JRun 3.1 supports only servlets 2.2.

So, any other suggestions? What can I do using servlets 2.2?

Regards

Pratim

pratimdas at 2007-6-29 12:13:53
# 8

Hi Pratim,

I'd like to help but I have never used JSP or servlets and this looks like an application where you have to be very careful about the tags.

Sorry,

Joe

P.S. I was going to recommend that you go over to the JavaServer Pages[tm] forum, but I see that there are more questions than answers over there related to characters and JSP.
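For the servlet 2.2 question, one workaround that is often tried when request.setCharacterEncoding() is unavailable is to re-decode the parameter by hand. The sketch below is an assumption-laden illustration, not a confirmed fix for JRun 3.1: it only helps if the container decoded the raw request bytes as ISO-8859-1 (a lossless mapping) and the browser actually sent them in a known encoding. If the characters have already been replaced by literal '?' marks, nothing can recover them. The class ParameterRedecoder and the method redecode are hypothetical names, and the code uses Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
final class ParameterRedecoder {

    /**
     * Recovers the original bytes of a value that was decoded as ISO-8859-1
     * and decodes them again with the encoding the browser is assumed to have used.
     */
    static String redecode(String misdecoded, Charset actualEncoding) {
        if (misdecoded == null) {
            return null;
        }
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actualEncoding);
    }

    public static void main(String[] args) {
        // Simulate a browser sending the assumed sample word as UTF-8 bytes.
        String original = "Ang\u00e5ende";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // Simulate a container that decodes every parameter as ISO-8859-1.
        String asSeenByServlet = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(asSeenByServlet));
        System.out.println(original.equals(redecode(asSeenByServlet, StandardCharsets.UTF_8)));
    }
}
```

In a servlet, the value passed to redecode would be the result of getParameter(paramName), and the second argument would be whatever encoding the form page was served in.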

joefk at 2007-6-29 12:13:53