Problem reading Scandinavian characters (with InputStreamReader)

Hi,

My problem is as follows.

My application reads an .html file from a URL specified by the user, removes all the tags and displays the plain text in a JTextArea. Most of the time everything works out all right, and everybody's happy. However, if the .html page happens to contain characters such as ä (a with dots) or ö (o with dots), we run into trouble.

All the Scandinavian characters are displayed as &auml; or &ouml; instead of the actual characters. What might be the problem?

Here's the method I use to read the URL :

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

void ReadURL(String fName) {
    try {
        StringBuffer sBuffer = new StringBuffer();
        String line;

        URLConnection uConn = new URL(fName).openConnection();
        uConn.connect();

        // No charset is given here, so the platform default encoding is used.
        InputStreamReader in = new InputStreamReader(uConn.getInputStream());
        System.out.println(in.getEncoding());

        // Read the whole page line by line into sBuffer.
        BufferedReader buffer = new BufferedReader(in);
        while ((line = buffer.readLine()) != null) {
            sBuffer.append(line).append("\n");
        }
        buffer.close();

        // parser and textScreen are defined elsewhere in the class.
        String text = parser.parseString(sBuffer.toString());
        textScreen.setText(text);
    } catch (IOException e) {
        System.out.println("Error -- " + e.toString());
    }
}

--

I'm guessing the problem is with the InputStreamReader, as reading normal text files with the FileReader works all right. However, I have no clue how to fix the problem.

Any ideas?

Thank you for any help,

Ossi

[1614 byte] By [oNiiranen] at [2007-9-26 7:12:59]
# 1

"&auml;" is an entity code for ä, and that is what the source actually contains (it has nothing to do with Java); there's nothing you can do but parse them yourself. It's pretty straightforward: the code always starts with an & and ends with a semicolon (;) or white space (that's invalid HTML, but it's what some sites have). You'll find the complete list at http://www.w3.org/TR/REC-html40/sgml/entities.html

You'll also have to look for things like &#XXX; - for example, &#233; is an entity code for é (233 is the Unicode value of é in decimal).
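
The parsing described above could be sketched roughly as follows. This is only a minimal illustration, not the code the original poster ended up with: the class name EntityDecoder, its decode method and the small table of named entities are all assumptions made for the example, and a real application would need the full entity list from the W3C page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the entity table are illustrative only.
class EntityDecoder {

    // A small subset of the named entities (see the W3C list for the rest).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("auml", "\u00E4");  // a with dots
        NAMED.put("ouml", "\u00F6");  // o with dots
        NAMED.put("aring", "\u00E5"); // a with ring
        NAMED.put("Auml", "\u00C4");
        NAMED.put("Ouml", "\u00D6");
        NAMED.put("Aring", "\u00C5");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches &name; and &#123; and &#x7B;. The trailing semicolon is optional
    // because some pages leave it out.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);?");

    // Replaces every recognised entity in the input with the character it stands for.
    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: leave the text untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character(s), or null if the entity is not recognised.
    private static String resolve(String body) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                int codePoint = Integer.parseInt(body.substring(2), 16);
                return new String(Character.toChars(codePoint));
            }
            if (body.startsWith("#")) {
                int codePoint = Integer.parseInt(body.substring(1));
                return new String(Character.toChars(codePoint));
            }
        } catch (IllegalArgumentException e) {
            return null; // number too large or not a valid Unicode code point
        }
        return NAMED.get(body);
    }
}
```

With this sketch, EntityDecoder.decode("p&auml;iv&auml;&auml;") returns "päivää", and the numeric form &#233; comes back as é. Calling it on the text after the tags have been removed would be one way to fit it into the ReadURL method from the question.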

jsalonen at 2007-7-1 16:57:19 > top of Java-index,Archived Forums,New To Java Technology Archive...
# 2

That is not a "problem", that is the standard way to represent those characters in HTML. Your browser will display "&auml;" as "ä". If you are trying to interpret HTML you must do the same thing. Here is a page listing many HTML escape sequences:

http://wdvl.com/Quadzilla/Character_Set/

DrClap at 2007-7-1 16:57:19
# 3
Here's another great resource: http://www.cs.tut.fi/~jkorpela/webjulk/2.6.html (unfortunately it's in Finnish, but I think the OP will find it useful)
jsalonen at 2007-7-1 16:57:19
# 4
Thank you both. The app is working nicely, now that I've "seen the light". I haven't got that much experience in HTML, so I wasn't aware of the concept of entities. Thanks again, - Ossi
oNiiranen at 2007-7-1 16:57:19