ISO-8859-1: %22 equals & # 3 4 ; ?

NOTE: In this topic, & # 3 4 ; should contain no spaces. When I remove the spaces, the forum automatically converts it to " , so that's why I added them.

Hey all,

Sorry for the 'difficult' topic title. I'm running into a problem when I try to parse HTML downloaded from imdb.com.

I have this code to search on IMDB:

URL imdbURL = new URL("http://us.imdb.com/Find?" + URLEncoder.encode(page1.getSearch(), "ISO-8859-1"));

Then I read the incoming HTML through this BufferedReader:

BufferedReader in = new BufferedReader(new InputStreamReader(imdbURL.openStream(), "ISO-8859-1"));

The problem is that when I search for "Mr. Bean", the title that comes back is: & # 3 4 ;Mr. Bean& # 3 4 ; (when you check the website, it displays as "Mr. Bean"). So, on the website, the & # 3 4 ; is translated into " , which is what I thought ISO-8859-1 was meant to do. But my InputStreamReader SHOULD do the same, but it doesn't.

When I search for: " , I can see that my encoding line encoded the " into %22. So this is probably the reason it doesn't convert & # 3 4 ; into ", because it will only convert %22 into ". Does anyone know how I can solve this problem?

Thanks in advance!!

[1416 byte] By [Nemesish3da] at [2007-11-27 6:20:14]
# 1

> But my InputStreamReader SHOULD do the same, but it doesn't.

InputStreamReader doesn't convert URL encoding ...
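
As a small illustration of that point, here is a sketch (the class and method names are invented for this example) that feeds ISO-8859-1 bytes through the same kind of InputStreamReader. The charset only maps bytes to characters, so the byte 0xE9 becomes é while the entity text stays exactly as it was sent:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original poster's program.
final class ReaderDemo {
    // Decodes the raw bytes as ISO-8859-1 and returns the first line of text.
    static String firstLine(byte[] bytes) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in ISO-8859-1; everything else is plain ASCII, including the entity text.
        byte[] page = {'&', '#', '3', '4', ';', 'c', 'a', 'f', (byte) 0xE9, '&', '#', '3', '4', ';'};
        // The printed line still starts and ends with the five characters & # 3 4 ; (no spaces).
        System.out.println(firstLine(page));
    }
}
```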

> When I search for: " , I can see that my encoding
> line encoded the " into %22. So this is probably the
> reason it doesn't convert & # 3 4 ; into ", because
> it will only convert %22 into ". Does anyone know how
> I can solve this problem?

& # 3 4 ; is an HTML entity (a numeric character reference), whereas %22 is URL encoding. Could you please explain which text you receive, where it comes from, and what you want it decoded into?

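To make the difference concrete, here is a minimal sketch using only java.net. The class and method names are made up for the example, and the Charset overloads assume Java 10 or later (older JDKs would use the String overloads from the original post). URL encoding and decoding only deal with %XX and + sequences, so an HTML entity passes through untouched:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original poster's program.
final class EncodingDemo {
    // URL-encodes a query value, as done when building the search URL.
    static String urlEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.ISO_8859_1);
    }

    // Reverses URL encoding only; it knows nothing about HTML entities.
    static String urlDecode(String s) {
        return URLDecoder.decode(s, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(urlEncode("\"Mr. Bean\""));       // prints %22Mr.+Bean%22
        System.out.println(urlDecode("%22Mr.+Bean%22"));     // prints "Mr. Bean"
        System.out.println(urlDecode("&#34;Mr. Bean&#34;")); // prints the input unchanged
    }
}
```

So the %22 in the request and the entity in the response are unrelated: the first is how the query is sent, the second is how the page author chose to write a quote in HTML.
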
quittea at 2007-7-12 17:35:17 > top of Java-index,Java Essentials,Java Programming...
# 2

Thanks for your reply,

Using the BufferedReader, I read the content line by line. At some point, I get to this line:

<b>Popular Titles</b> (Displaying 3 Results)<ol><li> <a href="/title/tt0096657/" onclick="set_args('tt0096657',1,1)">"Mr. Bean"</a> (1990)</li>

As you can see (or maybe not, because this forum parses it automatically), the title is & # 3 4 ;Mr. Bean& # 3 4 ; . This is also the String my JList displays when I put it in there. What I want, of course, is for that part of the String to be replaced with " . I could achieve that using:

sName = sName.replaceAll("& # 3 4 ;", "\"");

But there may be more 'codes' like this in the web page, so I want all the HTML entities to be parsed automatically.

Hopefully I made myself clear.

Regards,

Peter

Nemesish3da at 2007-7-12 17:35:17
# 3
I'd strongly suggest using an HTML parser, e.g. NekoHTML (http://people.apache.org/~andyc/neko/doc/html/) or JTidy (http://jtidy.sourceforge.net/).
quittea at 2007-7-12 17:35:17
# 4
Okay, thanks. I will try one of those.
Nemesish3da at 2007-7-12 17:35:17
# 5

I tried an HTML parser, but even that didn't convert the ISO-8859-1 characters to 'real' characters. So I created a function which does this:

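public static String parseString(String string) {
    // Replace every decimal reference from &#0; to &#255; with its character.
    // String.replace() takes its arguments literally; replaceAll() would treat
    // '$' (36) and '\' (92) in the replacement as special and throw.
    for (int iCounter = 0; iCounter < 256; iCounter++) {
        string = string.replace("&#" + iCounter + ";", Character.toString((char) iCounter));
    }
    return string;
}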

It's probably pretty processor-intensive, but my program only converts about 4 lines each time you search.
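
For comparison, the loop above could also be done in a single pass with a regular expression. This is only a sketch under my own assumptions: the class name NumericEntityDecoder is made up, it handles decimal and hexadecimal numeric references only, and named entities such as &amp; are deliberately left alone (an HTML parser is the better tool for those).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the original poster's program.
final class NumericEntityDecoder {
    // Group 1 captures hexadecimal digits (&#x22;), group 2 decimal digits (&#34;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // References outside the Unicode range are copied through unchanged.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement keeps '$' and '\' from being read as group syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#34;Mr. Bean&#34; (1990)")); // prints "Mr. Bean" (1990)
    }
}
```

Each line is scanned once instead of 256 times, and references above 255 are covered as well.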

Nemesish3da at 2007-7-12 17:35:17