ISO-8859-1: %22 equals & # 3 4 ; ?
NOTE: In this topic, & # 3 4 ; should contain no spaces. But when I remove the spaces, it automatically converts it to " , so that's why I put spaces.
Hey all,
Sorry for the 'difficult' topic title. I experience a problem when I try to parse HTML downloaded from imdb.com.
I have this code to search on IMDB:
URL imdbURL = new URL("http://us.imdb.com/Find?" + URLEncoder.encode(page1.getSearch(), "ISO-8859-1"));
Then I read the incoming HTML through this BufferedReader:
BufferedReader in = new BufferedReader(new InputStreamReader(imdbURL.openStream(), "ISO-8859-1"));
The problem is that when I search for "Mr. Bean", I get a title back that reads: & # 3 4 ;Mr. Bean& # 3 4 ; (when you check the website, it displays as "Mr. Bean"). So, on the website, the & # 3 4 ; is translated into " , just like ISO-8859-1 is meant to do. But my InputStreamReader SHOULD do the same, but it doesn't.
When I search for: " , I can see that my encoding line encoded the " into %22. So this is probably the reason it doesn't convert & # 3 4 ; into ", because it will only convert %22 into ". Does anyone know how I can solve this problem?
Thanks in advance!!
> But my InputStreamReader SHOULD do the same, but it doesn't.
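InputStreamReader doesn't convert URL encoding, and it doesn't convert HTML entities either. It only decodes the incoming bytes into characters using the charset you give it.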
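To make the distinction concrete, here is a minimal sketch. The class and method names are made up for illustration, and the "downloaded" page is simulated with a byte array instead of a real request to IMDB. It shows that URLEncoder/URLDecoder deal with %22, while an InputStreamReader hands an entity back exactly as it arrived:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodingLayers {

    // URL encoding: only relevant for the query string you send.
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    static String urlDecode(String s) {
        try {
            return URLDecoder.decode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Charset decoding: bytes -> chars, which is all InputStreamReader does.
    static String readLatin1Line(byte[] bytes) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), "ISO-8859-1"))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(urlEncode("\"Mr. Bean\""));   // prints %22Mr.+Bean%22
        System.out.println(urlDecode("%22"));            // prints "

        // Simulated response body containing a numeric HTML entity.
        byte[] body = "&#34;Mr. Bean&#34;".getBytes("ISO-8859-1");
        System.out.println(readLatin1Line(body));        // entity comes back untouched
    }
}
```

So the entity has to be decoded in a separate step after reading, whatever charset the reader uses.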
> When I search for: " , I can see that my encoding
> line encoded the " into %22. So this is probably the
> reason it doesn't convert & # 3 4 ; into ", because
> it will only convert %22 into ". Does anyone know how
> I can solve this problem?
& # 3 4 ; is an HTML entity (a numeric character reference), whereas %22 is URL encoding. Could you please explain where each code comes from, and which one you want decoded into what?
Thanks for your reply,
Using the BufferedReader, I read the content line by line. At some stage, I get to this line:
<b>Popular Titles</b> (Displaying 3 Results)<ol><li> <a href="/title/tt0096657/" onclick="set_args('tt0096657',1,1)">"Mr. Bean"</a> (1990)</li>
As you can see (or maybe not, because this forum converts it automatically), the title is & # 3 4 ;Mr. Bean& # 3 4 ; . This is also the String my JList displays when I put it in there. What I want, of course, is for that part of the String to be replaced with " . Of course I could achieve that using:
sName = sName.replaceAll("& # 3 4 ;", "\"");
But there may be more 'codes' like this in the web page, so I want all the HTML entities to be converted automatically.
Hopefully I made myself clear.
Regards,
Peter
I tried an HTML parser, but even that didn't convert the numeric codes to 'real' characters. So I wrote a function that does this for the character codes 0 to 255:
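public static String parseString(String string) {
    // Replace each decimal reference &#0; .. &#255; with its character.
    // String.replace is literal; with replaceAll, a replacement of "$" (36)
    // or "\" (92) throws IllegalArgumentException when the entity is present.
    for (int iCounter = 0; iCounter < 256; iCounter++)
        string = string.replace("&#" + iCounter + ";", Character.toString((char) iCounter));
    return string;
}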
It's probably pretty processor-intensive, but my program only converts about 4 lines each time you search.
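If the 256 passes over the string ever become a concern, a single-pass version is possible. This is only a sketch, with made-up class and method names, and it assumes that only numeric references (decimal or hexadecimal) need handling. Named entities such as &amp; are left as they are:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {

    // Group 1: hexadecimal digits (after &#x), group 2: decimal digits (after &#).
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    public static String decode(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave the text unchanged if the number is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement keeps '$' and '\' from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#34;Mr. Bean&#34; (1990)"));
    }
}
```

Unlike the loop above, this is not limited to codes below 256, and it scans each line once.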