Need help with HttpURLConnection

Hi,

I am trying to read data from a French site using HttpURLConnection, but the data comes back garbled.

Here is the code I am using; any help will be highly appreciated.

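import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

URL urlObject = new URL( "http://localhost:8000/i18n.htm" );
HttpURLConnection con = (HttpURLConnection)urlObject.openConnection();

// Request headers; these must be set before the connection is opened
con.setRequestProperty( "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322)" );
con.setRequestProperty( "Pragma", "no-cache" );
System.out.println("Connection is using cache true ? false -> " + con.getUseCaches());
con.setUseCaches(false); // do not use caching
con.setRequestProperty( "Host", urlObject.getHost() );

// getInputStream() opens the connection and returns the raw response bytes
InputStream inputStream = new BufferedInputStream( con.getInputStream());

// Decode the bytes as ISO-8859-1 (this charset is hard-coded, not taken from the response)
BufferedReader bin = new BufferedReader(new InputStreamReader( inputStream, "ISO-8859-1" ));

// Read the page line by line, re-joining the lines with CRLF
String line = bin.readLine();
StringBuffer sb = new StringBuffer();
while (line != null) {
    sb.append(line);
    sb.append("\r\n");
    line = bin.readLine();
}

System.out.println("The string we have is " + sb.toString());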

This is the original content:

Articles de qualité - Index alphabétique - Liste des listes - Nouvelles pages - Tous les portails - Table des catégories

and this is what I get when I run it:

Articles de qualitΘ - Index alphabΘtique - Liste des listes - Nouvelles pages -

Tous les portails - Table des catΘgories

Thanks very much in advance.

[1559 byte] By [keithd82a] at [2007-10-2 15:37:30]
# 1

> BufferedReader bin = new BufferedReader(new InputStreamReader( inputStream, "ISO-8859-1" ));

Is there a reason you chose ISO-8859-1 here? Because it looks like the actual encoding of the page is something else. You should be able to see it in a <meta> tag near the beginning of the page, or in the <?xml> prolog if it's XHTML.
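
As a rough illustration of the check described above, the sketch below looks for a declared encoding in the first bytes of a page. It is a crude heuristic, not an HTML or XML parser, and the class name `MetaCharsetSniffer`, the method `sniff`, and the regular expression are all hypothetical choices made for this example. The bytes are decoded as ISO-8859-1 only because that maps every byte to a character, which is enough to find an ASCII declaration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the first "charset=..." or "encoding=..."
// declaration in the leading bytes of a document.
final class MetaCharsetSniffer {

    // Matches e.g. charset=utf-8, charset="utf-8", encoding="UTF-8" (case-insensitive).
    private static final Pattern DECLARATION = Pattern.compile(
            "(?i)(?:charset|encoding)\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)");

    // Returns the declared encoding name, or null if none is found.
    static String sniff(byte[] head) {
        // ISO-8859-1 never fails to decode, so the ASCII markup stays readable.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARATION.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] head = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(head));
    }
}
```

The returned name could then be passed to InputStreamReader instead of a hard-coded "ISO-8859-1", assuming the declaration in the page is actually correct.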

DrClapa at 2007-7-13 15:16:56 > top of Java-index,Desktop,I18N...
# 2
And also, the encoding of your console isn't ISO-8859-1. That's why it looks messed up. But surely this is just a preliminary test and your application isn't really going to be displaying HTML on the console for people to read? Don't worry about the console.
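
Since the console's own encoding can distort what gets printed, one way to inspect the downloaded string without trusting the console is to print its code points in plain ASCII. This is not something suggested in the thread, just a minimal sketch; the class name `CodePointDump` and the method `dump` are hypothetical.

```java
// Hypothetical helper: renders a string as ASCII-only "U+XXXX" tokens so the
// result looks the same regardless of the console's encoding.
final class CodePointDump {

    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        // codePoints() walks the string one Unicode code point at a time.
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // \u00e9 is the accented e in "qualité"
        System.out.println(dump("qualit\u00e9"));
    }
}
```

If the accented letter in "qualité" shows up as U+00E9, the string was decoded correctly and only the display is at fault.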
DrClapa at 2007-7-13 15:16:56
# 3

Actually, what I pasted was the French text only; when I pasted it into the text box here and clicked save, it got converted to weird data.

Also, it is not only the console: it gets stored the same way in the database too.

It also shows up fine in Eclipse and another IDE, but in a command prompt or a shell window it shows weird characters.

Thanks very much, I will try this and see if it solves the problem.

keithd82a at 2007-7-13 15:16:56
# 4

If you're just using Western European languages then the simplest reliable test is to display your downloaded data in a GUI. (For example a JTextArea.) The standard fonts can display accented letters without any problem. If you don't see the right data in the GUI then you know you are applying the wrong encoding to the download.

Using the database to test is less practical because now you have the database, the JDBC driver, and whatever you use to display data from the database all as possible sources of data mangling. You want to test only one thing at a time as much as possible.

DrClapa at 2007-7-13 15:16:56
# 5
Yes, true, but I eventually need to store the information in the database as XMLType. I am guessing this may be an issue with the database character set support. I am trying this out by updating the database character set now.
keithd82a at 2007-7-13 15:16:56
# 6

Get the content type from the HttpURLConnection and use it when reading the stream. Whatever we read should be decoded in the same format in which it was sent to us. Once that holds, we can decide on the format to insert into the database depending on the DB charset.

So please make sure to read the URL using the same content type as it was sent with.

HttpURLConnection con = (HttpURLConnection)urlObject.openConnection();

String contentType = con.getContentType();
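
To make the suggestion above concrete, here is a minimal sketch of pulling the charset parameter out of the Content-Type value and using it to decode the stream. The class name `ContentTypeCharset`, the methods `charsetOf` and `fetch`, and the ISO-8859-1 fallback are assumptions made for this example; a server may also omit the charset or declare the wrong one, in which case this will not help.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a response using the charset named in Content-Type.
final class ContentTypeCharset {

    // Parses e.g. "text/html; charset=UTF-8". Returns the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Downloads the page and decodes it with the declared charset.
    static String fetch(String address) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        // Assumed fallback for this sketch: ISO-8859-1.
        Charset cs = charsetOf(con.getContentType(), StandardCharsets.ISO_8859_1);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(con.getInputStream(), cs))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\r\n");
            }
        }
        return sb.toString();
    }
}
```

With this in place, the original test page could be read with `ContentTypeCharset.fetch("http://localhost:8000/i18n.htm")` instead of hard-coding the encoding.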

vijayramaa at 2007-7-13 15:16:56
# 7

I figured out that issue, but now there is another problem: in the XML I have, if there is a character ? it converts it to œ, which is fine, but along with that it also adds this character ?

So ? becomes ?#156;

Does anyone have any idea what that character represents? I used to have the same problem when I was parsing XML with XSL to generate HTML and was using codes like this in the XSL   (which is a space): it always used to append ? to it.

Thanks in advance; any help will be highly appreciated.

keithd82a at 2007-7-13 15:16:56
# 8

The symptoms you describe indicate that you have text encoded in UTF-8, but that the text either goes through an incorrect conversion or gets displayed in something like ISO-8859-1. Your text is thoroughly garbled, so you need to figure out what the source encoding is, what kind of conversions take place, and which encoding you are viewing the text as.

A mismatch between the metadata tags and the actual encoding can cause this kind of issue.
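
The kind of mismatch described here can be reproduced in a few lines. The sketch below is only an illustration (the class `MojibakeDemo` and its methods are hypothetical): it encodes text as UTF-8 and then decodes those bytes as ISO-8859-1, which turns each accented letter into two characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of UTF-8 bytes being read with the wrong charset.
final class MojibakeDemo {

    // Encodes as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake: recovers the original bytes, then decodes as UTF-8.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("qualit\u00e9");
        // The single U+00E9 has become two characters, U+00C3 and U+00A9.
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals("qualit\u00e9"));
    }
}
```

The repair step only works when the wrong decoding was lossless, as it is with ISO-8859-1; it is a diagnostic aid rather than a substitute for reading the data with the right encoding in the first place.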

one_danea at 2007-7-13 15:16:56