FileReader (?) problem

I'm not sure whether this is a problem with the FileReader or with Internet Explorer, but as other applications seem to have no problems, I tend to think it's Java.

Anyway, I'm trying to read a web page that was saved from the browser into a String in Java. To do this I'm using the 'FileReader' class, but it seems the FileReader has some problems with web pages saved from IE (no problems with Firefox pages though).

Using this code:

    try {
        char[] buffer = new char[1024];
        FileReader r = new FileReader(new File("form_ie.html"));
        String result = "";
        while (true) {
            int num = r.read(buffer);
            if (num <= 0) break;
            result += new String(buffer, 0, num);
        }
        System.out.println(result);
    }
    catch (Exception ex) {
        ex.printStackTrace();
    }

The result is two strange characters at the start, and then every second character has a value (as a byte) of 0. Opening the page in Notepad, Firefox, IE, etc. gives no problems. Any idea what causes this and what I can do about it? (I'm currently removing chars with a byte value <= 0.)

The web page I'm testing can be found at http://student.kuleuven.be/~s0109731/Rommel/ under the filename "form_ie.html". This web page was generated using JavaScript, so there might be no problems for regular web pages.

Now, after reading in this web page, I extract the internal xml block. This xml block is sent to a SAXParser, but as this SAXParser only accepts InputStreams, I need to turn a String into an InputStream. I found the java.io.StringBufferInputStream class, but it's deprecated and I can't find an alternative to turn a String into an InputStream. Any suggestions?

[2339 byte] By [Lotuspeca] at [2007-10-3 5:18:42]
# 1

Without looking into it, it sounds like a character-encoding issue.

If you've got two strange bytes, they're almost certainly the byte-order-mark in a UTF-8 stream.

If the FileReader is not dealing with those automatically, then your platform's default character encoding is probably not UTF-8. Since you're on Windows, it's probably CP850 or CP437 or maybe US-ASCII or ISO-8859-1.

Either way, if the document really IS in UTF-8 (and it will help to determine this if you can say what the two bytes are), then you need to allow for this. Unfortunately, FileReader doesn't let you specify a character encoding, so you need to wrap your file in a FileInputStream and then wrap that in an InputStreamReader passing "utf-8" as the character set.
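
A minimal sketch of that wrapping, for illustration only. The class and method names (ExplicitCharsetRead, readFile, writeSample) are made up for this example, and the sample bytes are the ones quoted later in this thread, written to a temporary file so the example runs without form_ie.html. The charset name is a parameter; "UTF-16" is used here because that is what the file turns out to be.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;

class ExplicitCharsetRead {

    // Reads the whole file, decoding its bytes with the named charset
    // instead of the platform default that FileReader would use.
    static String readFile(File file, String charsetName) throws IOException {
        try (Reader reader = new InputStreamReader(new FileInputStream(file), charsetName)) {
            StringBuilder result = new StringBuilder();
            char[] buffer = new char[1024];
            int num;
            while ((num = reader.read(buffer)) != -1) {
                result.append(buffer, 0, num);
            }
            return result.toString();
        }
    }

    // Hypothetical helper: writes bytes to a temporary file that stands in
    // for the saved web page.
    static File writeSample(byte[] bytes) throws IOException {
        File file = File.createTempFile("form_ie", ".html");
        file.deleteOnExit();
        try (OutputStream out = new FileOutputStream(file)) {
            out.write(bytes);
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        // FF FE byte order mark followed by "<html>" as little-endian UTF-16.
        byte[] sample = {-1, -2, 60, 0, 104, 0, 116, 0, 109, 0, 108, 0, 62, 0};
        File file = writeSample(sample);
        System.out.println(readFile(file, "UTF-16")); // prints "<html>"
    }
}
```

Java's "UTF-16" decoder uses the byte order mark to pick the byte order and does not pass it on as a character, which is why the two strange characters disappear.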

dannyyatesa at 2007-7-14 23:25:34 > top of Java-index,Core,Core APIs...
# 2

Here is a part of the output of a FileInputStream:

-1 -2 60 0 104 0 116 0 109 0 108 0 62 0 60 0 104 0 ...

After opening my web page I found out that it's "UTF-16" (seems to work now).

Now the question is: is it possible to automatically detect the correct character set, and if so, are there any classes in Java that solve this problem?

Any ideas about my other question (InputStream for a String)? The StringBufferInputStream class seems to be broken for any characters other than default ASCII (correct?).

Lotuspec1a at 2007-7-14 23:25:35
# 3

It's not generally possible to detect the encoding, no. Some specific patterns (like the byte order marks) might hint at an encoding (in this case UTF-16 - I really didn't mean UTF-8 earlier!), but they would also be perfectly valid characters if the stream were ISO-8859-1 encoded.
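
To make the "hint" idea concrete, here is a rough sketch of checking for the common byte order marks. It is a heuristic only, for the reason given above: the same bytes are legal text in other encodings. BomSniffer and its methods are names invented for this example, and UTF-32 marks are deliberately ignored (an FF FE 00 00 prefix would be reported as UTF-16LE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {

    // Returns the charset hinted at by a leading byte order mark,
    // or null when the data does not start with a known mark.
    static Charset charsetFromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        return null;
    }

    // Compares the first bytes of data (as unsigned values) with prefix.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // The first bytes reported in this thread: -1 -2 60 0 104 0 ...
        byte[] sample = {-1, -2, 60, 0, 104, 0};
        System.out.println(charsetFromBom(sample)); // prints "UTF-16LE"
    }
}
```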

XML documents tell you their encoding (in the ?xml processing instruction) - of course, parsers have to do a bit of work to figure out what the byte encoding is before they can read the characters, but given that they're looking for specific patterns, that's pretty straightforward.

HTML documents generally don't self-identify. XHTML documents can (because they're XML). So, for HTML, the web-server normally tells you the character encoding in an HTTP header. But of course, you've already lost this information.

The reason this works when the page is saved from Firefox and not from IE is, I expect, that IE just saves out what it received, whereas Firefox translates to your platform's default encoding where possible.

As to your other question, StringBufferInputStream is indeed deprecated in favour of StringReader. But as you said, XML parsers need byte streams, not readers full of characters (because they want to do their own decoding according to the XML PI).

Sun's silly forum software only lets you see the message you are replying to, not the whole thread, so I don't know what you were originally after, but notice that SAXParser also has parse() methods that take Files and InputSources. These may help. Note that an InputSource can be constructed from a Reader (but it mustn't contain a BOM which we've already decided you have).
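
As a sketch of the InputSource route, assuming the xml has already been decoded into a String (so no BOM is left in it): the class name StringToSax and the helper elementNames are made up for this example, and the handler only collects element names to show that parsing happened.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StringToSax {

    // Parses xml held in a String and returns the element names
    // in the order their start tags appear.
    static List<String> elementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<String>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // An InputSource built from a Reader hands the parser characters,
        // so the parser does no byte decoding of its own here.
        InputSource source = new InputSource(new StringReader(xml));
        parser.parse(source, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                names.add(qName);
            }
        });
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" ?><somexml>some more xml</somexml>";
        System.out.println(elementNames(xml)); // prints "[somexml]"
    }
}
```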

Also, why read the file into a String? Why not read it into a byte array? This would be the preferred way of dealing with XML. Then you could make a ByteArrayInputStream from the byte[], and wrap that in an InputSource and you aren't mucking with any character encoding stuff outside of the XML.
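
Roughly what I mean, as a sketch rather than a finished solution. BytesToSax, toInputStream and textContent are names invented for this example. The toInputStream helper also covers the "InputStream for a String" question without the deprecated class; encoding the String as UTF-8 is an assumption on my part, chosen because that is what an XML parser assumes when nothing says otherwise.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BytesToSax {

    // Turns a String into an InputStream by encoding it to bytes first.
    // UTF-8 is an assumption made for this example.
    static InputStream toInputStream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // Parses xml from raw bytes and returns all character data it contains.
    // The parser decodes the bytes itself, so no Reader is involved.
    static String textContent(byte[] xmlBytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        parser.parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] xmlBytes = "<somexml>some more xml</somexml>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(textContent(xmlBytes)); // prints "some more xml"

        // SAXParser also has a parse() overload that takes the stream directly.
        InputStream in = toInputStream("<somexml/>");
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler());
    }
}
```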

dannyyatesa at 2007-7-14 23:25:35
# 4

Well the problem is, the xml is embedded in the html page like this:

    <html>
    ...
    <body>
    <xml>
    <somexml>some more xml</somexml>
    </xml>
    ...
    some html
    ...
    </body>
    </html>

I did it this way because I found no way to directly create an xml file using JavaScript (everything is client side). So I have to extract the xml part from the html file, remove the xml tags, and replace them with a default xml 'header' (i.e. <?xml version="1.0" ?>). So either way I end up interpreting the original html file (to search for the xml block).
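
The extraction step looks roughly like this (a simplified sketch, not my exact code). XmlIslandExtractor and extract are names used only for this example, and it assumes the tags appear exactly as lowercase <xml> and </xml> with no attributes, which a browser might not preserve when saving.

```java
class XmlIslandExtractor {

    static final String HEADER = "<?xml version=\"1.0\" ?>";

    // Returns the content between <xml> and </xml>, prefixed with the
    // default header, or null when the page has no such block.
    static String extract(String html) {
        int open = html.indexOf("<xml>");
        if (open < 0) return null;
        int start = open + "<xml>".length();
        int end = html.indexOf("</xml>", start);
        if (end < 0) return null;
        return HEADER + html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><xml>\n"
                + "<somexml>some more xml</somexml>\n"
                + "</xml>some html</body></html>";
        // prints: <?xml version="1.0" ?><somexml>some more xml</somexml>
        System.out.println(extract(html));
    }
}
```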

Lotuspec1a at 2007-7-14 23:25:35