Parse HTML entities with SAX

I am getting the following exception when I try to parse an XML document (as a String) containing HTML entities:

org.xml.sax.SAXParseException: The entity "Atilde" was referenced, but not declared.

for

<?xml version="1.0" encoding="UTF-8" ?><id="123" path="/bj&Atilde;&para;rk/">

...

the code looks like:

InputSource input = new InputSource();
input.setCharacterStream(new StringReader(theXMLString));
reader.parse(input);

Is there a way to make SAX aware of those HTML entities?

[743 byte] By [mischka24a] at [2007-10-3 0:54:14]
# 1
http://www.xml.com/pub/a/98/08/xmlqna2.html

However, you could also insert the UTF-8 encoding of ö instead of "&Atilde;&para;".
quittea at 2007-7-14 17:49:31 > top of Java-index,Java Essentials,Java Programming...
# 2

OK, thanks. But I still don't quite understand why there is no standard encoding/decoding class for those kinds of entities (since they are standard, right?). The document I'm parsing was created by a PHP service (using the function htmlentities()). I wonder why PHP doesn't use UTF-8 in the first place.

mischka24a at 2007-7-14 17:49:31
# 3

> there is no standard encoding/decoding class for
> those kind of entities (since they are standard,
> right?).

That's because these entities are declared by the HTML DTD, which is itself based on SGML. XML is also based on SGML, but it has hardly anything to do with HTML and predefines only five entities: &amp;, &lt;, &gt;, &apos; and &quot;. Any other entity has to be declared in a DTD before it can be referenced.
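To illustrate the point about declarations, here is a minimal sketch (not a recommendation) that declares the two entities in an internal DOCTYPE subset so the JDK's SAX parser accepts them. The class name, the helper parsePath and the element name "item" are made up for the example; the original snippet does not show the element name. Note that the declared entities resolve to Ã (U+00C3) and ¶ (U+00B6), so the attribute still does not contain ö.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DeclaredEntitiesDemo {

    // Parses the XML string and returns the "path" attribute of the first element.
    // Parser errors are rethrown as IllegalStateException to keep the sketch short.
    static String parsePath(String xml) {
        final String[] path = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (path[0] == null) {
                    path[0] = atts.getValue("path");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return path[0];
    }

    public static void main(String[] args) {
        // "item" is a hypothetical element name, not taken from the original document.
        String body = "<item id=\"123\" path=\"/bj&Atilde;&para;rk/\"/>";
        // Internal subset declaring the two HTML entities by their Latin-1 code points.
        String doctype = "<!DOCTYPE item ["
                + "<!ENTITY Atilde \"&#195;\">"
                + "<!ENTITY para \"&#182;\">"
                + "]>";

        String path = parsePath(doctype + body);
        // Print code points so the result does not depend on the console encoding.
        for (char c : path.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();

        try {
            parsePath(body); // no declarations: the parser rejects the document
        } catch (IllegalStateException e) {
            System.out.println("Without DOCTYPE: " + e.getCause().getMessage());
        }
    }
}
```

So declaring the entities silences the exception, but the data is still double-encoded; the declarations only move the problem.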

> The document I'm parsing was created by a
> php service (using the function htmlentities()). I
> wonder why php doesn't use utf-8 in the first place.

I'd say it does: the "Atilde-para" combo looks like a typical UTF-8 sequence. In UTF-8, ö is the two bytes 0xC3 0xB6, and read as ISO-8859-1 those bytes are Ã and ¶. Afterwards, the service applies HTML encoding to be safe (which is wrong for an XML context).
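A quick way to check that reading of the "Atilde-para" combo, as a small sketch (class and method names are invented for the example): encode ö as UTF-8 and decode the bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodingDemo {

    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    static String misreadUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String misread = misreadUtf8AsLatin1("\u00F6"); // U+00F6 is o with diaeresis
        for (char c : misread.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // prints U+00C3 (A with tilde) and then U+00B6 (pilcrow / paragraph sign)
    }
}
```

Those two code points are exactly what HTML names &Atilde; and &para;.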

quittea at 2007-7-14 17:49:31
# 4

ok, I was just looking at

http://www.w3schools.com/php/func_string_htmlentities.asp

Would it already help to change

<?xml version="1.0" encoding="UTF-8" ?>

to

<?xml version="1.0" encoding="ISO-8859-1"?>

so that the SAX parser knows how to deal with it?

mischka24a at 2007-7-14 17:49:31
# 5
No, the use of HTML entities is the real problem. Get rid of those:

<?xml version="1.0" encoding="UTF-8" ?><id="123" path="/björk/">
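If the PHP service cannot be changed, one possible workaround is to repair the string before handing it to SAX. The following is only a sketch under strong assumptions: the entity table is a deliberately tiny, hypothetical one (HTML defines many more names), every non-ASCII character in the input is assumed to be part of a UTF-8 sequence that was misread as ISO-8859-1, and the class and method names are invented here.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class EntityRepairDemo {

    // Hypothetical minimal table: only the two entities seen in this thread.
    static final Map<String, String> LATIN1_ENTITIES = new LinkedHashMap<>();
    static {
        LATIN1_ENTITIES.put("&Atilde;", "\u00C3");
        LATIN1_ENTITIES.put("&para;", "\u00B6");
    }

    // Step 1: replace the HTML entity references with the Latin-1 characters they name.
    // Step 2: undo the double encoding by taking the ISO-8859-1 bytes and decoding
    //         them as UTF-8. Characters outside Latin-1 would be lost in step 2,
    //         so this is only safe under the assumptions stated above.
    static String repair(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : LATIN1_ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        byte[] rawBytes = result.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String repaired = repair("/bj&Atilde;&para;rk/");
        // Compare against the expected string instead of printing non-ASCII text.
        System.out.println(repaired.equals("/bj\u00F6rk/")); // prints true
    }
}
```

The repaired string can then be passed to the StringReader as before. Fixing the producer so that it emits plain UTF-8 remains the cleaner solution.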
quittea at 2007-7-14 17:49:31
# 6
ok, I finally understand the problem. thanks a lot for your help.
mischka24a at 2007-7-14 17:49:31