Parse HTML entities with SAX
I am getting the following exception when I try to parse an XML document (as a String) containing HTML entities:
org.xml.sax.SAXParseException: The entity "Atilde" was referenced, but not declared.
for
<?xml version="1.0" encoding="UTF-8" ?><id="123" path="/bj&Atilde;&para;rk/">
...
The code looks like this:
InputSource input = new InputSource();
input.setCharacterStream(new StringReader(theXMLString));
reader.parse(input);
Is there a way to make SAX aware of those HTML entities?
OK, thanks. But I still don't quite understand why there is no standard encoding/decoding class for those kinds of entities (since they are standard, right?). The document I'm parsing was created by a PHP service (using the function htmlentities()). I wonder why PHP doesn't use UTF-8 in the first place.
> there is no standard encoding/decoding class for
> those kinds of entities (since they are standard,
> right?).
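That's because these entities are declared by the HTML DTD, which itself is based on SGML. XML is based on SGML too, but it has hardly anything to do with HTML and only knows five essential predefined entities (&amp;, &lt;, &gt;, &apos; and &quot;). Anything else, such as &Atilde;, has to be declared in a DTD before it can be referenced.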
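For what it's worth, one way to make the parser accept such references is to declare them yourself in an internal DTD subset. The following is only a minimal sketch of that idea, not a complete solution: it declares just the two entities from this thread, and the class name, the readPath helper and the root element name "item" are made up for illustration (the original document's element name is not shown).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DeclaredEntitiesSketch {

    // Internal DTD subset declaring only the two HTML entities seen in this thread.
    // "item" is a hypothetical root element name, not taken from the original document.
    static final String DOCTYPE =
        "<!DOCTYPE item [\n"
      + "  <!ENTITY Atilde \"&#195;\">\n"   // U+00C3
      + "  <!ENTITY para \"&#182;\">\n"     // U+00B6
      + "]>\n";

    // Parses the XML string and returns the "path" attribute of the
    // first element that carries one (or null if none does).
    static String readPath(String xml) throws Exception {
        final String[] path = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (path[0] == null) {
                    path[0] = atts.getValue("path");
                }
            }
        };
        InputSource input = new InputSource(new StringReader(xml));
        SAXParserFactory.newInstance().newSAXParser().parse(input, handler);
        return path[0];
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE has to come after the XML declaration and before the root element.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + DOCTYPE
            + "<item id=\"123\" path=\"/bj&Atilde;&para;rk/\"/>";
        System.out.println(readPath(xml));
    }
}
```

With the declarations in place the references are expanded to the two characters U+00C3 and U+00B6, so the parse succeeds, but the value is still the mangled form of the name. Declaring the entities only silences the exception; it does not repair the text.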
> The document I'm parsing was created by a
> PHP service (using the function htmlentities()). I
> wonder why PHP doesn't use UTF-8 in the first place.
I'd say it does (the "Atilde-para" combo looks like a typical two-byte UTF-8 sequence, here for "ö", read one byte at a time) and afterwards applies HTML encoding to be safe (which is wrong in an XML context).
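To illustrate where the "Atilde-para" pair comes from, here is a small sketch. It assumes the entity encoder treated the UTF-8 bytes as ISO-8859-1, which is my reading of the symptom and not something confirmed about the PHP service; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodingSketch {

    // Encode as UTF-8, then misread those bytes as ISO-8859-1.
    // "ö" (U+00F6) becomes the bytes C3 B6, which ISO-8859-1 reads as
    // U+00C3 and U+00B6, the characters HTML names Atilde and para.
    static String misreadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    // Reverse step: turn the characters back into single bytes and
    // decode them as the UTF-8 they originally were.
    static String repair(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mangled = misreadAsLatin1("bj\u00F6rk");
        // Print code points instead of characters to stay independent
        // of the console encoding.
        mangled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
        System.out.println(repair(mangled).equals("bj\u00F6rk"));
    }
}
```

The repair step only works if every character of the mangled string fits into ISO-8859-1, so treat it as a diagnostic aid. The cleaner fix is on the producing side: emit plain UTF-8 and escape only the XML special characters.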
OK, I was just looking at
http://www.w3schools.com/php/func_string_htmlentities.asp
Would it already help to change
<?xml version="1.0" encoding="UTF-8" ?>
to
<?xml version="1.0" encoding="ISO-8859-1"?>
so that the SAX parser knows how to deal with it?