Parsers resolve entity in attribute
If there's an entity in an attribute (e.g. <chapter number="&sect;1">), the SAX and DOM start tag handlers give me the resolved entity (the section symbol) rather than the entity text. I haven't found any properties or settings that give me the data the way it appears in the XML file.
How can I get the entity text rather than the resolved entity when parsing the XML? Are there alternative parsers that have the behavior I'm looking for?
By rmcgarveya at 2007-11-26 13:19:42

# 1
That's because entities are a convenience feature for authors of XML documents, and it's intended that a parser will translate the entity code into its actual meaning. Normally a program that uses XML should not care whether an entity was used by the author or not.
The only exception I can think of is an XML editor, which does actually care about external representations. So if you're writing one of those, you could try using an org.xml.sax.ext.LexicalHandler.
# 2
Thanks for the reply. In this case, the goal is an XML to XML transformation, so resolving any entities in the data or in the attributes is undesirable. An event is triggered so the entities in the data can be handled, but there's no event that I've found when an entity is encountered in an attribute.
I'm currently using DefaultHandler2, which is an implementation of LexicalHandler. None of its methods seems to be called for an entity in an attribute. I've also looked through the available properties to see if any of them disables automatic entity translation, but to no avail.
Are you thinking of something specific in the LexicalHandler class that would help?
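For anyone who wants to reproduce this, here is a minimal sketch of the setup described above. The class name, the sample document, and the internal DTD that declares the sect entity are my own assumptions for illustration, not taken from the real project. It registers a DefaultHandler2 as the lexical handler and records attribute values and startEntity calls. As far as I can tell from the LexicalHandler documentation, general entities inside attribute values are expanded silently, so I would expect a startEntity event only for the reference in the element content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

class EntityEventsDemo {
    // Hypothetical sample: "sect" is declared in an internal DTD subset and
    // referenced once in an attribute value and once in element content.
    static final String XML =
        "<!DOCTYPE book [<!ENTITY sect \"&#167;\">]>"
        + "<book><chapter number=\"&sect;1\">See &sect;2</chapter></book>";

    // Parses the document and returns a log of attribute values and entity events.
    static List<String> collectEvents(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Attribute values arrive already resolved.
                for (int i = 0; i < atts.getLength(); i++) {
                    events.add("attr " + atts.getQName(i) + "=" + atts.getValue(i));
                }
            }

            @Override
            public void startEntity(String name) {
                // Called for entity references the parser chooses to report.
                events.add("startEntity " + name);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // DefaultHandler2 must be registered separately as the lexical handler.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        collectEvents(XML).forEach(System.out::println);
    }
}
```

The attribute line carries the section symbol itself, and nothing in the log marks where the reference in the attribute started or ended.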
# 3
> Thanks for the reply. In this case, the goal is an
> XML to XML transformation, so resolving any entities
> in the data or in the attributes is undesirable.
Why is that? The version with the entities resolved is equivalent to the version with the entities in it. Is there a reason to consider them different?
> Are you thinking of something specific in the
> LexicalHandler class that would help?
No, just pointing to a useful class on the hypothesis you didn't know about it.
# 4
Hi!
I was also looking for a way of keeping the entities in attributes, preferably using DOM. I get it now that the parsers do not support this, and the most common answer to this query seems to be that you are doing something strange in your application.
But I have a case here which I for sure don't see as odd or strange which I would like to share.
In my case I have a graphical editor that uses XML for storing the configurations that you can create and edit. Using entity references in attributes is great for some purposes: define a value in one place and reuse it in many XML files, then change the value once and all references get updated. But when the files pass through the editor, all references are resolved, and I cannot write them back as references when storing the configuration.
Just wanted to share this even if this was an old thread, and I find it strange that you cannot keep both the resolved reference as well as the reference itself in a DOM tree so you can decide yourself how to handle the attribute.
jopia at 2007-7-7 17:46:52

# 5
I managed this for an XML pretty-print plug-in I'm writing. I'm using a hack to hide entity references from the parser.
Basically, you url-encode the ampersands (change them all to %26) in the string representation of the original document, so that the parser leaves them the heck alone.
That looks like this:
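    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    public Document getDom(String str) throws SAXException {
        Document d = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            // url-encode the ampersands so the parser sees no entity references
            // (replace() is literal, so no regex is involved)
            Reader r = new StringReader(str.replace("&", "%26"));
            InputSource is = new InputSource(r);
            d = db.parse(is);
        } catch (IOException | ParserConfigurationException e) {
            // SAXException is not caught here, so parse errors reach the caller
            e.printStackTrace();
        }
        return d;
    }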
Then, when you're done with the parser, you can decode the ampersands back ( mySerializedString.replaceAll("%26","&"); ) and your entities come back as good as new.
Of course, if you have any url-encoded ampersands in your original document this second step will decode those as well. You could try to account for that with an additional step of encoding/decoding, but I'm sure that would upset something else.
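For what it's worth, here is a rough sketch of what that additional step could look like: escape every literal percent sign first, so that a %26 already present in the document cannot be confused with a masked ampersand. The class and method names (AmpersandMask, mask, unmask, rawAttribute) are made up for this example, and I have only reasoned through the simple cases. A document whose DTD uses parameter entities, which also start with a percent sign, would presumably still break.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AmpersandMask {
    // Escape '%' first so every '%' in the result starts either %25 or %26.
    static String mask(String xml) {
        return xml.replace("%", "%25").replace("&", "%26");
    }

    // Reverse order: restore ampersands first, then the literal percent signs.
    static String unmask(String text) {
        return text.replace("%26", "&").replace("%25", "%");
    }

    // Parses the masked document and returns one attribute of the root
    // element as it was written in the source, references included.
    static String rawAttribute(String xml, String attrName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(mask(xml))));
        return unmask(doc.getDocumentElement().getAttribute(attrName));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input: one entity reference, one pre-existing "%26".
        String xml = "<chapter number=\"&sect;1\" href=\"a%26b\"/>";
        System.out.println(rawAttribute(xml, "number"));
        System.out.println(rawAttribute(xml, "href"));
    }
}
```

The catch is that everything read from the DOM in between is still masked, so any code that looks at attribute values or text has to call unmask itself, and it never sees the resolved entity values.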
# 6
Thanks Petiex, but then you have the problem of not having access to the resolved values.
I just cannot understand why the standard would not allow you to have unresolved entities as well as the resolved values in our tree.
Seriously thinking about making a hack in the DOM parser, but I would really like to avoid this.
/jopi (jopi2 since I think I forgot to verify my account)