Using JAXP 1.3 to serialize using numeric character entity references
I have looked high and low for a way to ensure that my XML document is serialized with numeric character entity references and haven't found one yet. For example, I want &#xXXXX; (where XXXX is the hex value of the character) to be output in the document instead of the character's UTF-8 encoding. Is there a setting to do this in JAXP 1.3 or with some other alternative? Here's a simplified version of the code I'm using to serialize:
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSOutput out = implLS.createLSOutput();
OutputStream outputStream = new FileOutputStream(outputFile);
out.setByteStream(outputStream);
out.setEncoding(TEPConstants.ACS_JOURNAL_ENCODING);
LSSerializer writer = implLS.createLSSerializer();
writer.write(doc, out);
Any help would be greatly appreciated!
Thanks,
Jeff Bailey
# 1
I'm assuming you don't want all characters serialized as numeric entities, as that would make the document 6 times as large as it needs to be, not to mention unreadable by humans. You just want that for some characters. Is that correct?
If so, what characters do you want to appear as numeric entities? (And why?)
# 2
Sorry, I should have been more specific. I want all non-ASCII characters to appear as numeric character entity references (NCRs), primarily because of an agreement we have with an external party that needs to process them (so it's more of a political reason than a technology reason).
We use the ArborText Editor and ArborText Publishing Engine, which can write out all non-ASCII characters as character entity references, as numeric character entity references, or in the native encoding of the document. I have looked hard, but have been unable to find the same sort of control when doing XML processing in Java.
# 3
Then I think your best bet is to set the encoding of the output to US-ASCII. This will force the serializer to use the numeric entities for all characters that can't be represented in US-ASCII.
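A minimal sketch of that suggestion, reusing the DOM Load and Save calls from the original post. The class name, the helper methods, and the sample document are made up for illustration, and the literal "US-ASCII" stands in for whatever TEPConstants.ACS_JOURNAL_ENCODING currently holds. Whether the references come out in decimal or hexadecimal form depends on the serializer implementation, so the sketch makes no assumption about that.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical demo class; not part of the original poster's code.
class AsciiSerializeSketch {

    // Serializes the document with US-ASCII as the output encoding, so the
    // serializer has to escape every character it cannot encode directly.
    static byte[] serializeAsAscii(Document doc) {
        DOMImplementationLS implLS =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        LSOutput out = implLS.createLSOutput();
        out.setByteStream(bytes);
        out.setEncoding("US-ASCII");
        LSSerializer writer = implLS.createLSSerializer();
        writer.write(doc, out);
        return bytes.toByteArray();
    }

    // Builds a tiny document containing one non-ASCII character (e with acute).
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("title");
            root.setTextContent("caf\u00E9");
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = serializeAsAscii(sampleDocument());
        // The accented character should appear as a numeric character reference.
        System.out.println(new String(xml, StandardCharsets.US_ASCII));
    }
}
```

Writing to a ByteArrayOutputStream here is only for the demo; a FileOutputStream as in the original code works the same way.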
Most serializers don't provide options to control that sort of thing, because the XML spec says that it doesn't matter which representation of a character you use in your document. And a compliant XML parser will process any of the possible representations. However if you have to send data to people with rubbish parsers then you gotta do what you gotta do.
Hopefully they don't demand the decimal form while your serializer decides to output the hexadecimal form, or vice versa. In that case you might have to post-process the output.
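If post-processing does turn out to be necessary, one possible sketch is a regex pass that rewrites decimal references as hexadecimal ones. The class and method names are invented for this example. It assumes the serialized text is already pure ASCII, and it deliberately ignores the corner case of something that looks like a reference inside a CDATA section or a comment, where the text is literal and should not be rewritten.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of any JAXP API.
class NcrPostProcessSketch {

    // Matches decimal references such as &#233; but not hex ones such as &#xE9;
    private static final Pattern DECIMAL_NCR = Pattern.compile("&#([0-9]+);");

    // Rewrites every decimal numeric character reference in hexadecimal form.
    static String decimalToHex(String xml) {
        Matcher m = DECIMAL_NCR.matcher(xml);
        StringBuffer result = new StringBuffer();
        while (m.find()) {
            // A well-formed document only contains code points that fit in an int.
            int codePoint = Integer.parseInt(m.group(1));
            String hex = Integer.toHexString(codePoint).toUpperCase(Locale.ROOT);
            m.appendReplacement(result, "&#x" + hex + ";");
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(decimalToHex("<title>caf&#233;</title>"));
        // prints <title>caf&#xE9;</title>
    }
}
```

Going the other way (hex to decimal) is the same idea with the pattern and the number formatting swapped.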