Remove XML Not In Schema
Hi,
I was wondering if anyone knew of a library that would take an XML document and a schema and remove any invalid nodes (i.e. those not in the schema). I want this so I can create a schema that's a subset of the XHTML schema and validate user-submitted XHTML against it.
Or could anyone suggest an alternative way to do this? Thanks.
-Sam
[374 byte] By [samblake0a] at [2007-10-2 23:59:11]

Offhand, no library that I can think of. Normally, a schema is used to verify that some client is providing you with valid data ... re-arranging bad data until it looks good is a fringe application.
As for writing such a thing yourself ... it's difficult. I did the opposite, creating a library that lets you add XML elements into the proper place in the document based on a schema. But it was for a very limited problem domain, and the schema documents weren't very complex -- just nested sequences with the occasional choice thrown in.
> so I can create a schema that's a subset of the XHTML schema
> and validate user inputted XHTML based on this
This may be slightly easier. If all you care about is which elements are allowed, you can simply traverse the document and throw out any element that isn't on your list. Then pass the result to your validator.
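To make the traverse-and-discard idea concrete, here is a minimal sketch using the JDK's built-in DOM parser. The class name WhitelistFilter and the ALLOWED set are made up for illustration; they don't come from any schema. It also assumes the input is well-formed XML with no DOCTYPE, and it leaves the root element alone.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: removes every element whose name is not whitelisted.
class WhitelistFilter {
    // Assumed whitelist for illustration only; substitute your own subset.
    static final Set<String> ALLOWED = Set.of("div", "p", "b", "i", "ul", "li");

    // Walk the children of 'parent', dropping disallowed elements (and
    // everything inside them) and recursing into the allowed ones.
    static void prune(Element parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // grab before a possible removal
            if (child instanceof Element) {
                Element e = (Element) child;
                if (ALLOWED.contains(e.getTagName())) {
                    prune(e);
                } else {
                    parent.removeChild(e);
                }
            }
            child = next;
        }
    }

    // Parse, prune, and serialize back to a string. The root is not checked.
    static String filter(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            // User input: refuse DOCTYPE declarations to avoid entity tricks.
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = f.newDocumentBuilder()
                            .parse(new InputSource(new StringReader(xml)));
            prune(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new RuntimeException("could not filter document", ex);
        }
    }
}
```

Calling WhitelistFilter.filter("<div><p>Hi <b>there</b></p><script>alert(1)</script></div>") should hand back the document without the script element. Note that this drops the disallowed element's text as well; if you'd rather keep the text and only strip the tags, you would move the children up before removing the node.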
If the decision depends on context, it's going to be a lot harder (e.g., throw out <b> when it's inside a <dl> but not when it's inside a <ul>). In this case, you'll probably need to write a parser (using JavaCC, for example) that responds to error conditions by throwing out the offending elements.
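For simple cases you might get away without a full parser. The sketch below is not the JavaCC approach; it swaps in a hand-written table of which children each parent may contain, then prunes the DOM against that table. The class name ContextFilter and the CHILDREN table are invented for the example and do not reflect the real XHTML content model. It ignores ordering and cardinality, which a real schema would also constrain.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: keeps a child only if its parent's entry allows it.
class ContextFilter {
    // Assumed parent -> allowed-children table, for illustration only.
    static final Map<String, Set<String>> CHILDREN = Map.of(
        "div", Set.of("ul", "dl"),
        "ul",  Set.of("li", "b"),   // <b> tolerated here...
        "dl",  Set.of("dt", "dd"),  // ...but not here
        "li",  Set.of("b"),
        "dd",  Set.of("b"));

    static void prune(Element parent) {
        // Parents missing from the table are treated as allowing no children.
        Set<String> allowed =
            CHILDREN.getOrDefault(parent.getTagName(), Set.<String>of());
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // grab before a possible removal
            if (child instanceof Element) {
                Element e = (Element) child;
                if (allowed.contains(e.getTagName())) {
                    prune(e);
                } else {
                    parent.removeChild(e);
                }
            }
            child = next;
        }
    }

    // Parse, prune by context, and serialize. The root itself is not checked.
    static String filter(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = f.newDocumentBuilder()
                            .parse(new InputSource(new StringReader(xml)));
            prune(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new RuntimeException("could not filter document", ex);
        }
    }
}
```

Given "<div><ul><b>x</b><li>one</li></ul><dl><b>y</b><dt>t</dt></dl></div>", the <b> under <ul> should survive and the one under <dl> should be dropped. Once the rules involve sequences, choices, or occurrence counts, a table like this stops being enough and you're back to something parser-like.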