Parsing HTML to get DOM structure
I have been looking at various XML/HTML libraries such as JTidy, HotSax, Xalan, TagSoup, htmlparser, etc., trying to find one that would let me parse some HTML and retrieve the DOM structure of the document without trying to clean it up.
My goal is to write an application that can go through a large number of HTML templates and modify parts of them. Since these templates can be footers, headers, or just pieces of content, I don't want HTML and BODY tags to be generated automatically...
Is there any way I could achieve that? All the libraries I tried ended up generating some extra HTML in the DOM structure which I wasn't able to get rid of...
Dalzhima at 2007-11-27 5:31:38

There are boatloads of HTML-parsing libraries out there. I was personally under the impression that TagSoup attempted to just parse the DOM without modifying it unless you instructed it to. If that doesn't work, and you don't mind putting a bit of extra effort into it, Mozilla's open-source HTML parser is just about as good as it gets.
Joe
Joe_ha at 2007-7-12 14:57:01

I guess TagSoup is SAX-based, so it won't build a tree for you. Anyway, do you need to store the resulting structure? Or can you write a simple Swing parser to run through the output of TagSoup and remove the unwanted tags?
I'll look through TagSoup's API and see if there's some way to suppress those in the meantime.
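To make the filtering idea concrete, here is a rough sketch of a SAX filter that swallows the wrapper tags and passes everything else through. It uses the JDK's built-in XML parser as a stand-in so that it runs without extra jars, which means the input has to be well-formed. As far as I know TagSoup's parser is also an XMLReader, so the same filter should be able to sit on top of it, but I haven't verified that here. The names WrapperTagFilter and strip are made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: drops <html> and <body> events, forwards everything else.
final class WrapperTagFilter extends XMLFilterImpl {
    private static final Set<String> DROPPED =
            new HashSet<>(Arrays.asList("html", "body"));

    WrapperTagFilter(XMLReader parent) {
        super(parent);
    }

    // Prefer the local name; fall back to the qualified name if it is empty.
    private static boolean isDropped(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return DROPPED.contains(name.toLowerCase(Locale.ROOT));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!isDropped(localName, qName)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (!isDropped(localName, qName)) {
            super.endElement(uri, localName, qName);
        }
    }

    // Parses the markup, filters out the wrapper tags, and serializes the rest.
    static String strip(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // Stand-in reader: the JDK XML parser (assumption: a TagSoup reader
            // could be passed to the filter instead).
            XMLReader reader = factory.newSAXParser().getXMLReader();
            WrapperTagFilter filter = new WrapperTagFilter(reader);

            // An identity transform simply replays the filtered SAX events as text.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(filter, new InputSource(new StringReader(markup))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not filter markup", e);
        }
    }
}
```

Calling WrapperTagFilter.strip(...) on a template wrapped in html and body tags should give back only the inner markup. Note that the filter drops every html or body element, including ones that were really in the template, so it only fits the case where the templates never contain them.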
Joe
Joe_ha at 2007-7-12 14:57:01

Well, what I'm writing is a program that processes existing HTML templates so that I can refactor certain patterns we have targeted and make everything more uniform.
Thus I want to be able to read HTML code, alter it, and then produce the result without adding any extra tags guessed by a cleaner. The reason is simple: since the templates are only pieces of a final page, I don't want to end up with <html> tags inside every template piece!
Oh, and it is true that TagSoup is SAX-based, but I combined it with Xalan so that it produces a DOM tree. Here's the resource I found which helped me do that:
http://www.hackdiary.com/archives/000041.html
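For anyone who doesn't want to follow the link, the general SAX-to-DOM trick looks roughly like the sketch below: an identity transform takes events from any XMLReader and writes them into a DOMResult. This is not the exact code from that article. It uses the JDK's bundled parser and transformer so it runs as-is, and the class and method names (SaxToDom, toDom, parse) are made up for the example. Swapping in a TagSoup reader is the assumed next step and is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: builds a DOM tree from the events of a SAX reader.
final class SaxToDom {

    // Replays the reader's SAX events into a fresh DOM Document.
    static Document toDom(XMLReader reader, String markup) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            identity.transform(
                    new SAXSource(reader, new InputSource(new StringReader(markup))),
                    result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build DOM", e);
        }
    }

    // Convenience overload using the JDK XML parser, so input must be well-formed.
    // Assumption: any other XMLReader implementation could be passed to toDom instead.
    static Document parse(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return toDom(factory.newSAXParser().getXMLReader(), markup);
        } catch (Exception e) {
            throw new IllegalStateException("Could not create SAX parser", e);
        }
    }
}
```

With the JDK parser, SaxToDom.parse("<div class=\"header\"><p>Hi</p></div>") yields a document whose root is the div, with nothing added around it. Any extra html or body elements would come from the reader that is plugged in, not from this conversion step.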
I found a library which has finally made it possible to retrieve the original DOM structure of an HTML template without adding any extras!
http://html.xamjwg.org/cobra.jsp
The only thing I'm still concerned about is that they say the parser will apply any JavaScript DOM instructions to the generated structure...