Changing an HTML file to a text file using Java
Hi,
I am doing a project in which I have to read news articles from websites. I have tried XML, but for that I need to know which tag contains the article. Since I have to read from various websites, and each site uses different tags for different information, that does not work well.
Is there any way I can change an HTML file into a text file using Java? Maybe some command that removes all the HTML tags and gives just the information. Is there anything else that anyone would like to recommend?
Thanks
yafis
[521 byte] By [aliyafisa] at [2007-10-2 10:56:05]

I think you'll have to just look at the HTML file carefully. Your browser should have a view source function. Failing that, save the .html file, and open it with a text editor.
I mean, HTML is text...
If you wanted to go overboard, there are probably code samples showing how to view the contents of an HTML file as a tree.
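As a rough idea of what such a tree view could look like, here is a minimal sketch that uses the HTML parser bundled with Swing (ParserDelegator and HTMLEditorKit.ParserCallback) to print one line per tag, indented by nesting depth. The class name TagOutline, the method outline and the sample markup are made up for illustration. Note that the parser may add implied tags of its own for sloppy markup, so the outline will not always match the source exactly.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: prints the tag structure of an HTML string.
class TagOutline
{
    // Returns one line per tag, indented two spaces per nesting level.
    static String outline(String html) throws IOException
    {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback()
        {
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
            {
                appendLine(t);
                depth++;
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos)
            {
                // Guard against unbalanced markup.
                depth = Math.max(0, depth - 1);
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
            {
                // Tags without content, such as <br> or <img>.
                appendLine(t);
            }

            private void appendLine(HTML.Tag t)
            {
                for (int i = 0; i < depth; i++)
                {
                    out.append("  ");
                }
                out.append(t).append('\n');
            }
        };
        // 'true' tells the parser to ignore any charset directive in the page.
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException
    {
        String sample = "<html><head><title>T</title></head>"
            + "<body><p>Hi <b>there</b></p></body></html>";
        System.out.print(outline(sample));
    }
}
```

Looking at an outline like this for a few pages from each site can help you spot which tags usually wrap the article text.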
You could also load a web page into a JEditorPane, select all its contents, and print them like this:
JEditorPane pane = new JEditorPane(url);
pane.selectAll();
System.out.println(pane.getSelectedText());
The problem with this, however, is that the pane takes a while to load the page, so the text may not be available right after the constructor returns.
Upon rereading the post, I think you might be doing something similar to what I was doing.
Maybe something like this:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        // Create a reader on the HTML content; it is closed when parsing ends.
        try (Reader rd = getReader(args[0]))
        {
            // Parse the HTML.
            kit.read(rd, doc, 0);
        }
        // The plain text is now stored in the document.
        System.out.println(doc.getText(0, doc.getLength()));
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:" or "https:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
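If the output above picks up stylesheet rules or other clutter, a variation on the same idea is to drive the parser yourself and keep only the text you want. This is just a sketch, not a tested solution for real news sites: the class name TextOnly and the method extract are invented for the example, and skipping only <script> and <style> is my assumption about what counts as clutter.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of an HTML page, minus tags.
class TextOnly
{
    static String extract(Reader html) throws IOException
    {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback()
        {
            // Greater than zero while inside <script> or <style>.
            private int skipDepth = 0;

            private boolean isSkipped(HTML.Tag t)
            {
                return t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE;
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
            {
                if (isSkipped(t))
                {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos)
            {
                if (isSkipped(t) && skipDepth > 0)
                {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos)
            {
                if (skipDepth == 0)
                {
                    // A space keeps words from adjacent chunks apart.
                    text.append(data).append(' ');
                }
            }
        };
        // 'true' tells the parser to ignore any charset directive in the page.
        new ParserDelegator().parse(html, cb, true);
        // Collapse runs of whitespace into single spaces.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException
    {
        String sample = "<html><head><style>p { color: red; }</style></head>"
            + "<body><h1>Headline</h1><p>First <b>bold</b> word.</p>"
            + "<script>var x = 1;</script></body></html>";
        System.out.println(extract(new StringReader(sample)));
    }
}
```

You could pass the Reader returned by getReader() above instead of the StringReader. Spacing around punctuation may come out slightly off, and menus, ads and footers will still be in the output, so you would probably still need some per-site filtering to isolate the article itself.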