Converting an HTML file to a text file using Java

Hi,

I am working on a project in which I have to read news articles from websites. I have tried XML, but for that I need to know which tag contains the article. Because I have to read from various websites, and each site uses different tags for different pieces of information, I cannot know the tag in advance.

Is there any way to convert an HTML file into a text file using Java? Maybe some command that removes all the HTML tags and leaves just the information? Is there anything else that anyone would recommend?

Thanks

yafis

[521 byte] By [aliyafisa] at [2007-10-2 10:56:05]
# 1

I think you'll have to just look at the HTML file carefully. Your browser should have a view source function. Failing that, save the .html file, and open it with a text editor.

I mean, HTML is text...

If you wanted to go overboard, there are probably code samples that show how to view the contents of an HTML file as a tree.
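
For what it's worth, the tree view can be sketched with the HTML classes that ship with Swing. This is only a rough illustration: the class name HtmlTree and the method dump are made up for this example, and the element names printed are whatever Swing's HTMLDocument reports, not necessarily the tags exactly as written in the page.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper: prints the element tree of an HTML string.
class HtmlTree {

    // Parses the HTML and returns one indented line per element.
    static String dump(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Ignore any charset directive in the page; we already have a String.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not parse HTML", e);
        }
        StringBuilder sb = new StringBuilder();
        walk(doc.getDefaultRootElement(), 0, sb);
        return sb.toString();
    }

    // Depth-first walk: append this element's name, then recurse into children.
    private static void walk(Element element, int depth, StringBuilder sb) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(element.getName()).append('\n');
        for (int i = 0; i < element.getElementCount(); i++) {
            walk(element.getElement(i), depth + 1, sb);
        }
    }
}
```

Calling HtmlTree.dump on a saved page gives an outline of its structure, which can help when deciding which part of the page holds the article.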

Laughingmana at 2007-7-13 3:20:47 > top of Java-index,Java Essentials,Java Programming...
# 2
Take a look at one of these open source Java HTML parsers: http://java-source.net/open-source/html-parsers
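
If adding a third-party library is not an option, the JDK also ships a small HTML parser in javax.swing.text.html.parser. The sketch below is only an assumption about how tag stripping could look with it; the class name TagStripper and the method stripTags are invented for this example, and the choice to skip script and style content is mine, not something the libraries on that list necessarily do.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of an HTML string and drops the tags.
class TagStripper {

    static String stripTags(String html) {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Greater than zero while inside <script> or <style>.
            private int skipDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (skipDepth == 0) {
                    // Separate text runs with a single space.
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(data);
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore charset directives in the page.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The result is all text on the page, including menus and footers, so it still does not tell you which part is the article.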
prometheuzza at 2007-7-13 3:20:47
# 3

You could also load a web page into a JEditorPane, select all its contents, and print them like this:

JEditorPane pane = new JEditorPane(url);
pane.selectAll();
System.out.println(pane.getSelectedText());

The problem with this, however, is that the pane takes a while to load the page.

Upon rereading the post, I think you might be doing something similar to what I was doing.

Laughingmana at 2007-7-13 3:20:47
# 4

Maybe something like this:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly,
        // so tell it to ignore any charset directive in the HTML.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content and parse the HTML.
        // The reader is closed automatically once parsing is done.
        try (Reader rd = getReader(args[0]))
        {
            kit.read(rd, doc, 0);
        }

        // The text of the page, without tags, is now stored in the document.
        System.out.println( doc.getText(0, doc.getLength()) );
    }

    // Returns a reader on the HTML data. If 'uri' begins with
    // "http:" or "https:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
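
One caveat with getReader above: InputStreamReader without a charset argument uses the platform default encoding, which may not match the page. A possible workaround, sketched here under my own assumptions, is to take the charset from the Content-Type header. The class name CharsetSniffer, the method charsetFrom, and the UTF-8 fallback are all choices made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: picks a charset from a Content-Type header value
// such as "text/html; charset=ISO-8859-1".
class CharsetSniffer {

    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for a "charset=" parameter.
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the fallback below.
                    }
                }
            }
        }
        // Assumed fallback when the header gives no usable charset.
        return StandardCharsets.UTF_8;
    }
}
```

In getReader, this could be used as new InputStreamReader(conn.getInputStream(), CharsetSniffer.charsetFrom(conn.getContentType())).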

camickra at 2007-7-13 3:20:47
# 5
Not sure what constraints you're operating under, but if you can pick which Web site you get news from, you might pick one that has RSS. That way you don't have to guess the article's tag name.
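
As a rough sketch of the RSS idea, assuming an RSS 2.0 feed where each item element has title, link and description children, the JDK's own XML parser is enough. The class name RssItems and the method read are made up for this example, and real feeds may differ (Atom feeds, for instance, use other element names).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: reads the items of an RSS 2.0 feed.
class RssItems {

    // Returns one String[] per item: { title, link, description }.
    static List<String[]> read(InputStream in) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }

        List<String[]> result = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            result.add(new String[] {
                childText(item, "title"),
                childText(item, "link"),
                childText(item, "description")
            });
        }
        return result;
    }

    // Text of the first descendant with the given tag name, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
    }
}
```

The stream could come from new URL(feedUrl).openStream(), where feedUrl is whatever feed address the site publishes. Note that many feeds carry only a summary in description, so the full article may still have to be fetched from the link.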
Dick_Adamsa at 2007-7-13 3:20:47