Getting the Text in an HTML Document

Hi,

I want to read the HTML of a web page and extract only the text that is displayed on it, so I need to parse the HTML somehow. I don't want any of the pictures etc. What I ultimately want is a list of all the words that get displayed on the page. Can anyone give me any pointers on where to start? This is important for my project's initialization. I found some code, but it doesn't work: the following code only gives me the URL name. Any suggestions?

Cheers,

Yigit

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;
import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// This method takes a URI which can be either a filename (e.g. file://c:/dir/file.html)
// or a URL (e.g. http://host.com/page.html) and returns all text in the document.
public static String getText(String uriStr) {
    final StringBuffer buf = new StringBuffer(1000);
    try {
        // Create an HTML document that appends all text to buf
        HTMLDocument doc = new HTMLDocument() {
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    // This method is called whenever text is encountered in the HTML file
                    public void handleText(char[] data, int pos) {
                        buf.append(data);
                        buf.append('\n');
                    }
                };
            }
        };
        // Create a reader on the HTML content
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());
        // Parse the HTML
        EditorKit kit = new HTMLEditorKit();
        kit.read(rd, doc, 0);
    } catch (MalformedURLException e) {
        // ignored
    } catch (URISyntaxException e) {
        // ignored
    } catch (BadLocationException e) {
        // ignored
    } catch (IOException e) {
        // ignored
    }
    // Return the text
    return buf.toString();
}

[3251 byte] By [Philaphia] at [2007-10-3 3:29:32]
# 1
Hi, I used Jericho to parse an HTML document and it worked well for me. Try this link: http://java-source.net/open-source/html-parsers HTH
java_2006a at 2007-7-14 21:23:18 > top of Java-index,Java Essentials,Java Programming...
# 2

Try http://htmlparser.sourceforge.net/

It's very simple and comes with samples.

Code:

StringExtractor se = new StringExtractor("http://www.google.co.uk");

System.out.println(se.extractStrings(false));

Output:

Google

Personalised Home | Sign in

Web Images Groups News Froogle more »

Advanced Search

Preferences

Language Tools

Search: the web pages from the UK

Advertising Programmes - Business Solutions - About Google - Go to Google.com

©2006 Google

simple as that!

lfschucka at 2007-7-14 21:23:18
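
For anyone who would rather stay within the JDK, here is a minimal sketch of the same idea as the code in the question, driving the Swing HTML parser directly through ParserDelegator instead of going through HTMLDocument. The class name HtmlWords and the helper names extractText and words are made up for illustration and do not come from the thread. One possible reason the original code returns so little is that its empty catch blocks hide a ChangedCharSetException thrown when the page declares a charset in a meta tag; that is only a guess, but the sketch passes ignoreCharSet = true so such a directive cannot abort the parse. It does not try to filter out script or style content.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is not from the thread.
class HtmlWords {

    // Collects every text run the parser reports, one run per line.
    static String extractText(Reader html) throws IOException {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };
        // Third argument true: ignore <meta> charset directives rather than
        // throwing ChangedCharSetException part-way through the page.
        new ParserDelegator().parse(html, callback, true);
        return buf.toString();
    }

    // Splits text into words on anything that is not a letter or digit.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String w : text.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Inline sample page so the sketch runs without network access.
        String page = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello <b>big</b> world</p><img src=\"a.png\"></body></html>";
        System.out.println(words(extractText(new StringReader(page))));
    }
}
```

To run it against a real page, pass extractText a Reader over the page instead of the StringReader, for example an InputStreamReader wrapped around url.openStream() with an explicit charset (which charset is right depends on the page). Letting exceptions propagate, as this sketch does, also makes it much easier to see why a fetch or parse fails.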