Getting the Text in an HTML Document
Hi,
I want to read the HTML of a web page and extract just the text that is displayed on it, so I need to parse the HTML somehow. I don't want any of the pictures, etc. What I ultimately want is a list of all the words that are displayed on the page. Can anyone give me pointers on where to start? This is important for my project's initialization. I found some code, but it doesn't work: the following code only gives me the URL name. Any suggestions?
Cheers,
Yigit
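import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;
import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// This method takes a URI, which can be either a file URI (e.g. file:///c:/dir/file.html)
// or a URL (e.g. http://host.com/page.html), and returns all text in the document.
public static String getText(String uriStr) {
    final StringBuffer buf = new StringBuffer(1000);
    try {
        // Create an HTML document that appends all text to buf
        HTMLDocument doc = new HTMLDocument() {
            @Override
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    // This method is called whenever text is encountered in the HTML file
                    @Override
                    public void handleText(char[] data, int pos) {
                        buf.append(data);
                        buf.append('\n');
                    }
                };
            }
        };
        // Ignore <meta> charset declarations; otherwise read() can throw
        // ChangedCharSetException (an IOException) before any text is collected
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        // Create a reader on the HTML content
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();
        // Parse the HTML, closing the reader when done
        try (Reader rd = new InputStreamReader(conn.getInputStream())) {
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        }
    } catch (URISyntaxException | BadLocationException | IOException e) {
        // Report the failure instead of silently returning empty or partial text
        e.printStackTrace();
    }
    // Return the text
    return buf.toString();
}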
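
Since the goal stated above is a list of all the words on the page, the string returned by getText still has to be split into words. Here is a minimal sketch of one way that could be done. The class name WordList and the method extractWords are hypothetical names made up for this illustration, and treating a "word" as any run of letters, digits, or apostrophes is an assumption that may need adjusting for a real project.

```java
import java.util.ArrayList;
import java.util.List;

public class WordList {
    // Hypothetical helper (not part of the original code): splits plain text,
    // such as the result of getText, into a list of words.
    public static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        // Assumption: a word is a run of letters, digits, or apostrophes;
        // everything else (spaces, newlines, punctuation) is a separator.
        for (String token : text.split("[^\\p{L}\\p{N}']+")) {
            // split() can yield an empty first token when the text starts with a separator
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a page
        String sample = "Hello, world!\nThis is a test.";
        System.out.println(extractWords(sample)); // prints [Hello, world, This, is, a, test]
    }
}
```

In practice the sample string would be replaced by the result of getText(uriStr). Lowercasing each token or collecting the words into a Set are possible variations if duplicates or case differences do not matter.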

