Parsing raw HTML Text
Hello all
I am working on a program that connects to 5 different sites (Yahoo, Google, etc.), gets the stock information/points, and brings it back to one site. I have been reading a book for a while, and it tells me how to connect and everything, but it keeps saying that when we use the openStream() method of the URL class, it returns raw HTML.
How can I get the information I want from the raw HTML? I know there isn't any specific method that can be used to just get to the information I want from the raw HTML, but a little help from you all on how to parse HTML tags would really help.
I would appreciate any help, or a pointer to any good sample code online on this matter.
Thanks
> I am working on a program that connects to 5
> different sites (Yahoo, Google, etc.), gets the
> stock information/points, and brings it back to one
> site. I have been reading a book for a while, and it
> tells me how to connect and everything, but it keeps
> saying that when we use the openStream() method of
> the URL class, it returns raw HTML.
Wow. You request an HTML page and, guess what, you get HTML. I wouldn't consider HTML "raw", by the way. It's actually very rich in content and formatting information.
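For reference, here is a minimal sketch of pulling everything that openStream() hands back into a single String, so there is something to search or parse afterwards. The class name PageReader and the method readAll are made up for illustration, UTF-8 is an assumption about the page encoding, and a ByteArrayInputStream stands in for the real URL stream so the sketch runs offline.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a whole stream into one String.
class PageReader {

    // Reads every line from the stream, assuming the bytes are UTF-8.
    static String readAll(InputStream in) {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(); the HTML here is invented sample data.
        InputStream in = new ByteArrayInputStream(
                "<html><body><b>Last</b> 42.17</body></html>"
                        .getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in));
    }
}
```

In a real program the InputStream would come from the URL object instead of the byte array; the rest should stay the same.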
> How can I get the information I want from the raw
> HTML? I know there isn't any specific method that can
> be used to just get to the information I want from
> the raw HTML, but a little help from you all on how
> to parse HTML tags would really help.
If it's true XHTML, an XML parser might help. Otherwise, simply type "java html parser" into Google. Lots of results.
How about looking into web services instead? If you just want raw data, why not ask for it?
Ok
I am now able to strip all the HTML tags from the raw HTML (I can't use XML sites, though I wish I could).
So now that all the tags are off, I should look for the keyword "Last", because the information I need to extract from the data I received is near it.
The following code strips the HTML tags.
import java.io.IOException;
import java.io.OutputStreamWriter;
import javax.swing.text.html.HTMLEditorKit;

// Parser callback that writes out only the text found between HTML tags.
public class TagStripper extends HTMLEditorKit.ParserCallback {

    private final OutputStreamWriter out;

    public TagStripper(OutputStreamWriter out) {
        this.out = out;
    }

    // Called by the parser for each run of text; tags never reach this method.
    @Override
    public void handleText(char[] text, int position) {
        try {
            out.write(text);
            out.flush();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}
I was thinking that somewhere in the [code]handleText(char[] text, int position)[/code] method I could search the data that's about to be written out for the keyword "Last". I saw that the String class has a method called indexOf(), which returns the index where the supplied substring starts.
But my question is: how do I make a String and copy all the data into it, since the data goes to an OutputStreamWriter?
Please give me some direction.
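One possible direction, as a rough sketch rather than a definitive answer: an OutputStreamWriter cannot be read back, so the callback can append each chunk of text to a StringBuilder instead and hand back a String once parsing is done. The names TextCollector, extractText and tokenAfter are invented for this example, the sample HTML is made up, and the JDK's ParserDelegator is assumed as the parser driving the callback.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical variant of TagStripper that keeps the text in memory.
class TextCollector extends HTMLEditorKit.ParserCallback {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int position) {
        // A space between chunks keeps neighbouring table cells apart.
        text.append(data).append(' ');
    }

    String getText() {
        return text.toString();
    }

    // Parses an HTML string and returns only its text content.
    static String extractText(String html) {
        TextCollector collector = new TextCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.getText();
    }

    // Returns the first whitespace-delimited token after keyword, or null.
    static String tokenAfter(String text, String keyword) {
        int start = text.indexOf(keyword);
        if (start < 0) {
            return null;
        }
        String rest = text.substring(start + keyword.length()).trim();
        return rest.isEmpty() ? null : rest.split("\\s+")[0];
    }

    public static void main(String[] args) {
        // Invented sample markup; real pages will be laid out differently.
        String html = "<html><body><table><tr>"
                + "<td>Last</td><td>42.17</td>"
                + "</tr></table></body></html>";
        System.out.println(tokenAfter(extractText(html), "Last"));
    }
}
```

Searching the full String after parsing, rather than inside handleText, avoids missing a value that arrives in a later chunk than the keyword. Note that indexOf() finds the first occurrence only, so a page where "Last" appears earlier in some other label would need a more specific keyword.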