Webpage (HTML) parsing...
Any ideas on how to parse an HTML page? I'm trying to do it with a StreamTokenizer but with little success. I don't think this class was made for this sort of thing, ordinarily anyway. Is there a better choice? StringTokenizer? Here's what I have so far:
// Fragment: url is a java.net.URL created elsewhere; needs java.io.* and java.net.*
URLConnection uc = url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(uc.getInputStream()));
StreamTokenizer stok = new StreamTokenizer(br);
stok.eolIsSignificant(false);
for (int i = 0; stok.nextToken() != StreamTokenizer.TT_EOF; i++) {
    System.out.println("token #" + i + stok.toString());
}
It gives me a result like this:
token #0Token['<'], line 3
token #1Token[script], line 3
token #2Token[language], line 3
token #3Token['='], line 3
token #4Token[javascript], line 3
token #5Token['>'], line 3
token #6Token['<'], line 4
token #7Token['!'], line 4
token #8Token['-'], line 4
token #9Token['-'], line 4
token #10Token[function], line 5
token #11Token[dojump], line 5
token #12Token['('], line 5
token #13Token[')'], line 5
token #14Token['{'], line 6
token #15Token[document.location.href], line 7
token #16Token['='], line 7
token #17Token[play247.asp?page=promo&id=72&r=R2], line 7
What I want is all the links that have "promo" as a parameter e.g. <a href=play247.asp?page=promo&id=72&r=R2>. Any suggestions?
By [johnnyr77] at [2007-9-27 21:02:44]

Use a regular expression:
..
import java.util.regex.*;
..
/*
 * Matching groups:
 *   scheme = $2, authority = $4, path = $5,
 *   query = $7, fragment = $9
 */
public boolean chkURI(String source) {
    final String re =
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?$";
    //    12            3  4          5       6   7        8 9
    //Matcher m = Pattern.compile(re).matcher(source);
    //if (m.matches())
    //    for (int j = 0; j <= m.groupCount(); j++)
    //        System.out.println(j + ") " + m.group(j));
    return source.matches(re);
}
..
Uncomment the lines inside the method to print the groups, and also check whether "promo" is part of the source string. There are many ways to solve the problem.
Regards.
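To make the suggestion above concrete, here is a minimal sketch that reuses the same pattern, pulls out the query (group 7) and looks for a promo parameter. The class name UriParts and the helper names are made up for illustration, and treating "has promo as a parameter" as an exact page=promo pair is an assumption based on the example link in the question. It only checks a URI string you already have; finding the links in the page is a separate step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the thread.
final class UriParts {
    // Same expression as in the post above, compiled once.
    private static final Pattern URI = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?$");

    // Returns the query part (group 7), or null if there is none.
    static String query(String uri) {
        Matcher m = URI.matcher(uri);
        return m.matches() ? m.group(7) : null;
    }

    // Assumption: a "promo" link is one whose query contains the pair page=promo.
    static boolean isPromo(String uri) {
        String q = query(uri);
        if (q == null) {
            return false;
        }
        for (String pair : q.split("&")) {
            if (pair.equals("page=promo")) {
                return true;
            }
        }
        return false;
    }
}
```

With the link from the question, UriParts.isPromo("play247.asp?page=promo&id=72&r=R2") returns true, while a link without a query returns false.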
Have you considered the javax.swing.text.html.HTMLEditorKit.Parser and javax.swing.text.html.HTMLEditorKit.ParserCallback classes? What you do is extend HTMLEditorKit.ParserCallback to handle anchor tags (HTML.Tag.A). Check it out.
> Use regular expression:
It is not possible to write a regular expression which will handle all legal HTML. So obviously a parser would be a better idea.
Extend javax.swing.text.html.HTMLEditorKit.ParserCallback and, every time handleStartTag is called for a hyperlink (HTML.Tag.A), check whether a.getAttribute(HTML.Attribute.HREF), where a is the attribute set passed to handleStartTag, contains the string "promo", for example with str.indexOf("promo") >= 0. If it does, you have your link.
HTH
Ashish H.
If you need more help, mail me along with a URL to this thread.
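A rough sketch of the callback approach described above, in case it helps. The class name PromoLinkFinder and its two static helpers are invented for this example; the Swing classes and handleStartTag are the standard ones. It assumes that a plain substring test for "promo" in the href is good enough.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical class name; collects href values of <a> tags that mention "promo".
final class PromoLinkFinder extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            // Assumption: a substring match on "promo" is sufficient.
            if (href != null && href.toString().contains("promo")) {
                links.add(href.toString());
            }
        }
    }

    // Parses the HTML from the reader and returns the matching links.
    static List<String> find(Reader html) throws IOException {
        PromoLinkFinder finder = new PromoLinkFinder();
        new ParserDelegator().parse(html, finder, true);
        return finder.links;
    }

    // Convenience wrapper for HTML that is already in a String.
    static List<String> findInString(String html) {
        try {
            return find(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a live page you would pass a reader over the connection's input stream to find(...) instead of a StringReader.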
Check out Voodoo Exam - http://www.geocities.com/gnashes30/java/mock.htm - A freeware mock exam for SCJP aspirants; anyone can extend it for other certs as well.
Also see: http://htmlparser.sourceforge.net/
HTH
> "It is not possible to write a regular expression which will handle all legal html."
I wouldn't post that in a Perl forum because some overzealous Perl guru will prove you wrong.
> I wouldn't post that in a Perl forum because some overzealous Perl guru will prove you wrong.
No, they won't. It's provable that no regular expression can match all legal HTML. You can nest table tags to any level; apply the pumping lemma.
> "It is not possible to write a regular expression
> which will handle all legal html."
>
> I wouldn't post that in a Perl forum because some
> overzealous Perl guru will prove you wrong.
I base that statement on a similar discussion in the book "Mastering Regular Expressions". The author could be wrong but they seem to know what they are talking about.
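Hi johnny, your best bet will be to use the HTMLEditorKit as follows:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ListLinks {
    public static void main(String[] args) throws IOException, BadLocationException {
        if (args.length != 1) {
            System.out.print("bla bla bla"); // usage message goes here
            System.exit(1);
        }
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        HTMLEditorKit htmlkit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlkit.createDefaultDocument();
        HTMLEditorKit.Parser parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
        parser.parse(in, callback, true);
        callback.flush(); // push buffered elements into the document, as HTMLEditorKit.read does
        in.close();
        // Walk every <a> element and print its href attribute.
        for (HTMLDocument.Iterator iterator = htmlDoc.getIterator(HTML.Tag.A);
                iterator.isValid();
                iterator.next()) {
            AttributeSet attributes = iterator.getAttributes();
            String href = (String) attributes.getAttribute(HTML.Attribute.HREF);
            System.out.println(href);
        }
        System.exit(0);
    }
}
GOOD LUCK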
Well, the real trouble is the need to ensure that the HTML code is technically correct. It is in fact technically impossible to write a single regular expression which will match any and all valid HTML code, or for that matter C/C++ or Java code. This is because regular expressions do not provide any way to differentiate nesting levels of things such as brackets and quotation marks. In other words, a single regular expression cannot, say, count the parentheses in an equation to see if every opening parenthesis is properly matched by a closing parenthesis.
Note that this is no real problem, as it's very easy to deal with this in a preprocessing stage. All you need is a Reader and Writer, and a single integer for each "nestable" character set (as in for Java, one integer for the characters "(" and ")", one for "{" and "}", and so on). You scan the characters from the original text, writing them to the new text as you go. When you come to a "nestable" character, you write it, add in the appropriate nesting level expressed as a character, and alter the nesting level. For example, the mathematical expression "3*(5+(8/9.0))" would be converted to "3*(\0x00005+(\0x00018/9.0)\0x0001)\0x0000" (I apologize for the difficulty in reading this.) If the code is invalid due to improper use of nesting (examples: "3+((5)", ")4("), you'll discover this during this preprocessing stage, before even starting to parse it.
In the case of HTML, the definition "nestable character" would also include any valid HTML tag which comes in an opening and closing variety, and the preprocessor would have to be able to acceptably handle strange nesting behavior such as "<b><i>This</b></i>", which browsers tolerate. I've found that after this step (and sometimes a few other basic preprocessing steps), regular expressions can be used to match valid code in most programming, scripting, or document formatting languages.
Note that preprocessing is only necessary if you have an actual need to write a regular expression which matches any and all valid code. If you don't care about how your program will behave when given invalid code, regular expressions will do just fine for most pattern matching purposes.
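For what it's worth, the nesting-level preprocessing described above could look roughly like this for a single pair of characters. It is only a sketch: the class name NestingAnnotator is made up, it works on a String rather than a Reader and Writer, and it handles "(" and ")" only, where a fuller version would keep one counter per nestable pair.

```java
// Hypothetical helper; annotates each parenthesis with its nesting level.
final class NestingAnnotator {
    // After every '(' or ')' a char whose numeric value is the nesting level
    // is written, so a matching pair carries the same level.
    // Throws IllegalArgumentException if the parentheses do not balance.
    static String annotate(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                out.append(c).append((char) depth);
                depth++;
            } else if (c == ')') {
                if (depth == 0) {
                    throw new IllegalArgumentException("unmatched ')' at index " + i);
                }
                depth--;
                out.append(c).append((char) depth);
            } else {
                out.append(c);
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException(depth + " unclosed '('");
        }
        return out.toString();
    }
}
```

Calling NestingAnnotator.annotate("3*(5+(8/9.0))") gives the annotated form from the example, and "3+((5)" or ")4(" is rejected with an exception before any pattern matching starts.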
Is <b><i></b></i> valid? I thought it wasn't.
It isn't but you still have to deal with it. Otherwise your users will say "But (insert name of promiscuous browser here) displays it, what's your problem?"
> Well, the real trouble is the need to ensure that the
> HTML code is technically correct. It is in fact
> technically impossible to write a single regular
> expression which will match any and all valid HTML
> code, or for that matter C/C++ or Java code.
Just to clarify...
If there is such a thing as "valid" HTML code, then you can write Java or C/C++ code to validate it. Saying that you can't write a regular expression to validate it is not the same as saying that it can't be validated.
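As a small illustration of checking nesting in ordinary code, here is a sketch that uses a stack to test whether tags are properly nested. The class name TagNesting is invented, and this is a toy under loud assumptions: it ignores comments, scripts, and elements without a closing tag such as <br>, so it is nowhere near a real HTML validator.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; checks tag nesting with a stack.
final class TagNesting {
    // Finds one opening or closing tag; group 1 is "/" for a closing tag,
    // group 2 is the tag name. The regex only finds tags; the stack does the counting.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static boolean isProperlyNested(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                open.push(name); // opening tag
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false; // closing tag does not match the innermost open tag
            }
        }
        return open.isEmpty(); // anything left open is also an error
    }
}
```

Here isProperlyNested("<b><i>This</i></b>") is true and isProperlyNested("<b><i>This</b></i>") is false, and tables nested to any depth are handled because the stack can grow as needed.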