How do I obtain data from a web site?

I am trying to use information from a sports stats page to automate scoring on a fantasy football site I am developing. After the week's games have completed, I want to be able to extract the data: passing yards, passing TDs, etc. I can pull in a web page and get to the point where the QB stats start. My question: is there a way to strip all the HTML tags and be left with just the player's name and numbers (stats)? I skip down 366 lines to get to the example below. I was thinking of tokenizing using > as the delimiter, then tokenizing again using < as the delimiter to strip away the rest of the tag. Any suggestions are much appreciated, thanks.

CODE

import java.io.*;
import java.net.*;

public class WebRipper
{
    public static void main(String[] argv)
    {
        int count = 0;
        try {
            // Create a URL for the desired page
            URL url = new URL("http://www.fftoday.com/stats/playerstats.php?Season=2006&GameWeek=1&PosID=10&LeagueID=1");
            // Read all the text returned by the server
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String str;
            while ((str = in.readLine()) != null)
            {
                // Skip everything before the QB stats (hard-coded line offset)
                if (count > 366)
                {
                    System.out.println(str);
                }// end if
                count++;
            }// end while
            in.close();
        }// end try
        // Report failures instead of silently swallowing them
        catch (MalformedURLException e) { e.printStackTrace(); }
        catch (IOException e) { e.printStackTrace(); }
    }// end main
}// end WebRipper

OUTPUT (this is just an example of the first few lines of output)

<TD CLASS="sort1" ALIGN="LEFT" BGCOLOR="#ffffff"> 1. <A HREF="playerprofile.php?PlayerID=1607&LeagueID=1">Donovan McNabb</A></TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">PHI</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">24</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">35</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">314</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">3</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">7</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">0</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#e0e0e0">28.4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">28.4</TD>
</TR>

[2659 byte] By [Xantom07a] at [2007-11-27 11:42:05]
# 1

Search the web for: InfoFetcher.java

tjacobs01a at 2007-7-29 17:42:04 > top of Java-index,Java Essentials,Java Programming...
# 2

lol, I just got done with a program similar to this. Use the Pattern and Matcher classes, something like this:

Pattern p = Pattern.compile("(?<=\\>)[^<]+");
Matcher matcher = p.matcher(yourWebInfo);
while (matcher.find())
{
    // each match is the text between a '>' and the next '<'
    String textWithoutHtml = matcher.group();
}

Oh, also import these:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

And if you want it to run faster, substring your data as you stream it in, if you only want a portion of the page rather than the whole website :)
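A self-contained sketch of the approach above, run against a shortened copy of the sample row from the first post. The class name TagStripper and the method extractText are made up for illustration. Trimming each match and dropping whitespace-only matches is an added assumption, so that the newlines between cells do not show up as results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagStripper {
    // Text that follows a '>' and runs up to (not including) the next '<'.
    private static final Pattern TEXT_BETWEEN_TAGS = Pattern.compile("(?<=>)[^<]+");

    /** Returns the trimmed, non-blank text runs found between tags. */
    static List<String> extractText(String html) {
        List<String> cells = new ArrayList<>();
        Matcher matcher = TEXT_BETWEEN_TAGS.matcher(html);
        while (matcher.find()) {
            String text = matcher.group().trim();
            if (!text.isEmpty()) {
                cells.add(text);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Shortened sample: the name cell plus two of the stat cells.
        String row =
            "<TD CLASS=\"sort1\" ALIGN=\"LEFT\" BGCOLOR=\"#ffffff\"> 1. "
            + "<A HREF=\"playerprofile.php?PlayerID=1607&LeagueID=1\">Donovan McNabb</A></TD>\n"
            + "<TD CLASS=\"sort1\" ALIGN=\"center\" BGCOLOR=\"#ffffff\">PHI</TD>\n"
            + "<TD CLASS=\"sort1\" ALIGN=\"center\" BGCOLOR=\"#ffffff\">314</TD>\n";
        // prints [1., Donovan McNabb, PHI, 314]
        System.out.println(extractText(row));
    }
}
```

Note that the rank prefix ("1.") comes back as its own entry, so the caller still has to decide which positions hold the name and which hold stats.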

Message was edited by:

mark07

mark07a at 2007-7-29 17:42:04
# 3

> I skip down 366 lines to get to the example below.

Sounds brittle. Have you thought about parsing the HTML?

~

yawmarka at 2007-7-29 17:42:04
# 4

Most certainly it is; it was just the fastest way to get to what I wanted. It will not be in the finished product ;)

Xantom07a at 2007-7-29 17:42:04
# 5

Once you learn how to do regex, it makes this a whole lot easier. The one I gave you worked in my case; I'm not sure whether there is other stuff you want from your HTML.

mark07a at 2007-7-29 17:42:04
# 6

> once you learn how to do regex it makes this a whole lot easier

Using regex on HTML is brittle. That's why I recommended looking at parsing instead.

~

yawmarka at 2007-7-29 17:42:04
# 7

Here is the first string I would want to obtain. Can you explain how I would get the player name, then all of his stats? Remember there are ~366 lines of text before you get to this point. Also, where should I dump the info to?

For example, from the info below I want Donovan McNabb, then all of his stats, excluding all the tags.

<TD CLASS="sort1" ALIGN="LEFT" BGCOLOR="#ffffff"> 1. <A HREF="playerprofile.php?PlayerID=1607&LeagueID=1">Donovan McNabb</A></TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">PHI</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">24</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">35</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">314</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">3</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">7</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">0</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#e0e0e0">28.4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">28.4</TD>
</TR>

Pattern p = Pattern.compile("(?<=\\>)[^<]+");
Matcher matcher = p.matcher(yourWebInfo);
while (matcher.find())
{
    String textWithoutHtml = matcher.group();
}

Thanks.

Xantom07a at 2007-7-29 17:42:04
# 8

> Here is the first string I would want to obtain, can you explain how I would get the player name, then all of his stats.

I'd recommend parsing the HTML using a third-party library such as JTidy. You can then walk through the DOM, processing as needed.
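JTidy is a separate download, so it is not shown here. As a stand-in, the sketch below uses the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) to collect the text of each <TD> cell, grouped by <TR> row. This is a callback-style parser rather than a DOM walk, so it is a different technique from the JTidy suggestion, but it illustrates the same idea of letting a parser find the cells. StatsTableParser and parseRows are hypothetical names, and the sample markup is a shortened row wrapped in a minimal table.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class StatsTableParser extends HTMLEditorKit.ParserCallback {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow = new ArrayList<>();
    private StringBuilder currentCell; // non-null only while inside a <TD>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TD) {
            currentCell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (currentCell == null) {
            return; // text outside a table cell is ignored
        }
        String chunk = new String(data).trim();
        if (chunk.isEmpty()) {
            return;
        }
        // Join text pieces of one cell (e.g. rank and linked name) with a space.
        if (currentCell.length() > 0) {
            currentCell.append(' ');
        }
        currentCell.append(chunk);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TD && currentCell != null) {
            currentRow.add(currentCell.toString());
            currentCell = null;
        } else if (tag == HTML.Tag.TR && !currentRow.isEmpty()) {
            rows.add(currentRow);
            currentRow = new ArrayList<>();
        }
    }

    /** Parses the HTML and returns one list of cell texts per table row. */
    static List<List<String>> parseRows(String html) {
        StatsTableParser callback = new StatsTableParser();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.rows;
    }

    public static void main(String[] args) {
        String page =
            "<HTML><BODY><TABLE><TR>"
            + "<TD CLASS=\"sort1\"> 1. <A HREF=\"playerprofile.php?PlayerID=1607&LeagueID=1\">Donovan McNabb</A></TD>"
            + "<TD CLASS=\"sort1\">PHI</TD>"
            + "<TD CLASS=\"sort1\">314</TD>"
            + "</TR></TABLE></BODY></HTML>";
        for (List<String> row : parseRows(page)) {
            System.out.println(row);
        }
    }
}
```

Each row comes back as a list of strings, with the name cell first and the stat cells after it, which is a convenient shape for writing to a database table. In real use the string would be replaced by the text read from the stats page.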

~

yawmarka at 2007-7-29 17:42:04
# 9

import java.util.regex.Pattern;
import java.util.regex.Matcher;

String[] player = new String[1];
String[] revised = new String[1];
// NOTE: 366 is used here as a character offset into str, not a line count
int end = str.indexOf("</TD>", 366);
player[0] = str.substring(366, end); // a one-element array only has index 0
Pattern p = Pattern.compile("(?<=\\>)[^<]+");
Matcher matcher = p.matcher(player[0]);
revised[0] = "";
while (matcher.find())
{
    // collect the tag-free text separately from the raw HTML
    revised[0] += matcher.group();
}

mark07a at 2007-7-29 17:42:04
# 10

Maybe I should explain better how this is all working.

The user will log in, then choose and submit players to the database (MySQL). After the week is completed, I want to be able to obtain stats from a web site, ESPN for example. The player name and weekly stats will be used to update the scoreboard each week. Maybe there would be a better site to get the info from. Thanks again.

Xantom07a at 2007-7-29 17:42:05
# 11

String[] player = new String[1];

String[] revised = new String[1];

int end = str.indexOf("</TD>", 366);

player[1] = str.substring(366, end);

///~

Good luck with that. :o)

~

yawmarka at 2007-7-29 17:42:05
# 12

> Maybe there would be a better site to get the info from.

Have you checked for any APIs, services, or stat feeds?

~

yawmarka at 2007-7-29 17:42:05
# 13

What website are you using? And yes, I know it wouldn't work awesome, but it would work :)

mark07a at 2007-7-29 17:42:05
# 14

> And yes, I know it wouldn't work awesome, but it would work :)

As long as the server always generated exactly the same HTML output, yeah... ;o)

~

yawmarka at 2007-7-29 17:42:05
# 15

Yeah, well, in my case it's a business that just updates table fields, so I can use it, but that's why I asked for a website :0 Anyway, you can usually rely on <> for the end of stuff.

mark07a at 2007-7-29 17:42:09
# 16

> Yeah, well, in my case it's a business that just updates table fields

Right; a controlled environment in which you can (typically) rely on a set specification. The OP doesn't have that luxury.

> Anyway, you can usually rely on <> for the end of stuff

It's that "usually" part that makes using regex for "out-in-the-wild" HTML brittle and not terribly useful.
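To make the brittleness concrete, here is a small sketch that runs the regex from earlier in the thread against two variations of the markup. Both variations are hypothetical (an empty cell, and an attribute value containing '>'); nothing in the thread says the stats site actually produces them. RegexBrittlenessDemo and extract are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexBrittlenessDemo {
    // Same idea as the regex posted earlier: text after '>' up to the next '<'.
    private static final Pattern TEXT_BETWEEN_TAGS = Pattern.compile("(?<=>)[^<]+");

    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TEXT_BETWEEN_TAGS.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Well-behaved markup: three cells, three values.
        // prints [24, 35, 314]
        System.out.println(extract("<TD>24</TD><TD>35</TD><TD>314</TD>"));

        // Hypothetical empty cell: it yields no match, so the later
        // values shift position without any warning.
        // prints [24, 314]
        System.out.println(extract("<TD>24</TD><TD></TD><TD>314</TD>"));

        // Hypothetical attribute value containing '>': the match starts
        // inside the tag and drags attribute text along with it.
        // prints [ PHI" CLASS="sort1">PHI]
        System.out.println(extract("<TD TITLE=\"team > PHI\" CLASS=\"sort1\">PHI</TD>"));
    }
}
```

Both variations are legal HTML that a parser would handle, which is the point being made about pages outside a controlled environment.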

~

yawmarka at 2007-7-29 17:42:09
# 17

Yup, that's why I love controlled environments: regex works great there, but outside them it's not reliable.

mark07a at 2007-7-29 17:42:09
# 18

> Yup, that's why I love controlled environments: regex works great there, but outside them it's not reliable.

It's probably good form then to caution people when you're providing that as a proposed solution.

~

yawmarka at 2007-7-29 17:42:09
# 19

True, but I thought he was getting this out of a table.

mark07a at 2007-7-29 17:42:09
# 20

> True, but I thought he was getting this out of a table.

That doesn't change anything I said.

~

yawmarka at 2007-7-29 17:42:09