How do I obtain data from a web site?
I am trying to use information from a sports stats page to automate scoring on a fantasy football site I am developing. After the week's games have completed, I want to be able to extract the data: passing yards, passing TDs, etc. I can pull in a web page and get to the point where the QB stats start. My question: is there a way to strip all the HTML tags and be left with just the player's name and numbers (stats)? I skip down 366 lines to get to the example below. I was thinking of tokenizing using > as the delimiter, then tokenizing again using < as the delimiter to strip away the rest of the tag. Any suggestions are much appreciated, thanks.
CODE
import java.net.*;
import java.io.*;

public class WebRipper
{
    public static void main(String[] argv)
    {
        int count = 0;
        try {
            // Create a URL for the desired page
            URL url = new URL("http://www.fftoday.com/stats/playerstats.php?Season=2006&GameWeek=1&PosID=10&LeagueID=1");
            // Read all the text returned by the server
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String str;
            while ((str = in.readLine()) != null)
            {
                // Skip the first 367 lines (count 0..366), then print the rest
                if (count > 366)
                {
                    System.out.println(str);
                }// end if
                count++;
            }// end while
            in.close();
        }// end try
        catch (MalformedURLException e) {
            // Report the problem instead of silently swallowing it
            System.err.println("Bad URL: " + e.getMessage());
        }
        catch (IOException e) {
            System.err.println("Read failed: " + e.getMessage());
        }
    }// end main
}// end WebRipper
OUTPUT (this is just an example of the first few lines of output)
<TD CLASS="sort1" ALIGN="LEFT" BGCOLOR="#ffffff"> 1. <A HREF="playerprofile.php?PlayerID=1607&LeagueID=1">Donovan McNabb</A></TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">PHI</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">24</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">35</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">314</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">3</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">7</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">0</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#e0e0e0">28.4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">28.4</TD>
</TR>
[2659 byte] By [Xantom07a] at [2007-11-27 11:42:05]
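For reference, the two-pass tokenizing idea from the question can be sketched with String.split. This is only an illustration on a shortened version of the sample row; the names TagSplitter and textPieces are made up here, and it assumes a literal < or > never appears inside an attribute value or in the visible text.

```java
import java.util.ArrayList;
import java.util.List;

class TagSplitter {
    // Pass 1: split on '>' so each piece looks like "text<start-of-next-tag".
    // Pass 2: cut each piece at its first '<' to drop the tag remainder.
    static List<String> textPieces(String html) {
        List<String> out = new ArrayList<>();
        for (String piece : html.split(">")) {
            int lt = piece.indexOf('<');
            String text = (lt >= 0 ? piece.substring(0, lt) : piece).trim();
            if (!text.isEmpty()) {   // ignore pieces that were only a tag or whitespace
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String row =
            "<TD CLASS=\"sort1\" ALIGN=\"LEFT\" BGCOLOR=\"#ffffff\"> 1. "
          + "<A HREF=\"playerprofile.php?PlayerID=1607&LeagueID=1\">Donovan McNabb</A></TD>\n"
          + "<TD CLASS=\"sort1\" ALIGN=\"center\" BGCOLOR=\"#ffffff\">PHI</TD>\n"
          + "<TD CLASS=\"sort1\" ALIGN=\"center\" BGCOLOR=\"#ffffff\">314</TD>\n"
          + "</TR>";
        System.out.println(textPieces(row));   // [1., Donovan McNabb, PHI, 314]
    }
}
```

Note that the rank ("1.") and the name come back as separate pieces because the link tag sits between them, so the caller still has to know which piece is which.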

lol, I just got done doing a program similar to this. Use the Pattern and Matcher classes, something like this:
Pattern p = Pattern.compile("(?<=\\>)[^<]+");
Matcher matcher = p.matcher(yourwebinfo);
while (matcher.find())
{
    // each match is one run of text between a '>' and the next '<'
    String infoWithoutHtml = matcher.group();
}
Oh, also import these:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
And if you only want a portion of the page rather than the whole website, substring your data as you stream it in so it runs faster :)
Message was edited by:
mark07
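A runnable version of the regex suggestion above might look like the sketch below. The class and method names (RegexCells, extract) are invented for illustration, and the input is a shortened copy of the sample row. One thing the snippet in the post does not mention: the pattern also matches the line breaks between tags, so the sketch trims each match and skips empty ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCells {
    // Text that directly follows a '>' and runs up to (not including) the next '<'.
    private static final Pattern TEXT_AFTER_TAG = Pattern.compile("(?<=>)[^<]+");

    static List<String> extract(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TEXT_AFTER_TAG.matcher(html);
        while (m.find()) {
            String text = m.group().trim();
            if (!text.isEmpty()) {   // skip whitespace-only matches between tags
                cells.add(text);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        String row =
            "<TD CLASS=\"sort1\" ALIGN=\"LEFT\" BGCOLOR=\"#ffffff\"> 1. "
          + "<A HREF=\"playerprofile.php?PlayerID=1607&LeagueID=1\">Donovan McNabb</A></TD>\n"
          + "<TD CLASS=\"sort1\" ALIGN=\"center\" BGCOLOR=\"#ffffff\">PHI</TD>\n"
          + "<TD CLASS=\"sort1\" ALIGN=\"center\" BGCOLOR=\"#ffffff\">28.4</TD>\n"
          + "</TR>";
        System.out.println(extract(row));   // [1., Donovan McNabb, PHI, 28.4]
    }
}
```

This works on the sample because every tag is closed with '>' and no cell text contains '<'. It carries no notion of rows or columns, which is part of why it breaks easily when the page layout changes.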
> I skip down 366 lines to get to the example below.
Sounds brittle. Have you thought about parsing the HTML?
~
Most certainly it is; it was just the fastest way to get to what I wanted. It will not be in the finished product ;)
Once you learn how to do regex, it makes this a whole lot easier. The one I gave you worked in my case; not sure if there is other stuff you want out of your HTML.
> once you learn how to do regex it makes this a whole lot easier
Regex and HTML is brittle. That's why I recommended looking at parsing instead.
~
Here is the first string I would want to obtain. Can you explain how I would get the player name, then all of his stats? Remember there are ~366 lines of text before you get to this point. Also, what or where should I dump the info to?
Example: from the info below I want Donovan McNabb, then all of his stats, excluding all the tags.
<TD CLASS="sort1" ALIGN="LEFT" BGCOLOR="#ffffff"> 1. <A HREF="playerprofile.php?PlayerID=1607&LeagueID=1">Donovan McNabb</A></TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">PHI</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">24</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">35</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">314</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">3</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">7</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">0</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#e0e0e0">28.4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">28.4</TD>
</TR>
Pattern p = Pattern.compile("(?<=\\>)[^<]+");
Matcher matcher = p.matcher(yourwebinfo);
while (matcher.find())
{
    // each match is one run of text between a '>' and the next '<'
    String infoWithoutHtml = matcher.group();
}
Thanks.
> Here is the first string I would want to obtain, can you explain how I would get the player name, then all of his stats.
I'd recommend parsing the HTML using a third-party library (3PL) such as JTidy. You can then walk through the DOM, processing as needed.
~
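To give an idea of what parsing instead of pattern matching looks like, here is a minimal sketch. Note the swap: it does not use JTidy, which is what the reply recommends, but the callback-based HTML parser that ships with the JDK (javax.swing.text.html.parser), so that it runs without extra jars. It also does not build a DOM; it simply collects cell text per table row. The class and method names (TableCells, rows) are made up, and the sample fragment is wrapped in html/body/table tags as an assumption about the surrounding page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TableCells {
    // Returns one list of cell texts per <TR> found in the given HTML.
    static List<List<String>> rows(String html) {
        final List<List<String>> rows = new ArrayList<>();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            private List<String> row;     // non-null only while inside a <TR>
            private StringBuilder cell;   // non-null only while inside a <TD>

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.TR) {
                    row = new ArrayList<>();
                } else if (t == HTML.Tag.TD && row != null) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (cell != null) {
                    // join text chunks (e.g. rank and linked name) with single spaces
                    cell.append(new String(data).trim()).append(' ');
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.TD && cell != null) {
                    row.add(cell.toString().trim());
                    cell = null;
                } else if (t == HTML.Tag.TR && row != null) {
                    rows.add(row);
                    row = null;
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<html><body><table><tr>"
            + "<TD CLASS=\"sort1\" ALIGN=\"LEFT\"> 1. "
            + "<A HREF=\"playerprofile.php?PlayerID=1607&LeagueID=1\">Donovan McNabb</A></TD>"
            + "<TD CLASS=\"sort1\" ALIGN=\"center\">PHI</TD>"
            + "<TD CLASS=\"sort1\" ALIGN=\"center\">314</TD>"
            + "</tr></table></body></html>";
        for (List<String> row : rows(html)) {
            System.out.println(row);
        }
    }
}
```

Because the parser tracks which tag it is in, the code can find the stats table by its structure rather than by counting 366 lines, and each row comes back as its own list of cells.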
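import java.util.regex.Pattern;
import java.util.regex.Matcher;

// Assumes str already holds the page text.
// Note: 366 is used here as a character offset into str, not a line count.
String[] player = new String[1];
String[] revised = new String[1];
int end = str.indexOf("</TD>", 366);
player[0] = str.substring(366, end);   // raw HTML for the first cell
revised[0] = "";
Pattern p = Pattern.compile("(?<=\\>)[^<]+");
Matcher matcher = p.matcher(player[0]);
while (matcher.find())
{
    revised[0] += matcher.group();     // collect the text with the tags removed
}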
Maybe I should explain better how this is all working.
The user will log in, then choose and submit players to the database (MySQL). After the week is completed, I want to be able to obtain stats from a web site, ESPN for example. The player name and weekly stats will be used to update the scoreboard each week. Maybe there would be a better site to get the info from. Thanks again.
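On the earlier question of where to dump the info: one option is to turn each row of cell strings into a small typed object before anything is written to the database. The sketch below is only an assumption about how that might look. The PlayerLine class is hypothetical, and since the thread never says what each numeric column means, the stats are kept as a plain list in page order rather than as named fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PlayerLine {
    final String name;
    final String team;
    final List<Double> stats;   // remaining columns, in the order they appear on the page

    PlayerLine(String name, String team, List<Double> stats) {
        this.name = name;
        this.team = team;
        this.stats = stats;
    }

    // Expects: name cell first (possibly prefixed with a rank such as "1. "),
    // then the team, then only numeric cells.
    static PlayerLine fromCells(List<String> cells) {
        String name = cells.get(0).replaceFirst("^\\d+\\.\\s*", "");
        String team = cells.get(1);
        List<Double> stats = new ArrayList<>();
        for (String s : cells.subList(2, cells.size())) {
            stats.add(Double.parseDouble(s.trim()));
        }
        return new PlayerLine(name, team, stats);
    }

    public static void main(String[] args) {
        PlayerLine p = fromCells(Arrays.asList(
            "1. Donovan McNabb", "PHI", "1", "1", "24", "35", "314",
            "3", "1", "4", "7", "0", "28.4", "28.4"));
        System.out.println(p.name + " (" + p.team + ") " + p.stats);
    }
}
```

From there each PlayerLine could be written to MySQL, for example with a JDBC PreparedStatement, against whatever table layout the site already uses.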
String[] player = new String[1];
String[] revised = new String[1];
int end = str.indexOf("</TD>", 366);
player[0] = str.substring(366, end);
///~
Good luck with that. :o)
~
> Maybe there would be a better site to get the info from.
Have you checked for any APIs, services, or stat feeds?
~
What website are you using? And yes, I know it wouldn't work awesome, but it would work :)
> and yes I know it wouldn't work awesome but it would work :)
As long as the server always generated exactly the same HTML output, yeah... ;o)
~
Ya, well, in my case it's a business that just updates table fields, so I can use it, but that's why I asked for a website :0 Anyway, you can usually rely on <> for the end of stuff.
> ya well in my case it's a business that just updates
> table fields
Right; a controlled environment in which you can (typically) rely on a set specification. The OP doesn't have that luxury.
> anyway you can usually rely on <> for end of stuff
It's that "usually" part that makes using regex for "out-in-the-wild" HTML brittle and not terribly useful.
~
Yup, that's why I love controlled environments, because regex works great there, but outside them it's not reliable.
> yup that's why I love controlled environments
> because regex works great but outside it's not
> reliable
It's probably good form then to caution people when you're providing that as a proposed solution.
~
True, but I thought he was getting this out of a table.
> true but I thought he was getting this out of a table
That doesn't change anything I said.
~