Multithreaded URL Fetching
Dear All,
This question is about programming a web spider to run as fast as possible. The spider starts with a URL, downloads the file, extracts all links, downloads those links, and so on, in a breadth-first manner.
If we wish to make full use of the bandwidth of the machine running the spider, creating more than one thread to fetch pages makes sense, right? That way multiple threads can fetch pages concurrently. However, how can we hand over execution time so that other threads can run while the current one is waiting for a response from the server? Does it make sense to call Thread.yield just before fetching a page, or just after?
Pseudo-code Fetcher Thread:
fetch(URL url) {
    source = download(url)
    links = extractLinks(source)
    for-each link in links
        fetch(link)
}
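One note on the pseudo-code: a recursive fetch walks the links depth-first, while breadth-first order needs a queue (or a level-by-level loop) plus a set of already-seen URLs so that cycles terminate. Below is a minimal Java sketch of that idea, not a definitive implementation. The names BfsCrawler, crawl, linksOf and maxPages are made up for illustration, and linksOf is a hypothetical stand-in for download(url) followed by extractLinks(source), so the example runs against an in-memory map instead of the network.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

public class BfsCrawler {
    // Crawls breadth-first from start, fetching all pages of one level in parallel.
    // linksOf is an assumed stand-in for download(url) + extractLinks(source).
    public static List<String> crawl(String start,
                                     Function<String, List<String>> linksOf,
                                     int threads, int maxPages) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> seen = new LinkedHashSet<>(); // remembers visited URLs in discovery order
        seen.add(start);
        List<String> level = List.of(start);
        try {
            while (!level.isEmpty()) {
                // One task per URL on the current level.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : level) {
                    tasks.add(() -> linksOf.apply(url));
                }
                // invokeAll blocks until every task is done and keeps task order.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> f : pool.invokeAll(tasks)) {
                    for (String link : f.get()) {
                        if (seen.size() < maxPages && seen.add(link)) {
                            next.add(link); // first time we see this URL
                        }
                    }
                }
                level = next;
            }
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) throws Exception {
        // Tiny fake "web" used only for this example.
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d", "a"),
            "c", List.of("d"),
            "d", List.of());
        System.out.println(crawl("a", web::get, 4, 100)); // prints [a, b, c, d]
    }
}
```

The seen set is only touched by the calling thread, so it needs no locking; the worker threads do nothing but fetch.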
Any thoughts welcome,
Cheers /F
[967 byte] By [fw9189a] at [2007-11-27 9:46:21]

What you're suggesting is very much frowned upon by people running websites, as it tends to seriously bog down their servers and bandwidth. It's also almost universally used as a means to steal content, something else that's not regarded as a nice thing to do.
Well, as long as you obey robots.txt and don't revisit the same site too frequently, I can't see why this would be such a bad thing. It all depends on what you use the crawled data for. Google and Yahoo crawl millions of sites each day.
Btw, this question is about how to program a multi-threaded application so that it makes efficient use of the time spent waiting on socket connections.
> However, how can we delegate
> execution time so that other threads can run while
> the current one is waiting for a response from the
> server?
It happens automatically.
> Does it make sense to use Thread.yield just
> before fetching a page, or just after?
Neither. A thread that is waiting on a socket is blocked, and a blocked thread uses no CPU time; the scheduler runs the threads that are not blocked in the meantime. There is nothing you have to do about it, and nothing you can do about it.
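To see this for yourself, here is a small sketch. It assumes that Thread.sleep is a fair stand-in for a blocking socket read (no real network I/O happens), and the names BlockingDemo and run are made up for the example. It starts several simulated fetches on a thread pool, never calls Thread.yield, and measures the total elapsed time.

```java
import java.util.*;
import java.util.concurrent.*;

public class BlockingDemo {
    // Runs `tasks` simulated fetches on `threads` threads; returns elapsed milliseconds.
    // Thread.sleep is an assumed stand-in for waiting on a server response.
    public static long run(int tasks, int threads, long waitMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Callable<Void>> jobs = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            jobs.add(() -> {
                Thread.sleep(waitMillis); // the thread is blocked here, using no CPU
                return null;
            });
        }
        long start = System.nanoTime();
        try {
            // invokeAll returns once every job has finished.
            for (Future<Void> f : pool.invokeAll(jobs)) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // 8 waits of 200 ms each, one thread per wait.
        System.out.println("8 threads: " + run(8, 8, 200) + " ms");
        // Same work on a single thread, for comparison.
        System.out.println("1 thread:  " + run(8, 1, 200) + " ms");
    }
}
```

With one thread per fetch, the total should come out close to a single wait rather than the sum of all eight, because the waits overlap without any help from the program. The single-thread run should take roughly eight times as long.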
ejpa at 2007-7-12 23:56:47
