Understanding the modification of a web page?
Hi all,
Is there a way in Java to tell whether a web page has been modified since a given date? Or can it be implemented?
What I precisely need: I am creating a web spider that will be reusable. I record the web pages I visit and the times I visit them. When I re-run my spider, it should only visit the web pages that have been modified since my last visit. So I need to learn when each web page was last modified.
Answers that show how to get the length (in KB) of the page are also welcome, since I can compare the current length with the stored length to tell whether the page has been modified.
I have already tried the getLastModified() and getContentLength() methods of URLConnection, but unfortunately they don't work, since a lot of web pages don't include those values in the response headers.
Thanks for all help...
in my experience, 80% of the crawler's time is spent parsing/scraping the pages it encounters. therefore you can visit the pages and download them, but only parse them if they have changed. the best way to track changes is to calculate the md5 sum of each page and store it in a (URL, MD5) hashmap.
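To make the (URL, MD5) hashmap idea concrete, here is a minimal sketch using only the standard MessageDigest class. The class name PageChangeTracker and the method names md5Hex and hasChanged are made up for illustration and are not taken from the Checksum class linked in this thread. It assumes you already have the downloaded page as a byte array.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers one MD5 checksum per URL.
final class PageChangeTracker {
    // URL -> hex MD5 of the content seen on the previous visit
    private final Map<String, String> checksums = new HashMap<>();

    // Returns the MD5 digest of the content as 32 lowercase hex characters.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True if the URL is new or its content differs from the last visit.
    // The stored checksum is updated either way.
    boolean hasChanged(String url, byte[] content) {
        String current = md5Hex(content);
        String previous = checksums.put(url, current);
        return !current.equals(previous);
    }
}
```

The spider would call hasChanged(url, bytes) after each download and skip the parsing step when it returns false. To survive between runs, the map would have to be written to disk, for example next to the visit log.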
i have an md5 class that you can use:
http://doesthatevencompile.com/current-projects/code-sniplets/Checksum.htm
the lib that class requires can be found in
http://doesthatevencompile.com/current-projects/code-sniplets/lib
Hi, thanks for the response. Can you explain the MD5 part in more detail? This is the first time I have heard about it, and a fuller explanation would make it easier for me to understand the code and integrate it into mine. Thanks, regards...
well, you can google it.
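all you need to know is that an md5 is a short fixed-size fingerprint (a 128-bit number, a hex string in this case) computed from bigger data (a file in this case). it is not strictly unique, but two different files accidentally producing the same md5 is extremely unlikely.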
so if the file is modified, the md5 that is calculated from the file will not match the old md5, and since md5 is a rather small number, you can store it in a data structure instead of storing the whole file
i doubt i did a very good job explaining it, so google it
Hi, I think I understand it quite well. However, doesn't it need to read the whole file? If it does, what is the difference between that and writing the page to a temp file and comparing the lengths (in KB) to detect the modification? Thanks again.
performance difference:
your way would be faster (don't know by how much), but less accurate, since a page can change without its length changing.
md5 is slower, but almost 100% accurate.
only you can decide which method you should use.
i would use md5 if the pages i was crawling were huge and parsing each one was a significant performance hit, so that parsing a page an extra time would be really bad.
on the other hand, if the pages are small, parsing is not cpu intensive, and you want a quick way to code something and compare, use the size method
Hi, oh, ok. I get it now, thanks for pointing out this method. I still wonder if there is an easier (less expensive) method or algorithm out there. If anybody knows one, please tell me too. Thanks all.
what you propose here is to find a way to know if a page was changed without looking at it, i.e. without downloading it.
i don't think this is possible, unless of course you can find it out somehow from the server response. look into that, but i doubt it exists.
what are you doing anyway, what crawler are you using? are you rolling your own?
Hi,
Yes, that's 100% true. I am trying to find a method to detect the modification without downloading the whole page. It would work very fast if there is one.
I have written my own web spider and am trying to improve it. It only downloads Turkish web pages, which I need for a project. While the spider runs, I create a log file that records the web pages visited and the time of each visit. To make the spider reusable, I have to somehow tell whether a page has been modified or not. That's more or less the whole scenario. Hope that explains more about my problem.
Thanks for help...
well actually now that i think about it, you CAN use http response headers.
for example for this page the response headers are:
Server: Sun Java System Web Server 6.1
Date: Fri, 29 Jun 2007 14:34:42 GMT
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en
Set-Cookie: jive.user.lastvisited=;Expires=Sun, 29-Jul-2007 14:34:42 GMT;Path=/
Transfer-Encoding: chunked
200 OK
so that doesn't give you anything, so you must fall back on downloading the page to see if it was changed.. but another site has this response header:
Date: Fri, 29 Jun 2007 14:36:08 GMT
Server: Apache
Vary: Host
X-Powered-By: PHP/4.4.7
Keep-Alive: timeout=2, max=15000
Connection: Keep-Alive
Content-Type: text/html
Content-Encoding: gzip
Content-Length: 3962
200 OK
which gives you the Content-Length: 3962
also google says there is a Last-Modified header which some servers may use.
you might be able to use a hybrid technique to speed up your checking.
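One possible shape for such a hybrid check, as a rough sketch only: send a HEAD request with an If-Modified-Since date, trust the answer when the server gives one, and otherwise report "unknown" so the caller falls back on downloading and checksumming. The class name HeaderCheck, the Verdict enum and both method names are invented for this example. The HttpURLConnection calls (setRequestMethod, setIfModifiedSince, getResponseCode, getLastModified) are standard. Whether a given server honours HEAD or If-Modified-Since is an assumption that has to be tested per site.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: asks the server first, before downloading anything.
final class HeaderCheck {
    enum Verdict { UNCHANGED, CHANGED, UNKNOWN }

    // Pure decision logic, kept separate so it can be tested offline.
    // lastModified is 0 when the server sent no Last-Modified header.
    static Verdict decide(int status, long lastModified, long lastVisit) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return Verdict.UNCHANGED;          // server answered 304
        }
        if (lastModified > 0) {
            return lastModified > lastVisit ? Verdict.CHANGED : Verdict.UNCHANGED;
        }
        return Verdict.UNKNOWN;                // no usable header: download and hash
    }

    // Sends a HEAD request (headers only, no body) for the given URL.
    // lastVisit is the time of the previous crawl in epoch milliseconds.
    static Verdict probe(String url, long lastVisit) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setIfModifiedSince(lastVisit);
            return decide(conn.getResponseCode(), conn.getLastModified(), lastVisit);
        } finally {
            conn.disconnect();
        }
    }
}
```

Only pages that come back as CHANGED or UNKNOWN would need a full download. A Content-Length comparison against the stored length could be added to decide in the same way, with the caveat that a page can change without its length changing.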
Hi,
Yes, that was the point! I have already thought about an algorithm that takes advantage of the HTTP length and last-modified headers. However, the problem is that most web sites, especially Turkish ones unfortunately, don't include them. In addition, this info can also be misleading, which is even worse (since it is set by other people like us). That's exactly where I am stuck, and of course if there is a cheaper way to build it, I want to take advantage of it. :)
Again, thanks for all your help...
yes, i would say you are screwed™
Hi, thanks for the great support ;-) Help from anyone else is also appreciated :) Regards....