Detecting when files are not HTML when using HttpURLConnection/HTMLEditorKit

I have created a spider application that extracts information from a web page using a combination of HttpURLConnection and HTMLEditorKit.ParserCallback.

However, every so often it finds links to non-HTML resources,

i.e. PDFs, zip files, Excel spreadsheets, etc.

Would anybody know how I can check the type of a downloaded page without rooting through the headers (I am not sure what is going on in there) or checking the file extension? Checking extensions would end up as a massive list of banned URL extensions that would never be fully complete.

Thanks.

[573 byte] By [C_J_a] at [2007-10-3 4:39:40]
# 1

You should actually look at the headers first, IMO. Not knowing "what is going on in there" is a very bad reason to discard that solution. The Content-Type header is meant to indicate exactly the type of content you are getting (you want the text/html MIME type).

Not all servers report the correct content type in this header all the time, so it won't be a 100% solution.
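
For what it's worth, here is a rough sketch of what the header check could look like with a plain HttpURLConnection. The class and method names (ContentTypeCheck, isHtml, isHtmlPage) are made up for illustration, accepting application/xhtml+xml as well is my own assumption, and using a HEAD request is just one way to avoid downloading the body; some servers do not handle HEAD well, so treat this as a starting point rather than a finished solution.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

public class ContentTypeCheck {

    // Returns true if a Content-Type header value denotes an HTML document.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // header missing: type unknown, so do not parse as HTML
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        // Assumption: XHTML pages are also worth parsing; remove if not wanted.
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }

    // Asks the server for the headers only (HEAD) and checks Content-Type.
    // Only valid for http/https URLs; other schemes would fail the cast.
    static boolean isHtmlPage(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("HEAD");
            return isHtml(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ContentTypeCheck <http-url>");
            return;
        }
        URL url = new URL(args[0]);
        System.out.println(url + " is HTML: " + isHtmlPage(url));
    }
}
```

In the spider you would call isHtmlPage(url) before handing the stream to the parser and skip the link when it returns false. Because the header can be wrong or missing, a link that passes this check may still turn out not to be HTML.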

BTW, the [url=http://aperture.sourceforge.net/]Aperture project[/url] is specifically intended to extract content and metadata from various sources, including websites. You might be able to use that.

Herko_ter_Horsta at 2007-7-14 22:43:36