Downloading HTML (BEGINNER)

I'm trying to make a small web crawler. Its only purpose would be to download HTML pages (no pictures, just HTML) from a given URL.

I've searched through topics and have found a lot, but could use a few tips.

Since I just want to download an HTML file and save it to disk, do I have to go through InputStream and FileOutputStream, or is there something better?

I've seen so many different ways to open a connection (HttpURLConnection, just url = new URL, POST/GET methods etc.). Which would be the most appropriate?

Any other useful suggestions?

[575 byte] By [tyriona] at [2007-10-3 3:47:52]
# 1

Since HTML files are text-based, you can use BufferedReader and BufferedWriter if you want.

If you always want to do a GET request with default headers, just use the URL class. If you want more control, get the URLConnection from the URL (using the openConnection method) and configure it before reading.
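To make the two options concrete, here is a minimal sketch, assuming Java 7 or later (for try-with-resources) and assuming the page is encoded as UTF-8. The class PageReader, its method names, and the "tiny-crawler" User-Agent value are made up for illustration. readWithUrl uses only the URL class, while readWithConnection goes through URLConnection to set timeouts and a request header first.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not from the thread.
class PageReader {

    // Simplest option: a GET with default headers, straight from the URL class.
    static String readWithUrl(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return readText(in);
        }
    }

    // More control: configure the URLConnection before the request is sent.
    static String readWithConnection(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);    // milliseconds
        conn.setRequestProperty("User-Agent", "tiny-crawler"); // assumed value
        try (InputStream in = conn.getInputStream()) {
            return readText(in);
        }
    }

    // Reads the stream line by line as text. UTF-8 is an assumption here;
    // a real page may declare a different charset.
    private static String readText(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n'); // readLine() strips the line terminator
        }
        return sb.toString();
    }
}
```

Both methods accept any java.net.URL, so the same code can be tried against a local file: URL before pointing it at a real site.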

Also, if you want, you can go for more sophisticated APIs like Apache HttpClient.

Also plan how you are going to parse the downloaded HTML files.

LRMKa at 2007-7-14 21:44:51 > top of Java-index,Core,Core APIs...
# 2

Thanks for the reply!

Well, I don't need much control here, nothing sophisticated. Just getting one HTML page at a time. So I guess I'll try the URL class first.

I have already tried something, using code I found here in the forum, but I only get the first 20-30 lines of HTML and the rest is truncated, no matter which web site I try.

Also, now I'm a bit confused - are you saying that when I use URL class, I don't need URLConnection? Because, my code is something like:

URL url = new URL ("http://www.efa.co.yu/index.htm");

HttpURLConnection conn = (HttpURLConnection)url.openConnection ();

conn.setAllowUserInteraction (false);

conn.setFollowRedirects (true);

conn.setRequestMethod ("GET");

InputStream is = conn.getInputStream ();

byte data[] = new byte [is.available ()];

is.read (data);

FileOutputStream fos = new FileOutputStream (new File ("/file.html"));

fos.write (data);

fos.close ();

conn.disconnect ();

I have a feeling there are many unnecessary things here, according to what you've said.

tyriona at 2007-7-14 21:44:51
# 3
This is not a good thing: byte data[] = new byte [is.available ()];
BIJ001a at 2007-7-14 21:44:51
# 4

The available method only returns the number of bytes that you can read without blocking. It says nothing about the total amount of data that is still to come.

Also, the read method does not guarantee to fill the entire array.
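A small simulation can show why a single read() into an array sized by available() truncates the page. This is only a sketch: ShortReadDemo, trickle and readOnce are hypothetical names, and the trickling stream is a stand-in for a real network connection, which may deliver data in pieces of any size.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class; it simulates a stream that hands out data
// in small pieces, the way a network connection often does.
class ShortReadDemo {

    // Wraps the data so that each bulk read() returns at most `chunk` bytes
    // and available() reports at most `chunk` as well.
    static InputStream trickle(byte[] data, final int chunk) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, chunk));
            }

            @Override
            public int available() throws IOException {
                return Math.min(super.available(), chunk);
            }
        };
    }

    // The approach from the question: size the array with available()
    // and call read() once. Returns how many bytes were actually read.
    static int readOnce(InputStream in) throws IOException {
        byte[] data = new byte[in.available()];
        return in.read(data);
    }
}
```

With 1000 bytes of data and a chunk size of 100, readOnce(trickle(new byte[1000], 100)) returns 100. The remaining 900 bytes are never read, which matches the truncated pages described in the question.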

You should read and write in a loop until you hit EOF:

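int i;

byte[] b = new byte[1024 * 64]; // a 64 KB buffer

// read() returns the number of bytes read, or -1 at end of stream

while ((i = is.read(b)) != -1)

    fos.write(b, 0, i); // write only the bytes that were actually read

is.close();

fos.close();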
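Put together, the read/write loop might be wrapped up as follows. This is a sketch rather than a definitive implementation: it assumes Java 7 or later (try-with-resources and java.nio.file), and HtmlDownloader, copy and download are names invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative, not from the thread.
class HtmlDownloader {

    // Copies everything from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[64 * 1024]; // 64 KB buffer
        long total = 0;
        int n;
        // read() may return fewer bytes than the buffer holds; -1 means EOF.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only what was actually read
            total += n;
        }
        return total;
    }

    // Downloads the resource at url into target, replacing any existing file.
    static long download(URL url, Path target) throws IOException {
        try (InputStream in = url.openStream();
             OutputStream out = Files.newOutputStream(target)) {
            return copy(in, out);
        }
    }
}
```

With the URL from the question, a call could look like HtmlDownloader.download(new URL("http://www.efa.co.yu/index.htm"), Paths.get("file.html")). The try-with-resources block closes both streams even if reading fails partway through.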

LRMKa at 2007-7-14 21:44:51
# 5
Thanks everybody! That will do the trick! And I've learned a good deal! Thanks again!
tyriona at 2007-7-14 21:44:51