Networking - Getting Garbled HTML Using java.net.URL multithreaded

Hi All,

Basically, I am trying to fetch different pages from the same domain using the URL class and then store them on the file system, but for some of them the output comes out garbled.

Your assistance is greatly appreciated!

The code follows:

public class URLFetchTest {

    static int count = 0;

    public void loadDocument(URL url, String charSet)
            throws IOException {
        InputStream in = null;
        boolean ignoreCharSet = false;
        try {
            HttpURLConnection urlc = (HttpURLConnection) url.openConnection();
            urlc.setRequestMethod("GET");
            in = urlc.getInputStream();
            Reader reader = (charSet == null) ? new InputStreamReader(in)
                    : new InputStreamReader(in, charSet);
            FileWriter fw = new FileWriter("a" + count + ".html", true);
            count++;
            char buf[] = new char[1024];
            int readCount = 0;
            while ((readCount = reader.read(buf, 0, 1024)) != -1) {
                fw.write(buf, 0, readCount);
                //buf = new char[1024];
                //System.out.println(buf);
            }
            fw.close();
            // All done
            break;
        } catch (ChangedCharSetException e) {
            break;
        }
    }

    public void loadDocument(URL url) throws IOException {
        loadDocument(url, "UTF-8");
    }

    public static void main(String ar[]) {
        /*
        Properties systemSettings = System.getProperties();
        systemSettings.put("http.proxyHost", "<PROXY>");
        systemSettings.put("http.proxyPort", "<PORT>");
        System.setProperties(systemSettings);
        */
        String urls[] = {
            "http://www.altavista.com/web/results?itag=ody&q=c++&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=java&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=perl&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=cobol&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=fortran&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=ada&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=basic&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=cgi&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=html&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=xml&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=vc++&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=css&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=dhtml&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=oracle&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=mssql&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=mysql&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=db2&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=sybase&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=informix&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=vbasic&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=swing&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=mfc&kgs=1&kls=0",
            "http://www.altavista.com/web/results?itag=ody&q=spring&kgs=1&kls=0"
        };
        for (int i = 0; i < urls.length; i++) {
            final String u = urls[i];
            new Thread() {
                public void run() {
                    try {
                        new URLFetchTest().loadDocument(new URL(u));
                    } catch (Exception e) {
                        System.out.println(e);
                    }
                }
            }.start();
        }
    }
}

Thanks,

Sant

[7106 byte] By [SanXa] at [2007-11-26 23:03:44]
# 1

Dear SanX,

Although I haven't run or tested your code (I'm busy at the moment), I thought I should tell you this: from a first look at your code, I think the problem lies in the threads, specifically in race conditions.

Try running your code as a single-threaded application and see whether the output is still garbled.

Good luck.

Amr M. Kamel

Radi0Shacka at 2007-7-10 13:55:55 > top of Java-index,Core,Core APIs...
# 2
Hi Radi0Shack,

Thank you for your quick reply. Actually, the code was single-threaded and worked fine, but I needed it to be multithreaded for some reason. Could you please tell me where I could be getting race conditions?

Sant

SanXa at 2007-7-10 13:55:55
# 3

Hey Sanx,

The problem (I still haven't tried it, so don't take everything I say for granted :)) is that in multithreaded applications, if you don't protect your resources from simultaneous access, it leads to a lot of problems (in your case, garbled output). This is because threads share the same resources, unlike heavyweight processes, where each process has its own memory space. So there must be a way to synchronize access to your shared resources among the threads. Here is a nice tutorial about threads: http://java.sun.com/docs/books/tutorial/essential/concurrency/

Again, good luck :).

Amr M. Kamel

Radi0Shacka at 2007-7-10 13:55:55
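
To make the point about synchronizing access concrete, here is a minimal sketch, assuming a plain shared int guarded by a lock. The class `SharedCounterDemo` and its members are hypothetical names used only for illustration; they are not part of the posted code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original post.
final class SharedCounterDemo {
    private static final Object LOCK = new Object();
    private static int counter = 0;

    // Increments the shared counter while holding the lock, so no update is lost.
    static void increment() {
        synchronized (LOCK) {
            counter++;
        }
    }

    // Starts `threads` workers that each increment `perThread` times,
    // waits for all of them, and returns the final counter value.
    static int run(int threads, int perThread) {
        synchronized (LOCK) {
            counter = 0;
        }
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread w = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    increment();
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        synchronized (LOCK) {
            return counter;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(8, 10_000));
    }
}
```

With the lock in place, every increment is counted. Without the `synchronized` block, `counter++` is a read-modify-write and concurrent threads may overwrite each other's updates.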
# 4

Disclaimer:

I am seeing no race conditions, deadlocks, or mutually exclusive blocks in this post, so please try the code before giving me links or pointers, and be serious in your replies (they should not be given just for the sake of it).

If someone has an idea, please give your opinion. Those who are new to threading concepts can skip this question.

Thanks,

Sant

SanXa at 2007-7-10 13:55:55
# 5
SanX,

As I told you in my previous posts, I didn't try it and I just wanted to help. As I remember, you replied to my post; if you were sure that there were no race conditions, you could have just ignored my first post.

Radi0Shacka at 2007-7-10 13:55:55
# 6
The only shared resource is the 'count' variable, which should be declared as 'volatile'. But this code won't compile: that 'break' statement is out of place. Is that the real code?

ejpa at 2007-7-10 13:55:55
# 7

ejp,

Yes, this code is real. Please remove the breaks; they were there to get out of an infinite loop (which I removed for brevity) that caught ChangedCharSetException and continued.

Also import all the classes needed.

You are right, the only shared variable is count, and this can't cause the gibberish!

Thanks,

Sant

Message was edited by:

SanX

SanXa at 2007-7-10 13:55:55 > top of Java-index,Core,Core APIs...
# 8
Well, actually it could. You could get two fetches writing to the same file at the same time, as the output file is opened in 'append' mode. Why is that?

ejpa at 2007-7-10 13:55:55
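
To illustrate why append mode matters here, this is a small sketch, assuming two pages end up being written to the same file name (for instance because two threads read the same value of `count`). `AppendModeDemo` and `writeTwice` are hypothetical names, and the temp file stands in for `a<count>.html`.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class, not part of the original post.
final class AppendModeDemo {
    // Writes two "pages" to the same file, one after the other, using the
    // given append flag, and returns what the file contains afterwards.
    static String writeTwice(boolean append) {
        try {
            File f = File.createTempFile("page", ".html");
            f.deleteOnExit();
            String[] pages = {"<html>first</html>", "<html>second</html>"};
            for (String page : pages) {
                // Same constructor as in the post: FileWriter(name, append)
                try (FileWriter fw = new FileWriter(f, append)) {
                    fw.write(page);
                }
            }
            return new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("append=true : " + writeTwice(true));
        System.out.println("append=false: " + writeTwice(false));
    }
}
```

With `append` set to `true` the file ends up holding both pages back to back, which looks like a garbled page if the file name was reused. With `false` the second write truncates the file first. Neither mode makes a reused file name safe when the writes happen concurrently.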
# 9

ejp,

Apologies for the delay in replying. The count variable does actually cause some level of garbage, but the issue still occurs even when we synchronize the counter or assign a separate file name when each thread starts.

More interestingly, the issue does not occur with every server; I actually found the issue on a CGI-based server and later converted the code to test this further.

I did find the solution: executing url.openConnection(), setting the cookie, and getting the stream in one synchronized block, and then leaving the stream open for the threads.

Anyway, thank you all for the interest shown. I really appreciate the forum community.

Thanks

SanX

SanXa at 2007-7-10 13:55:55
# 10

You need to make the use and incrementing of the 'count' variable atomic, i.e. new FileOutputStream("a"+(count++)+".html"), otherwise values can get reused and therefore also output filenames. If all the other variables are locally scoped as in the code above I can see no other way in which threads can tread on each other. I also don't understand why you're opening the output files in append mode, and I would remove that too.

ejpa at 2007-7-10 13:55:55
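
One way to make the read and the increment of the counter a single atomic step is sketched below. Note that `count++` on a plain static int is a read-modify-write and is not atomic on its own, even when written inside one expression, so this sketch swaps in `java.util.concurrent.atomic.AtomicInteger`, which the original code does not use. `UniqueFileNames`, `nextFileName`, and `collect` are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the original post.
final class UniqueFileNames {
    private static final AtomicInteger COUNT = new AtomicInteger(0);

    // getAndIncrement() reads and bumps the counter in one atomic step,
    // so no two callers can receive the same index.
    static String nextFileName() {
        return "a" + COUNT.getAndIncrement() + ".html";
    }

    // Requests names from several threads at once and returns the distinct
    // names that were handed out.
    static Set<String> collect(int threads, int perThread) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    names.add(nextFileName());
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // If every name is unique, the set size equals threads * perThread.
        System.out.println(collect(8, 1000).size());
    }
}
```

In the posted code, the equivalent change would be to build the file name from a single call such as `nextFileName()` instead of reading `count` and incrementing it on separate lines.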
# 11

ejp,

I think you probably didn't get my previous post completely!

The append mode and counter problems were rectified long ago, thank you! I paste the solution here:

synchronized (URLFetchTestRetest.class) {
    urlc = url.openConnection();
    // urlc.setRequestMethod("GET");
    urlc.setDoOutput(true);
    urlc.setRequestProperty("Cookie", "<your cookie>");
    in = urlc.getInputStream();
}

SanXa at 2007-7-10 13:55:55
# 12
You need the counter use and incrementing inside that block too!
ejpa at 2007-7-10 13:55:55