Problem imitating a browser's HTTP requests in Java

Hi,

I'm trying to parse the customer reviews section on Amazon for academic purposes.

What my program needs to do (so far without success) is fetch the content of the pages that contain the reviews.

I can fetch the page with the first 10 reviews, but I can't get the next 10.

Every attempt to fetch the next page returns the page with the first 10 again.

For example:

The page with the first 10 reviews, which I can fetch, is:

http://www.amazon.com/gp/product/customer-reviews/B000LU8A7E/sr=1-1/qid=1182795315/ref=cm_cr_dp_2_1/102-2890495-8864146?ie=UTF8&customer-reviews.sort%5Fby=-SubmissionDate&n=1065836&qid=1182795315&sr=1-1

but I can't fetch the next page (note the "customer-reviews.start=11" parameter in the URL), whose URL is:

http://www.amazon.com/gp/product/customer-reviews/B000LU8A7E/sr=1-1/qid=1182795315/ref=cm_rev_next/102-2890495-8864146?ie=UTF8&customer-reviews.sort%5Fby=-SubmissionDate&n=1065836&s=electronics&customer-reviews.start=11&qid=1182795315&sr=1-1

The code I use is:

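import java.io.*;
import java.net.URI;
import java.net.URL;

URI uri = new URI(currentURL);
URL urlFromUri = uri.toURL();
File urlOutputFile = new File("test.txt");

// try-with-resources closes both streams even if an exception is thrown
try (InputStream is = new BufferedInputStream(urlFromUri.openStream());
     OutputStream os = new BufferedOutputStream(new FileOutputStream(urlOutputFile))) {
    int c;
    // copy the response body to the file one byte at a time
    while ((c = is.read()) != -1) {
        os.write(c);
    }
}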

where currentURL is, for example, one of the URLs above. (The problem occurs with the second URL.)

What I really need to know is how to imitate the browser's HTTP request from my program.

The HTTP request the browser sends (captured with Soft HTTP debugger) looks like this, if that helps:

GET /gp/product/customer-reviews/B000M4J20S/sr=1-1/qid=1182799989/ref=cm_rev_next/102-2890495-8864146?ie=UTF8&customer-reviews.sort%5Fby=-SubmissionDate&n=541966&s=pc&customer-reviews.start=11&qid=1182799989&sr=1-1 HTTP/1.1

Accept: */*

Referer: http://www.amazon.com/gp/product/customer-reviews/B000M4J20S/sr=1-1/qid=1182799989/ref=cm_cr_dp_2_1/102-2890495-8864146?ie=UTF8&customer-reviews.sort%5Fby=-SubmissionDate&n=541966&qid=1182799989&sr=1-1

Accept-Language: en-us

Accept-Encoding: gzip, deflate

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

Host: www.amazon.com

Connection: Keep-Alive

Cookie: session-id-time=1183359600l; session-id=102-2890495-8864146; ubid-main=002-8791368-6807648; skin=noskin; session-token=GcxDeVM9BA6HDLvWOdsfqd5BYhYiAMDsvHBFtFNsnFBQXQQXXciQE41tDZZuqAV0SL9NBndfrfYLfamyYBkizurD7RKAZsCehBFL71sLoKrE3WY2hRlMV1Ur0TDbeq04fH59Y/fcTp7T1sKs0tiemnxRvAm6pnGRBA5iBrMiFdPLhwOlLRRlJHrkUze+hdI2+Ew3YbOt7Q4=

Any help will be appreciated.

[2782 byte] By [DDTa] at [2007-11-27 8:47:45]
# 1

You don't have to imitate an HTTP request; the HTTP-related classes in java.net send real HTTP requests. What you're really asking is either how to provide non-HTTP-specific browser functionality that Amazon requires, or how to spoof browser characteristics that some web sites require (like particular User-Agent strings). I doubt that Amazon requires any of the latter; they're pretty good that way.
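For the case where a site does check browser characteristics, here is a minimal sketch of sending the same kind of headers as the captured request through java.net.HttpURLConnection. The class and method names (BrowserLikeFetcher, browserHeaders, fetchToFile) are made up for illustration, and nothing in this thread establishes that Amazon's paging actually depends on these headers, so treat it as an experiment rather than a fix.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any library.
class BrowserLikeFetcher {

    // Builds headers modelled on the captured browser request.
    // Accept-Encoding is left out on purpose: the standard HttpURLConnection
    // does not decompress gzip bodies for you.
    static Map<String, String> browserHeaders(String referer, String cookie) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "*/*");
        headers.put("Accept-Language", "en-us");
        headers.put("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
        if (referer != null) {
            headers.put("Referer", referer);
        }
        if (cookie != null) {
            headers.put("Cookie", cookie);
        }
        return headers;
    }

    // Sends a GET with the given headers and saves the response body to target.
    static void fetchToFile(String url, Map<String, String> headers, Path target)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        for (Map.Entry<String, String> header : headers.entrySet()) {
            conn.setRequestProperty(header.getKey(), header.getValue());
        }
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            conn.disconnect();
        }
    }
}
```

A call would look something like fetchToFile(currentURL, browserHeaders(previousPageUrl, sessionCookie), Path.of("test.txt")), where previousPageUrl and sessionCookie are values you supply yourself, for example taken from a fresh browser session.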

Anyway, you're going about this the wrong way. Don't try to HTML-scrape their normal web site; use their web services site, in particular the E-Commerce web services. These give you all the data that's available on their regular web site, but formatted for arbitrary programs, not just web browsers. It's great, and really easy. They offer both heavy-duty web services and REST. (Use REST; it's HTTP as it was meant to be.)
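To give a feel for what a REST-style call involves, here is a rough sketch of building a query URL from named parameters. The base URL and the parameter names are placeholders, not Amazon's real endpoint or parameters; the actual ones come from their web services documentation once you register.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any library.
class RestQueryBuilder {

    // Appends the parameters to baseUrl as a URL-encoded query string,
    // in the iteration order of the map.
    static String buildQueryUrl(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            sb.append(separator)
              .append(encode(p.getKey()))
              .append('=')
              .append(encode(p.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    // Example with placeholder endpoint and parameter names.
    static String exampleUrl() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("itemId", "B000LU8A7E");   // product id from the question
        params.put("page", "2");              // hypothetical paging parameter
        return buildQueryUrl("http://api.example.com/reviews", params);
    }
}
```

The resulting URL can be opened with the same java.net code as in the question; a REST response is meant to be parsed by a program, so there is no HTML to scrape.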

paulcwa at 2007-7-12 20:53:23
# 2
Oh, and it's free, too. You have to register but it's free. (The E-commerce web services anyway. They have some other services that you have to pay for.)
paulcwa at 2007-7-12 20:53:23