Problem imitating browser HTTP requests in Java
Hi,
I'm trying to parse the customer reviews section of Amazon product pages for academic purposes.
What I'm trying to do in my software (so far without success) is fetch the content of the pages that contain the reviews.
I can get the page with the first 10 reviews, but I can't get the page with the next 10.
Every attempt to fetch it returns the page with the first 10 again.
For example:
The page with the first 10 reviews, which I am able to get, is
http://www.amazon.com/gp/product/customer-reviews/B000LU8A7E/sr=1-1/qid=1182795315/ref=cm_cr_dp_2_1/102-2890495-8864146?ie=UTF8&customer-reviews.sort%5Fby=-SubmissionDate&n=1065836&qid=1182795315&sr=1-1
but I can't get the next page (note the "customer-reviews.start=11" parameter in the URL),
whose URL is:
http://www.amazon.com/gp/product/customer-reviews/B000LU8A7E/sr=1-1/qid=1182795315/ref=cm_rev_next/102-2890495-8864146?ie=UTF8&customer-reviews.sort%5Fby=-SubmissionDate&n=1065836&s=electronics&customer-reviews.start=11&qid=1182795315&sr=1-1
The code I use is:
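import java.io.*;
import java.net.URI;
import java.net.URL;

// Download the page at currentURL and save it to test.txt
URI uri = new URI(currentURL);
URL urlFromUri = uri.toURL();
File urlOutputFile = new File("test.txt");
// try-with-resources closes both streams even if an exception is thrown
try (InputStream is = urlFromUri.openStream();
     OutputStream os = new FileOutputStream(urlOutputFile)) {
    byte[] buffer = new byte[8192];
    int n;
    // Copy the response body to the file in chunks
    while ((n = is.read(buffer)) != -1) {
        os.write(buffer, 0, n);
    }
}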
where currentURL is, for example, one of the URLs above. (The problem is with the second URL.)
What I actually need to know is how to imitate the browser's HTTP request from my software.
The HTTP request the browser sends (monitored with Soft HTTP debugger) looks like this, if that helps:
GET /gp/product/customer-reviews/B000M4J20S/sr=1-1/qid=1182799989/ref=cm_rev_next/102-2890495-8864146?ie=UTF8&customer-reviews.sort%5Fby=-SubmissionDate&n=541966&s=pc&customer-reviews.start=11&qid=1182799989&sr=1-1 HTTP/1.1
Accept: */*
Referer: http://www.amazon.com/gp/product/customer-reviews/B000M4J20S/sr=1-1/qid=1182799989/ref=cm_cr_dp_2_1/102-2890495-8864146?ie=UTF8&customer-reviews.sort%5Fby=-SubmissionDate&n=541966&qid=1182799989&sr=1-1
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: www.amazon.com
Connection: Keep-Alive
Cookie: session-id-time=1183359600l; session-id=102-2890495-8864146; ubid-main=002-8791368-6807648; skin=noskin; session-token=GcxDeVM9BA6HDLvWOdsfqd5BYhYiAMDsvHBFtFNsnFBQXQQXXciQE41tDZZuqAV0SL9NBndfrfYLfamyYBkizurD7RKAZsCehBFL71sLoKrE3WY2hRlMV1Ur0TDbeq04fH59Y/fcTp7T1sKs0tiemnxRvAm6pnGRBA5iBrMiFdPLhwOlLRRlJHrkUze+hdI2+Ew3YbOt7Q4=
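For reference, here is a minimal sketch of how the captured request might be reproduced with HttpURLConnection. It assumes the server relies on the Referer and Cookie headers to decide which page to return, which I have not confirmed. The names BrowserLikeRequest, browserHeaders and fetch are hypothetical and exist only for this illustration. Accept-Encoding is deliberately not sent so the response comes back uncompressed, and Host and Connection are left to HttpURLConnection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of any library.
class BrowserLikeRequest {

    // Builds the headers seen in the captured browser request.
    // Accept-Encoding is omitted so the body is not gzip-compressed.
    static Map<String, String> browserHeaders(String referer, String cookie) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "*/*");
        headers.put("Referer", referer);
        headers.put("Accept-Language", "en-us");
        headers.put("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
        headers.put("Cookie", cookie);
        return headers;
    }

    // Sends a GET with the given headers, saves the body to outputFile,
    // and returns the HTTP status code.
    static int fetch(String pageUrl, Map<String, String> headers,
                     String outputFile) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            for (Map.Entry<String, String> h : headers.entrySet()) {
                conn.setRequestProperty(h.getKey(), h.getValue());
            }
            int status = conn.getResponseCode();
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, Paths.get(outputFile),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            return status;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: BrowserLikeRequest <url> <referer> <cookie>");
            return;
        }
        int status = fetch(args[0], browserHeaders(args[1], args[2]), "test.txt");
        System.out.println("HTTP status: " + status);
    }
}
```

The page URL, the Referer and the Cookie string would be passed on the command line, copied from the captured request above. The cookie values are session-specific, so a value copied from an old capture may no longer be accepted.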
Any help will be appreciated.

