Servlet as proxy and HTML filter
This is the scenario: A client sends a request for an external web page (e.g. www.cnn.com) to a servlet. The servlet is supposed to go out and fetch this page. The received page should then be filtered depending on the user's session parameters.
Is this possible at all? Maybe with a combination of Apache and Tomcat? I would really appreciate your help!
Thanks,
Peter
[389 byte] By [petelockea] at [2007-10-2 10:54:00]

> Is this possible at all?
Sure it is. Technically, that is. Whether it's legal or not is a separate issue.
> Maybe with a combination of Apache and Tomcat?
Nope. Apache is a web server; Tomcat is a servlet container (a web application server). The servlet that you write could run in Tomcat if you liked, but it's up to you to write the code. Tomcat's only responsibility is to run it.
Thanks for the hint! However, I played around a little and ran into the next problem. Basically, HttpClient fetches the source of the requested page exactly as a browser would receive it. If I send this back to the user/client, all links (e.g. to images) are broken, because they are relative to the original server but the browser now resolves them against my servlet's address. Is there a fix for that?
> Two (of many possible) ways:
>
> > Use HttpURLConnection to fetch the data yourself.
>
> > Download Jakarta's HttpClient and use that to fetch
> the data (my preference).
>
> How you filter and under what criteria are presumably
> what make your system unique.
>
> - Saish
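For the first of those two ways, here is a minimal sketch using only the JDK's HttpURLConnection. The class name PageFetcher, the method names, the 5-second timeouts and the UTF-8 decoding are all assumptions made for illustration; a real proxy would read the charset from the Content-Type header and handle redirects and error pages more carefully.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class PageFetcher {

    /** Reads the stream to its end and decodes it as UTF-8 (a simplifying assumption). */
    static String readAll(InputStream in) {
        try (InputStream input = in;
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = input.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fetches the page at pageUrl with a plain GET and returns its body as text. */
    static String fetch(String pageUrl) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000); // milliseconds, arbitrary choice
            conn.setReadTimeout(5000);
            try {
                int status = conn.getResponseCode();
                if (status != HttpURLConnection.HTTP_OK) {
                    throw new IOException("Unexpected HTTP status " + status);
                }
                return readAll(conn.getInputStream());
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside the servlet's doGet you would call something like PageFetcher.fetch("http://www.cnn.com/"), run your filter over the returned string, and write the result to the response.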
Hmm.... not a trivial problem. Two high level options I can think of:
> If your app will always proxy to a given site, then you should be able to inspect each URL the browser requests for additional files (it will ask your site, since that is where the relative URL evaluates to). You would then dynamically request the resource from the actual site. This only works if the URLs point to a limited number of target hosts.
> If your app proxies to many sites, you may have to associate a session token with the browser. When the browser makes a request, you first fetch the HTML, then store the host. On subsequent client requests with the right session id, perform the requests yourself (against the host stored earlier) and then re-write the responses sent back (RegEx may work well in this role).
Option #2 is truly complicated. Option #1 is probably doable.
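As a rough illustration of the re-writing step, the sketch below resolves relative href and src values against the URL the page was fetched from, using java.net.URI and a regular expression. The class name LinkRewriter and the method absolutize are made up for this example, and the pattern only handles double-quoted attribute values; it ignores single quotes, unquoted values, CSS url(...) references and links built by JavaScript, so treat it as a starting point rather than a complete solution.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class LinkRewriter {

    // Matches href="..." or src="..." (case-insensitive, double quotes only).
    private static final Pattern LINK_ATTR =
            Pattern.compile("(?i)\\b(href|src)\\s*=\\s*\"([^\"]*)\"");

    /** Returns html with every matched href/src value made absolute against pageUrl. */
    static String absolutize(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        Matcher m = LINK_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String resolved;
            try {
                resolved = base.resolve(m.group(2).trim()).toString();
            } catch (IllegalArgumentException e) {
                resolved = m.group(2); // not a valid URI: leave the value untouched
            }
            String attr = m.group(1) + "=\"" + resolved + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(attr));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With the page fetched from http://www.cnn.com/news/today.html, a link such as src="images/logo.gif" becomes src="http://www.cnn.com/news/images/logo.gif". Note that absolute links make the browser load those resources directly from the original server. If the images and linked pages should also pass through your filter, you would instead prefix each resolved URL with the address of your own servlet, which is the request handling described in the options above.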
Your final option is to install a true proxy, such as Squid, and see what filtering capabilities it offers.
- Saish