> Are there other methods than looking at the attribute
> "HREFLANG" to understand which language a web site
> is in?
Never heard of that attribute. Typically, the content type is defined in a response header and/or a META tag. The language could still be defined as a separate header. Depending on what you are looking for, the content type is needed to read the content correctly from your program, and does not necessarily have anything to do with the language.
> Secondly, is there a direct method in java that
> handles this?
Only to read the header from a URLConnection.
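For illustration, here is a minimal sketch of reading that header from a URLConnection. It assumes the server sends the standard Content-Language header at all, which many do not. The class and method names (LanguageHeaderCheck, listsTurkish, fetchContentLanguage) and the 5-second timeouts are made up for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
class LanguageHeaderCheck {

    // Returns true if a Content-Language value such as "tr", "tr-TR"
    // or "en, tr" lists Turkish among its language tags.
    static boolean listsTurkish(String contentLanguage) {
        if (contentLanguage == null) {
            return false; // header absent: the language is simply unknown
        }
        for (String tag : contentLanguage.split(",")) {
            // Locale.ROOT avoids the Turkish dotless-i lowercasing rules
            String t = tag.trim().toLowerCase(Locale.ROOT);
            if (t.equals("tr") || t.startsWith("tr-")) {
                return true;
            }
        }
        return false;
    }

    // Opens the URL and returns the Content-Language response header,
    // or null if the server did not send one.
    static String fetchContentLanguage(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(5000); // assumed timeout, adjust as needed
        conn.setReadTimeout(5000);
        return conn.getHeaderField("Content-Language");
    }
}
```

Something like listsTurkish(fetchContentLanguage(url)) would then give a first guess, but a false result only means the header did not say Turkish, not that the page is in another language.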
> Lastly, can I always rely on these?
No, not really. It can always have the wrong info.
Thanks bsampieri,
From what I understood, the language may or may not be specified, and even when it is, it can mislead me. Am I right?
And can you explain this a little bit more?
> Never heard of that attribute. Typically, the
> content type is defined in a response header
> and/or a META tag. The language could still be
> defined as a separate header. Depending on what you
> are looking for, the content-type is needed to read
> the content correctly from your program, and does not
> necessarily have anything to do with the language.
I have to understand the general language of the web sites because I have a project in which I have to separate out only Turkish sites. If you can tell me how to use your info, or suggest any other options, I will be glad.
Thanks again...
Hi paulcw,
That is exactly what I am trying to do. However, my spider should only download Turkish web sites. So somehow I have to work out which language the web site is in (or at least guess it with high accuracy). Please ask about anything that is not clear, and thank you very much for your help.
Regards....
Then it looks like bsampieri's post should help.
Probably what you want, then, is to do a HEAD connection onto a server in question and check its response headers to see if the language is Turkish.
Things get tricky after that, because sites can house multi-lingual context. I suspect that HTML tags wouldn't be very useful, except when processing an index page to see if it has links to Turkish content.
You should probably check out the HTTP specification (it's an RFC) to see which server response headers can tell you the language of the site. You should also check to see if there are client request headers to ask for content of a particular language. So we're talking about language-oriented content negotiation in the HTTP headers only at this point.
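As a rough sketch of the above: a HEAD request asks the server for the response headers only, without the page body. In HTTP, the request header for asking for a particular language is Accept-Language, and the response header that may describe the language is Content-Language. The class and method names below (HeadLanguageProbe, acceptLanguageFor, probe) and the q=0.1 fallback weight are assumptions for illustration, and a server is free to ignore the request header or omit the response header.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class, not part of any library.
class HeadLanguageProbe {

    // Builds an Accept-Language value that prefers the given language
    // but still accepts anything else with a low weight.
    static String acceptLanguageFor(String primaryTag) {
        return primaryTag + ", *;q=0.1";
    }

    // Sends a HEAD request and returns "<status> <Content-Language>".
    // The language part is "null" when the server sends no such header.
    // Assumes an http or https address; other schemes fail the cast.
    static String probe(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body
        conn.setRequestProperty("Accept-Language", acceptLanguageFor("tr"));
        conn.setConnectTimeout(5000); // assumed timeout, adjust as needed
        conn.setReadTimeout(5000);
        try {
            int status = conn.getResponseCode();
            String language = conn.getHeaderField("Content-Language");
            return status + " " + language;
        } finally {
            conn.disconnect();
        }
    }
}
```

Because only headers come back, this is cheap enough for a spider to run before deciding whether to download the page itself.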
As far as I'm aware, there's no standard for looking at web page content to determine the available languages on a site, though that doesn't mean there isn't one. In any case, this isn't really a Java question yet; it's a web standards question.
Hi paulcw,
You are absolutely right, indeed it is a web question, but the problem is that I have to code it (or integrate the code) in Java. So I am trying to benefit as much as I can from the "pre-written" methods. Here is what I understand from your post:
1st: I can't rely on header attributes since they can be misleading, missing, etc.
However, there is a point that I can't understand from either your post or bsampieri's (please forgive my level of knowledge of web concepts):
> Probably what you want, then, is to do a HEAD
> connection onto a server in question and check its
> response headers to see if the language is Turkish.
What is a HEAD connection? How can I do it? I think it is different from the header?
> Things get tricky after that, because sites can house
> multi-lingual context. I suspect that HTML tags
> wouldn't be very useful, except when processing an
> index page to see if it has links to Turkish
> content.
If the page has multi-language context, what difficulties can I face? Are they about the HEAD connection?
> You should probably check out the HTTP specification
> (it's an RFC) to see which server response headers
> can tell you the language of the site. You should
> also check to see if there are client request headers
> to ask for content of a particular language. So
> we're talking about language-oriented content
> negotiation in the HTTP headers only at this point.
What is a server response, and what is a response header?
Again, please forgive my low level of knowledge of these concepts. And if there is a tutorial or web page that I can read that answers my questions, I will gladly accept that too. I am really trying to learn this. I know I have tired you a lot, so thanks for all your effort.
Regards...
All your questions are basically about HTTP, which is the basic protocol by which web content is exchanged.
Here's a link about HTTP:
http://www.w3.org/Protocols/rfc2616/rfc2616.html
But you can probably find a tutorial somewhere online, possibly on w3.org. And there are books about it as well.