Understanding structure of a URL
I need to extract the "domain name" part of a URL.
For example, given the URL:
http://forum.java.sun.com/
I need to extract "sun" from the URL. However, this is not easy, because Java's built-in getHost() method returns "forum.java.sun.com", which is more than I need.
One approach I was looking at was to delete anything that occurs after the last dot to remove the ".com". But what happens when you have a URL ending in ".co.uk" (two dots)? And how could you distinguish between the actual domain name and a subdomain?
I have thought about the problem for a while and am not getting anywhere fast. Has anyone else tried to solve such a problem?
Any help is appreciated,
thanks
[742 byte] By [Mylesa] at [2007-11-27 11:50:08]

http://en.wikipedia.org/wiki/Domain_name
Read that, especially the section about "Other-level domains". That section essentially says that you don't have a well-defined problem.
Your example of "co.uk" is just one example. Last month I was in the UK and I noticed on the back of a bus that it's possible to have a subdomain of the "uk.com" domain for your website, something like http://www.bigbuscompany.uk.com/ (that wasn't the real URL, I don't remember what it was).
Also along those lines, there's http://www.las-vegas.us.com/ which I just found.
So basically there is no easy way of solving this problem?
The only logical way I can think of is to have a database of all the possible TLDs for every country and to test a URL against all known TLDs. However, this is not desirable, as I'm sure new TLDs are introduced periodically.
Mylesa at 2007-7-29 18:29:28 >

But wouldn't abc.uk.com be a subdomain abc in the uk.com domain name. It was never clear to me in the case of co.uk that combined it is a top-level domain. I always thought it was just .uk, and the co part co.uk was the domain name, and anything above that was a subdomain.
Generally speaking, I would refer to the domain name as the last 2 parts. Thus the .com or .uk is the top-level domain, and the co.uk or sun.com is the "domain name" (you really need the top-level part also, IMO, otherwise you can run into collisions).
> But wouldn't abc.uk.com be a subdomain abc in the
> uk.com domain name. It was never clear to me in the
> case of co.uk that combined it is a top-level domain.
> I always thought it was just .uk, and the co part
> co.uk was the domain name, and anything above that
> was a subdomain.
>
co.uk is a TLD.
http://wiki.mozilla.org/TLD_List
@OP the only way I see this working at all is that you have a list of TLDs and you remove that part first.
Then remove anything before what remains because that's just a host or subdomain name.
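A rough sketch of that suffix-list idea in Java follows. The class name, the method name and the five-entry suffix set are all made up for illustration; a real list (such as the Mozilla one linked above) is far longer and changes over time, so treat this as an outline under those assumptions rather than a finished solution.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class RegisteredNameSketch {
    // Tiny hand-picked sample of suffixes (an assumption, NOT a complete list).
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk", "uk.com", "us.com"));

    /** Returns the label just left of the longest matching suffix, or null if no suffix matches. */
    static String nameBeforeSuffix(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Start at index 1 so at least one label remains; longer suffixes are tried first.
        for (int i = 1; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                return labels[i - 1]; // everything further left is a host or subdomain name
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(nameBeforeSuffix("forum.java.sun.com"));       // sun
        System.out.println(nameBeforeSuffix("www.bigbuscompany.uk.com")); // bigbuscompany
        System.out.println(nameBeforeSuffix("java.co.uk"));               // java
    }
}
```

Trying the longest suffix first is what lets "uk.com" win over plain "com" for the bus company example.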
I am wondering though about why you would want this information. Because really a domain name is both the sun and the com. The TLD is part of the domain name...
Well, good luck anyway.
I'll briefly explain what I am trying to do. I am writing a web crawler which downloads a page and saves it in a directory. The location where it is stored depends on the URL of the page downloaded.
For example if you had the URL: http://forum.java.sun.com
I want it to be stored at: directory\com\s\u\n\java\forum\thePage.gzip
I have a naive outline of the code written which processes "obvious" URLs, but it is proving very difficult to make it robust enough for "unknown" domain structures.
As previously suggested, I may just remove the last part of the host name (anything after the last dot) and take the next two parts as the domain, i.e. java.co.uk would be "java.co". However, what happens if you had sun.java.com? You can't just take off the .com part and have sun.java as the domain, as it isn't correct: "java" is the domain!
I think I may need to resort to chopping off the last part of the hostname after the last dot and just using what's left, i.e. the second-level domain and all subdomains (if they exist).
Mylesa at 2007-7-29 18:29:28 >

> I'll briefly explain what i am trying to do. I am
> writing a web crawler which downloads a page and
> saves it in a directory. The location it is stored
> depends on the URL of the page downloaded.
>
> For example if you had the URL:
> http://forum.java.sun.com
>
> I want it to be stored at:
> directory\com\sun\java\forum\thePage.gzip
>
This seems like a lot of work for something that you don't really need to do.
I would suggest doing it more like browsers do it (for caching).
Take the page - get a hash for it - then make a map mapping the full url to the hash. Store all the files in a directory using the hash as a name. Store the map as an index somewhere. It could even really be a properties file.
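Here is one way that could look in Java, as a sketch only. The class and method names, the "index.properties" file name and the choice of Arrays.hashCode for the page hash are my assumptions for illustration; the directory passed in is assumed to exist already.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Properties;

public class PageCacheSketch {
    private final Path dir;                            // all cached pages live in this one directory
    private final Properties index = new Properties(); // maps full URL -> file name

    PageCacheSketch(Path dir) {
        this.dir = dir;
    }

    /** Saves the page under a hash-derived file name and records URL -> name in the index. */
    String store(String url, byte[] page) throws IOException {
        String existing = index.getProperty(url);
        if (existing != null) {                        // URL already cached: overwrite its file
            Files.write(dir.resolve(existing), page);
            return existing;
        }
        String base = Integer.toHexString(Arrays.hashCode(page));
        String name = base;
        int n = 0;
        while (Files.exists(dir.resolve(name))) {      // name taken: append a counter until free
            name = base + "-" + (++n);
        }
        Files.write(dir.resolve(name), page);
        index.setProperty(url, name);
        return name;
    }

    /** Returns the cached bytes for the URL, or null if the URL is not in the index. */
    byte[] load(String url) throws IOException {
        String name = index.getProperty(url);
        return name == null ? null : Files.readAllBytes(dir.resolve(name));
    }

    /** Writes the URL -> file name index to a properties file in the cache directory. */
    void saveIndex() throws IOException {
        try (OutputStream out = Files.newOutputStream(dir.resolve("index.properties"))) {
            index.store(out, "URL to cached file name");
        }
    }
}
```

Lookups go through the index by URL, so the hash itself never has to be recomputed or be unique on its own.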
> This seems alot of work for something that you don't
> really need to do.
>
> I would suggest doing it more like browsers do it
> (for cacheing).
>
> Take the page - get a hash for it - then make a map
> mapping the full url to the hash. Store all the files
> in a directory using the hash as a name. Store the
> map as an index somewhere. It could even really be a
> properties file.
I am going to be storing thousands and thousands of pages. Would this implementation make locating a file harder (take longer)? I liked the idea of having a structured mechanism for saving each page by URL, rather than having every page saved in a single folder.
However having said that, if your idea would be just as efficient, could you explain it a little further? I'm not too familiar with hashsets etc - is there any chance of two pages being given the same hash for example?
Thanks
Mylesa at 2007-7-29 18:29:28 >

> > This seems alot of work for something that you don't really need to do.
> >
> > I would suggest doing it more like browsers do it (for cacheing).
> >
> > Take the page - get a hash for it - then make a map mapping the full url to the hash. Store all the files in a directory using the hash as a name. Store the map as an index somewhere. It could even really be a properties file.
>
> I am going to be storing thousands and thousands of pages, would this implementation make locating a file harder (take longer)? I liked the idea of have a structured mechanism for saving each page by URL, rather than having every page saved in a single folder.
>
> However having said that, if your idea would be just as efficient, could you explain it a little further? I'm not too familiar with hashsets etc - is there any chance of two pages being given the same hash for example?
>
> Thanks
As a guesstimate, it would be faster, because you would be using a Map of keys (URLs) to values (files).
For the hash you could just use the object's hashcode. Collisions (duplicates) are possible, but if you hash something like the URL String they are unlikely to happen unless the URL is actually the same.
Even so, you could check for collisions and, if one exists, just append a 1 or something to the end of the hash (as a String). Problem solved, because you will never use the hash again: after the file is created, the hash is stored only as the value, with the URL as the key.
> But wouldn't abc.uk.com be a subdomain abc in the uk.com domain name.
Yes, it would. But my guess is that Myles would want to extract the "abc" part and not the "uk" part.
> As a guesstimate it would be faster. Because you
> would be using a Map of keys (URLs) to values
> (files).
>
> By hash you could just get the hashcode for the
> object. It is possible to have collisions
> (duplicates) but if you used something like the URL
> String they are unlikely to happen unless the URL is
> actually the same.
>
> Even so you could check for collisions and if they
> exist then just append a 1 or something to the end of
> the hash (as a String). Prob sol. Because you will
> never use the hash again. After the file is created
> the hash is stored only as the value where the URL is
> used as the key.
So if I understand you:
1. You take the URL and generate a hashcode (how? - provide a link please)
Question: if every url is unique (which it should be as URLs for different pages, by nature, are unique!) shouldn't the hashcode generated always be unique - or is this not the case?
2. Once you have the hashcode for the URL you save the page in a file with the hashcode as the filename?
3. Store a map to the file somewhere?
Would you have to store a map file somewhere which contained the hashcode for a given URL as I would need to find stored pages in the future - given their URL
Thanks, you'll need to forgive my ignorance when it comes to hashes, maps, etc. I have never used them before.
Mylesa at 2007-7-29 18:29:28 >

> I have a naive outline of the code written which
> processes "obvious" URL's but it is proving very
> difficult to make it robust enough for "unknown"
> domain structures.
Since you're crawling anyway, a DNS lookup isn't a massive overhead.
Use class A, B and C to determine how many parts you should chop off.
> > But wouldn't abc.uk.com be a subdomain abc in the
> uk.com domain name.
>
> Yes, it would. But my guess is that Myles would want
> to extract the "abc" part and not the "uk" part.
It seems to me that it's pointless to worry about whether co.uk is a TLD or just the uk, because your structure still applies regardless.
Otherwise, just break it up such that each part is a "directory" and don't worry about breaking up the 2nd level into characters.
Or do the hashing thing. As a caching scheme, I think it's better (unless you really need users to be able to browse the cache directly).
Thanks for all your replies, I think this is what I am going to do:
1. Take the URL and generate its hashcode. Something like:
import java.net.URL;
URL theUrl = new URL("http://forum.java.sun.com"); // may throw MalformedURLException
int pageHashCode = theUrl.toExternalForm().hashCode(); // String.hashCode() returns an int
// Save file on disk with filename = pageHashCode
Then when I wish to access the page in the future, I simply take the page's URL, generate its hashcode, then directly access the file name. I don't need to store any "map", do I?
One question though: am I guaranteed that no two different URLs will generate the same hashcode? Even between different JVMs?
Lastly, is everyone satisfied that this approach is better than creating a "structured" directory of pages built using their URLs?
Mylesa at 2007-7-29 18:29:28 >

> One question though, am I guaranteed that no two
> different URL's will generate the same hashcode? Even
> between different JVM's?
No, you aren't guaranteed that. If you think about it you will see that the number of possible URLs vastly exceeds the number of possible ints.
That's why cotton.m suggested using the hashcode as a directory name and not a file name, I believe.
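To make the counting argument concrete, here is a small Java check with two made-up URLs (example.com is only a placeholder) that differ but share a String hashcode. It relies on the fact that "Aa" and "BB" hash to the same value, so any two strings with the same prefix followed by those endings collide as well. On the JVM part of the question: String.hashCode() is defined by a formula in its Javadoc, so the same string gives the same value on every JVM, but that does not help with collisions.

```java
public class HashCollisionDemo {
    /** True when two different strings produce the same String.hashCode() value. */
    static boolean collides(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // Hypothetical URLs chosen only because their endings "Aa" and "BB" hash alike.
        String first = "http://example.com/Aa";
        String second = "http://example.com/BB";
        System.out.println(collides(first, second)); // true
    }
}
```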
Would Object.hashCode() be a good choice? Maybe a more standard hashing algorithm like SHA or something?
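For what the SHA route might look like, here is a sketch using java.security.MessageDigest with SHA-256 on the URL string. The class and method names are invented for illustration, and hashing the URL (rather than the page content) is my assumption. A collision is still possible in principle, since the output is finite, but it is not something you would expect to hit by accident.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UrlDigestSketch {
    /** Returns the SHA-256 digest of the URL string as 64 lowercase hex characters. */
    static String sha256Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The hex string is safe to use as a file name on common file systems.
        System.out.println(sha256Hex("http://forum.java.sun.com/"));
    }
}
```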
Having read a little regarding hashcodes, I have decided not to use such an implementation, as it's not foolproof and there is a chance I could encounter problems later on. Therefore I opted to simply turn each full stop in a URL's host name into a new directory level, e.g.:
The URL:
http://spaceflight.nasa.gov/station/index.html
Gets stored at:
E:\Search Engine\Archive\.gov\nasa\spaceflight\spaceflight.nasa.gov\stationindex.html.gzip
For what I need this will suffice, as I can be certain every page will have a unique path to be saved. It also has a few other benefits, like preserving the original URL of the page in its hierarchy structure, amongst others.
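A simplified Java sketch of this kind of mapping follows. It is not the exact layout shown above: it only reverses the host labels into directories and appends the URL's path segments. The class name, the method name and the "index" default for URLs without a file part are assumptions, it uses "/" as the separator regardless of platform, and it ignores query strings, so two URLs that differ only in their query would map to the same path.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UrlPathSketch {
    /** Maps a URL to a relative storage path: reversed host labels, then the URL's path segments. */
    static String toRelativePath(String url) {
        URI uri = URI.create(url); // assumes a well-formed URL that has a host part
        List<String> parts = new ArrayList<>(Arrays.asList(uri.getHost().split("\\.")));
        Collections.reverse(parts); // forum.java.sun.com -> com, sun, java, forum
        String path = uri.getPath();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                parts.add(segment);
            }
        }
        if (path.isEmpty() || path.endsWith("/")) {
            parts.add("index"); // hypothetical default name when the URL names no file
        }
        return String.join("/", parts) + ".gzip";
    }

    public static void main(String[] args) {
        // gov/nasa/spaceflight/station/index.html.gzip
        System.out.println(toRelativePath("http://spaceflight.nasa.gov/station/index.html"));
        // com/sun/java/forum/index.gzip
        System.out.println(toRelativePath("http://forum.java.sun.com"));
    }
}
```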
I think it's the best solution for an otherwise awkward problem.
Thanks for all your help
Mylesa at 2007-7-29 18:29:33 >
