"Best" depends on your specific situation and requirements.
Simplest would be to maintain a Map of name to count, read line by line, and for each name, if it's already in the map, increment its count, and if not, add it to the map with a count of one
More scalable and possibly more performant would be to stick the whole thing in a database and let it find them for you.
If it is not in a database and it's just one file with all those data, what approach would be good in the realm of Java? Would Map still be the best choice? Would the complexity be n^2 since you would need to put everything in, then compare all the sizes?
Thanks!
I guess another example might be webhits - which is more based on how many an IP hit than the name. But either way its going to be just one file.
Message was edited by:
shaselai
Where is the file coming from? If you are able to sort the contents of the file, like when dumping from a database table, it will make your job a lot easier. You can use the sorted data to only have to keep track of certain entries in the file, rather than having to try to maintain millions of entries and run the risk of running out of memory.
> If it is not in a database and it's just one file
You can still put it into a DB.
> with all those data, what approach would be good in
> the realm of Java?
I thought I already said that.
> Would Map still be the best
> choice?
Simplest? Probably. Best? Only you can determine that.
> Would the complexity be n^2 since you would
> need to put everything in, then compare all the
> sizes?
No, it should be O(2N) (which is really O(N)). Inseting into the map is O(N), and then iterating once over the entries and adjusting your running top 5 is O(N).
Is there some sort of unique identifier? Because you could have more than one person with the same name which would mean you would have to check the address to for uniqueness. As Jverd says a map and a count is probably the only way left with what you have. Use some unique key like there first name, last name and house number (which still has some problems) and read them into a buffer reader adding a new object to the map for each new name and increasing a count when the same name/identifier appears more than once.
> track of certain entries in the file, rather than
> having to try to maintain millions of entries and
> run the risk of running out of memory.
That depends on the count profile. If it's a few million names, each repeated just a few times, then yeah, memory could be a problem. If it's only a few thousand names, with the heavy hitters occurring thousands of times, then shouldn't be an issue.
> That depends on the count profile. If it's a few
> million names, each repeated just a few times, then
> yeah, memory could be a problem. If it's only a few
> thousand names, with the heavy hitters occurring
> thousands of times, then shouldn't be an issue.
Agreed. I don't know how many repeat entries there are so I was playing it safe.
> > That depends on the count profile. If it's a few
> > million names, each repeated just a few times,
> then
> > yeah, memory could be a problem. If it's only a
> few
> > thousand names, with the heavy hitters occurring
> > thousands of times, then shouldn't be an issue.
>
> Agreed. I don't know how many repeat entries there
> are so I was playing it safe.
Hence my hand-waving, wuss-out loophole in my first reply: "Best" depends on the specific situation and requriements. :-)
> I see. I guess any Map will do? Like a Hashtable
>
HashMap would be preferable.
> So what if there are hundreds of thousand of names in
> the couple million entries? Would a Map still be a
> good idea or there's something else in Java that
> might do the trick?
Depends how much memory you can give to your VM, how many names, and how long the names are.
100,000 names * 10 chars per name * 2 bytes per char is 2,000,000 bytes to store the names. There'll be other overhead, but the names themselves should dominate. That's 2 MB. Almost nothing unless you're in a J2ME environment.
Ok. What is the best approach after I insert all the unique addresses? I guess i have something like :
if (key not in table) {
table.add(key,counter);
uniqueIPList.add(key);
} else {
table.get(key)++;
}
now from above I should have a Hashtable of unique Ips and their frequencies plus an array of unique ips for keys. What is the best/efficient way to find the top 10 frequencies? Do I use a size 10 array and keep adding/removing values while I iterate the entire map?
Would the complexity be N(for insertion) + 10*N(10 for looking through array and N for iteration) = N?
Sorry I am a little rusty on complexities.. thanks!