What is the best approach to trying to find high freq hits in a file?

Lets say I have a text document that has millions of rows of information like "Name, address, last time checked in:"What is the best approach if I were to look for the top 5 people who appears the most on this huge list? Thanks!
[249 byte] By [shaselaia] at [2007-11-27 6:45:15]
# 1
The answer is..use a database. Or when this file is being updated you could have updated another file with stats but it may be too late for this. Another option might be to break the file in to smaller pieces.Message was edited by: _helloWorld_
_helloWorld_a at 2007-7-12 18:16:57 > top of Java-index,Java Essentials,Java Programming...
# 2

"Best" depends on your specific situation and requirements.

Simplest would be to maintain a Map of name to count, read line by line, and for each name, if it's already in the map, increment its count, and if not, add it to the map with a count of one

More scalable and possibly more performant would be to stick the whole thing in a database and let it find them for you.

jverda at 2007-7-12 18:16:57 > top of Java-index,Java Essentials,Java Programming...
# 3
If this is some sort of log file for a website then I guess should should have created weekly/monthly (depending on your traffic) log files which would have made something like this easier to tackle. A DB is certainly the best way though.
_helloWorld_a at 2007-7-12 18:16:57 > top of Java-index,Java Essentials,Java Programming...
# 4

If it is not in a database and it's just one file with all those data, what approach would be good in the realm of Java? Would Map still be the best choice? Would the complexity be n^2 since you would need to put everything in, then compare all the sizes?

Thanks!

I guess another example might be webhits - which is more based on how many an IP hit than the name. But either way its going to be just one file.

Message was edited by:

shaselai

shaselaia at 2007-7-12 18:16:57 > top of Java-index,Java Essentials,Java Programming...
# 5

Where is the file coming from? If you are able to sort the contents of the file, like when dumping from a database table, it will make your job a lot easier. You can use the sorted data to only have to keep track of certain entries in the file, rather than having to try to maintain millions of entries and run the risk of running out of memory.

flyingmulleta at 2007-7-12 18:16:57 > top of Java-index,Java Essentials,Java Programming...
# 6

> If it is not in a database and it's just one file

You can still put it into a DB.

> with all those data, what approach would be good in

> the realm of Java?

I thought I already said that.

> Would Map still be the best

> choice?

Simplest? Probably. Best? Only you can determine that.

> Would the complexity be n^2 since you would

> need to put everything in, then compare all the

> sizes?

No, it should be O(2N) (which is really O(N)). Inseting into the map is O(N), and then iterating once over the entries and adjusting your running top 5 is O(N).

jverda at 2007-7-12 18:16:57 > top of Java-index,Java Essentials,Java Programming...
# 7

Is there some sort of unique identifier? Because you could have more than one person with the same name which would mean you would have to check the address to for uniqueness. As Jverd says a map and a count is probably the only way left with what you have. Use some unique key like there first name, last name and house number (which still has some problems) and read them into a buffer reader adding a new object to the map for each new name and increasing a count when the same name/identifier appears more than once.

_helloWorld_a at 2007-7-12 18:16:57 > top of Java-index,Java Essentials,Java Programming...
# 8

> track of certain entries in the file, rather than

> having to try to maintain millions of entries and

> run the risk of running out of memory.

That depends on the count profile. If it's a few million names, each repeated just a few times, then yeah, memory could be a problem. If it's only a few thousand names, with the heavy hitters occurring thousands of times, then shouldn't be an issue.

jverda at 2007-7-12 18:16:57 > top of Java-index,Java Essentials,Java Programming...
# 9

> That depends on the count profile. If it's a few

> million names, each repeated just a few times, then

> yeah, memory could be a problem. If it's only a few

> thousand names, with the heavy hitters occurring

> thousands of times, then shouldn't be an issue.

Agreed. I don't know how many repeat entries there are so I was playing it safe.

flyingmulleta at 2007-7-12 18:16:58 > top of Java-index,Java Essentials,Java Programming...
# 10

> > That depends on the count profile. If it's a few

> > million names, each repeated just a few times,

> then

> > yeah, memory could be a problem. If it's only a

> few

> > thousand names, with the heavy hitters occurring

> > thousands of times, then shouldn't be an issue.

>

> Agreed. I don't know how many repeat entries there

> are so I was playing it safe.

Hence my hand-waving, wuss-out loophole in my first reply: "Best" depends on the specific situation and requriements. :-)

jverda at 2007-7-12 18:16:58 > top of Java-index,Java Essentials,Java Programming...
# 11
I am not sure exactly how much integrity is required but you could also consider processing a portion of the results and comparing the figures. It really depends on the nature of what you are doing I guess.
_helloWorld_a at 2007-7-12 18:16:58 > top of Java-index,Java Essentials,Java Programming...
# 12
> Hence my hand-waving, wuss-out loophole in my first> reply: "Best" depends on the specific situation and> requriements. :-)Experience :-)Probably gained through a lot of meetings with the business side of your company.
_helloWorld_a at 2007-7-12 18:16:58 > top of Java-index,Java Essentials,Java Programming...
# 13
I see. I guess any Map will do? Like a HashtableSo what if there are like 5 hundred thousand of names in the couple million entries? Would a Map still be a good idea or there's something else in Java that might do the trick?
shaselaia at 2007-7-12 18:16:58 > top of Java-index,Java Essentials,Java Programming...
# 14
Use a HashMap, a HashTable is thread safe and will be a bit slower.
_helloWorld_a at 2007-7-12 18:16:58 > top of Java-index,Java Essentials,Java Programming...
# 15

> I see. I guess any Map will do? Like a Hashtable

>

HashMap would be preferable.

> So what if there are hundreds of thousand of names in

> the couple million entries? Would a Map still be a

> good idea or there's something else in Java that

> might do the trick?

Depends how much memory you can give to your VM, how many names, and how long the names are.

100,000 names * 10 chars per name * 2 bytes per char is 2,000,000 bytes to store the names. There'll be other overhead, but the names themselves should dominate. That's 2 MB. Almost nothing unless you're in a J2ME environment.

jverda at 2007-7-21 22:04:33 > top of Java-index,Java Essentials,Java Programming...
# 16

Ok. What is the best approach after I insert all the unique addresses? I guess i have something like :

if (key not in table) {

table.add(key,counter);

uniqueIPList.add(key);

} else {

table.get(key)++;

}

now from above I should have a Hashtable of unique Ips and their frequencies plus an array of unique ips for keys. What is the best/efficient way to find the top 10 frequencies? Do I use a size 10 array and keep adding/removing values while I iterate the entire map?

Would the complexity be N(for insertion) + 10*N(10 for looking through array and N for iteration) = N?

Sorry I am a little rusty on complexities.. thanks!

shaselaia at 2007-7-21 22:04:33 > top of Java-index,Java Essentials,Java Programming...
# 17
Continued here: http://forum.java.sun.com/thread.jspa?threadID=5181477
jverda at 2007-7-21 22:04:33 > top of Java-index,Java Essentials,Java Programming...