Best data structure to maintain top N elements
Hello,
The problem at hand is the following. I need to run 1 million search and return back only top 100 scores. I do not wish to store all million scores, sort them and only return top 100. I'd like to maintain a "list" of top 100 scores. As new search produces a score, I'd like to very quickly see if the resulting score is greater than lowest guy in top 100 and if so insert him in proper place. At the same time, removing the lowest guy from the list, thus maintaining top 100. The "list" should be sorted, I am guessing, for fast searching, insertion,deletion. Can someone please recomment the best data structure / algorithm to use in this situation? I am new at this and do not have much experience. Thank you in advance.
[744 byte] By [
toyrunra] at [2007-10-2 10:24:43]

Use a SortedSet (TreeSet) to hold the 100, or else just a List (LinkedLit or ArrayList) that you sort each time you add new element.Maintain a separate variable for lowest high score, or just compare each score against the first or last element of the Set or List.
[url= http://www.onjava.com/lpt/a/3286]Making Java Objects Comparable[/url] http://java.sun.com/docs/books/tutorial/collections/interfaces/order.html http://www.javaworld.com/javaworld/jw-12-2002/jw-1227-sort.html
A traditional method to pull and retain the top N elements in some collection is to keep a heap of N elements (otherwise known as a priority queue - the idea being that the high priority elements move to the front of the queue so that you can quickly remove high priority items)
You define the priority of elements in this queue to be those with LOWER scores, because you want the low scoring folks to move to the front of the queue where they can be thrown out quickly.
You toss the first N elements, in your case 100, into the queue, then for each new element, if the new guy is worse than or equal to the top of the queue you skip it, otherwise you remove the top of the queue and insert the new guy.
when you are finished, the top N elements will be in the heap. You can then remove them from the heap. Because the heap orders them from LOWEST, they first one you pull from the heap will be the element that belongs in slot 100 of your list of the top 100.
The advantage of using a heap or priority queue over keeping a sorted list of the top 100 is that the insertions and deletions from the heap as you go along are a little quicker than doing the same thing to a sorted list
Of course, if you need to have a sorted list of the top 100 that you have seen so far, at every moment as you go down the list of a million, keeping the sorted list is just as efficient.
TreeSet turned out to be the fastest struct to use in this case. It maintains only top 100 respondents, sorts it only on ADD, root is always the lowest score, so only 1 check is needed to see if new element will be inserted. All operations are LOG(n). Thanks guys.
I question your assertion that the least element in a tree set is the root of the tree. If the tree is balanced as it should be for O(log(n)) performance, then some middle element should be at the root of the tree and the least element is all the way down the left branch.
If the TreeSet class has been written to cache the least element then you can get it in O(1) time. If not, you are paying O(log(n)) time to fetch the least element.
Granted, since in your case you have n fixed at 100, this is also a constant time. It is just larger than it would be if you cache the smallest element as jverd first suggested. As you pointed out, you will be doing that comparison a million times even if you only do adds a small fraction of that time.
On the other hand, as soon as it's fast enough, you're done.
> Hello,
>
> The problem at hand is the following. I need to run 1
> million search and return back only top 100 scores. I
> do not wish to store all million scores, sort them
> and only return top 100. I'd like to maintain a
> "list" of top 100 scores. As new search produces a
> score, I'd like to very quickly see if the resulting
> score is greater than lowest guy in top 100 and if so
> insert him in proper place. At the same time,
> removing the lowest guy from the list, thus
> maintaining top 100. The "list" should be sorted, I
> am guessing, for fast searching, insertion,deletion.
> Can someone please recomment the best data structure
> / algorithm to use in this situation? I am new at
> this and do not have much experience. Thank you in
> advance.
Store them?
Like in memory?
So you want to keep 100 entries, not 1 million in memory. And when someone adds a new one you either put it in that list or discard it (since putting it anywhere else would be storing it.)
You can do it absolutely any way that you want - 100 is so small that nothing you do is going to significantly impact the perception of how fast it is.
