Searching question
I have the following problem to solve, and I hope some kind person has some ideas about how best to tackle this.
I have n sets of numbers, each of which can be an arbitrary size, and the problem is simply this: obtain a list of k numbers which appear in the collection of sets most often, ordered by the number of times they appear.
For instance, if I have the following collection of sets:
{3, 2, 6, 7}, {7, 9, 4, 2, 8}, {2, 4, 5, 11, 12}, {3, 2, 1}
and I want the top 4 most frequent elements, I should get: 2 (which appears four times), 3, 4, 7 (twice each). In the case of a tie, the algorithm is allowed to select elements from the tied set arbitrarily - e.g. if I wanted the two most frequent items in the example above, I would expect 2 followed by either 3, 4 or 7.
Now I guess the 'naive' way to do this would be to scan through each set, counting how many times each item appears (maybe using a hashtable...), sorting the results, and picking out the top k items. Is there a more efficient way of doing this?
Any help much appreciated.
[1100 byte] By [
sampath_s] at [2007-9-30 7:33:03]

Here's some code. Your instructor will immediately know that you didn't write this unless they are a complete idiot. Try to use this as a learning opportunity rather than a free ride.public class Test2 {
public static void main(String[] args) {
// Create the 'set' (really an array of arrays)
int[][] set = {{3, 2, 6, 7}, {7, 9, 4, 2, 8}, {2, 4, 5, 11, 12}, {3, 2, 1}};
// initialize min to the largest possible number & max to the lowest
int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
// Find the range
for (int i=0; i<set.length; i++) {
for (int j=0; j><set[i].length; j++) {
if (min>set[i][j]) min=set[i][j];
if (max<set[i][j]) max=set[i][j];
}
}
System.out.println("min="+min+", max="+max); // Print range
// Create counter array and initialize
Count[] counts = new Count[max-min+1];
for (int i=0; i<counts.length; i++) counts[i] = new Count(i+min);
// Count the occurrances
for (int i=0; i<set.length; i++) {
for (int j=0; j<set[i].length; j++) {
counts[set[i][j]-min].increment();
}
}
// Sort the array
java.util.Arrays.sort(counts);
// Print the top 3
for (int i=0; i<3 && i<counts.length; i++) System.out.println(counts[i]);
}
}
class Count implements Comparable {
static final java.util.Random r = new java.util.Random();
int num, count=0;
public Count(int num) { this.num=num; }
public void increment() { count++; }
public int getNum() { return num; }
public int getCount() { return count; }
public int compareTo(Object o) {
if (count>((Count)o).getCount()) return -1;
else if (count<((Count)o).getCount()) return 1;
return r.nextInt(2)*2-1;
}
public String toString() { return "Num="+num+", Count="+count; }
}
">
It seems like the fact that numbers are divided into sets in two levels are irrelevant to the assignment, i.e. {1,2},{3,4} yields just the same result as {1,2,3,4} would do.
The assignment can be solved in a number of ways, some more efficient than others...
The solution purposed by bbritta is not necessarily very efficient:
If m is the total number of integers, and l is the difference between the value of the highest and lowest integer, the performance would be...
Find the range: O(m)
Counter array init: O(l)
Counting: O(m)
Sorting: O(l * log(l))
Total: O(m + l * log(l)) which is only ok if l is small, i.e. if all the numbers are within a small range, unlike numbers like this: {1, 207000024}, { 156, 1}...
Using a hashtable would be efficient independently of the diff between the highest and lowest integer value in the sets.
Sorting at the end is obsolete too. You could just do a select and quick select is O(n), whereas quicksort is O (n* log(n)).
Selecting k numbers from an array of n numbers using quick sort is
O(n* log(n))
Selecting k numbers from an array of n numbers using quick select is
O(n + k * log(k)) which is less, because k is always smaller than n.
Thanks for the responses. Using a select algorithm hadn't occurred to me before, but it makes sense - sorting the whole thing would be somewhat redundant. Presumably the way to go about this is to first use select to find the kth largest element in time O(n), scanning through the set to construct the collection of k elements >= this element, and finally sorting the resulting collection - which is where I presume the O(k*log k) factor comes in.
Presumably there's no way to improve the counting stage though - each element in each set needs to be processed once, so we can do no better than to scan through all of the input sets - is this correct?
Thanks again - much appreciated.
You may select and sort in one pass because quick select and quick sort is essentially the same algorithm:
Quick sort:
1 select a pivot.
2 swap elements that are larger than pivot at the left side with elements that are smaller than pivot at the right side.
3 Quick sort the left side.
4 Quick sort the right side.
In quick select you do either 3 or 4, not both (based on a simple if- test).
Using a combination algorithm would be faster than first selecting then sorting, but the big O complexity would be the same.