Weighted Sorting
Hey guys,
I'm looking at this particular problem of weighted sorting and group inference. Maybe someone can help me devise an efficient and robust algorithm for it.
The scenario is that I have an array (or any other data collection for that matter) of string values. For example:
a[1] ="home/public/file1"
a[2] ="home/ajax/file2"
a[3] ="home/math/file3"
a[4] ="home/ajax/file4"
a[5] ="home/ajax/file5"
:
:
a[n] ="..."
Now what I want to derieve from the base collection is a set of similar string values. If I sort the array, the values can be grouped but what would be the best solution for filtering/fetching a group of similar values, such that the above example would render a new filtered collection based on weighted sorting:
a[1] ="home/ajax/file2"
a[2] ="home/ajax/file4"
a[3] ="home/ajax/file5"
:
:
a[n] ="..."
I hope the problem description is clear enough. I don't know how else to refer to it except for "weighted sorting". Maybe there's a formal term for it based on lexicography or heuristic patterns. Maybe regular expressions can be applied to it as well.
Maybe you guys have some thoughts or suggestions on solution.
[1517 byte] By [
NickRicea] at [2007-10-2 0:16:06]

> I hope the problem description is clear enough. I
> don't know how else to refer to it except for
> "weighted sorting". Maybe there's a formal term for
> it based on lexicography or heuristic patterns.
Have you tried String.compareToIgnoreCase(String str);? The String API says this: "Compares two strings lexicographically, ignoring case differences", which looks like your intention. Or am I missing something?
Thanks for your reply.
I'm not actually looking for string comparison but rather a filtering of array elements based on the common prefix. So basically the algorithm should take an array of string elements as input and render a new filtered array of string elements which have a similar prefix.
It's not at all obvious what you want to do. I gather that it's something to do with the most common prefix, but it's clearly more constrained than that because the most common prefix is "". Could you try explaining it differently?
Doing a simple sort on the Strings would produce the result as you have described it. So you need to be more specific about what it is you need to see that differs from a standard sort.
I guess the earlier example is not the best, but I'll try explaining with another one:
Input Array (unsorted, 8 elements):
a[1] = "/home/dir1/file1"
a[2] = "/home/dir2/file3"
a[3] = "/home/dir1/file3"
a[4] = "/home/dir2/file2"
a[5] = "/home/dir1/file2"
a[6] = "/home/dir1/file4"
a[7] = "/home/dir2/file1"
a[8] = "/home/dir3/file2"
Output Array (sorted 4 elements):
b[1] = "/home/dir1/file1"
b[2] = "/home/dir1/file2"
b[3] = "/home/dir1/file3"
b[4] = "/home/dir1/file4"
The output array now holds the most number of closely similar string elements. Based on a prefix pattern lookup the algorithm groups and filters the most similar string values (since /home/dir1 has the most number of prefix occurences). Makes sense guys?!?!
I've also been looking at Regular Expressions to solve this by utilizing the backreference construct. But I still haven't made any progress so far.
> Based on a prefix pattern lookup the algorithm groups and filters the most similar string values (since /home/dir1 has the most number of prefix occurences).
But why must the prefix by "/home/dir1" rather than "/home/dir" or "/home", both of which have 8 matches? Is there a constraint that the part of the string after the prefix may not contain "/"? If so, surely you just need to take the input strings, strip off the trailing non-/ characters and count repetitions of the remaining strings.
That's the idea: the prefix will be the substring till the last slash. But even in that scenario comparing a prefix for each array element with each other, and then constructing a weighted map of some sort will be very inefficient (consider a 500 elements array - that's 500x500 = 250000 iterations for comparison). Isn't it?
There has to be a more efficient way to do that.
No, its linear since as YAT_Archivist said, you first stripaway the last non '/' then use a map to count duplicates.
parza at 2007-7-15 16:17:28 >

I really don't think this is a hard problem.
Start out with a skelaton for a comparator-- define a class that implements java.util.Comparator. In your compare method, tokenize (using perhaps java.util.Tokenizer) based on the delimiter '/'. Compare the tokens from either argument against each other using the compare method of java.lang.String class. If the result is not 0 then return the result. Otherwise repeat (tokenize, compare). If you run out of tokens then return 0.
Put your Strings in a java.util.ArrayList (instead of your array). Use java.util.Collections to sort the ArrayList using the comparator you have defined above.
You then have a sorted ArrayList of your Strings.
create a map
for each string {
String path = the prefix of the string.
look up path in the Map
if it does not exist add it with the value 1
if it does exist, bump up the value in the map.
}
If you want all results, dump the map into (key, value) pairs and sort by value.
If you only want the top result, you simply trap it in the loop (meaning every time you bump up a value, if it exceeds the largest value you have yet seen - remember that key as the most popular.)