How to match?

Say if I have a set A of people, each person in this set has knowledge on some of subjects. Now I have a new comer, with knowledge of some subjects. I want to find the person in A whose knowledge set is best to match the new comer. Is there good method or algorithm to do this job.

Further more, if I want to find several people in set A to completely cover the subjects of new comer's knowledge. Is there any method that I can find the best combination?

Thanks in advance

Edward

luqiyi@hotmail.com

[530 byte] By [luqiyia] at [2007-10-2 17:56:34]
# 1

insufficient data for a meaningful answer.

Suppose I know everything that you know and a whole lot more as well. Bob on the other hand knows almost everything that you know and nothing else.

Who is closer to you?

You have to choose how you want your metric to work.

If you encode your knowledge as a long bit stream of facts, (if you know fact 1 set bit one, if you know fact 2 set bit 2 ...) you can always compute the Hamming distance between two bit streams (xor the two bit streams, count the bits - small numbers mean almost all the bits were the same) and treat the Hamming distance as the knowledge metric.

Using the Hamming method, Bob would rank as being closer to you because both of you have many things that you don't know in common. If that works for you, it is fairly simple to compute Hamming distance.

As for finding a group of people that covers the newcomer's knowledge, until you define what you mean by "best combination" your question has no answer. Do you want to cover the newcomer with the smallest possible group of people? If that is what you want, I suspect that that may be a combinatorially intractable problem.

marlin314a at 2007-7-13 19:15:16 > top of Java-index,Other Topics,Algorithms...
# 2

Thank you Marlin for the help.

What I mean is that Bob is a closer choice. If X is knowledgeable person knowing much more subjects, I would perfer choose a combination of Y and Z to cover the new comer's subjects. Because if every new comer chooses X, X is overloaded. I also want the combination group is as small as possible when the size of knowledge of each member in the group is closed to the size of new comer.

The problem is that the total size of knowledge subjects is quite big. There are roughly 100K subjects. The size is also variable.

luqiyia at 2007-7-13 19:15:16 > top of Java-index,Other Topics,Algorithms...
# 3

Marlin,

I tought over again and checked wikipedia with the key words you gave. Probably the similar algorithm Levenshtein distance can help.

Wiki says Levenshtein distance can be used on the job like spelling check.. This is somehting similar to what I want. I can search the person with the most closed knowledge set, and then extract all of the overlapped subjects and repeat this procedue again untill all of the new comer's sbjected are covered.

100K size.. probably I can use a 3 bytes hex data to build the metric.

Thanks again for the help.. Let me know if anything else I should know.

cheers,

Edward

luqiyia at 2007-7-13 19:15:16 > top of Java-index,Other Topics,Algorithms...
# 4
oops, not exactly what Levenshtein distance does. subjects can't be swapped..I should use Hamming distance, but with size restriction..
luqiyia at 2007-7-13 19:15:16 > top of Java-index,Other Topics,Algorithms...
# 5

"I can search the person with the most closed knowledge set, and then extract all of the overlapped subjects and repeat this procedue again untill all of the new comer's sbjected are covered."

This is not necessary the best group of combination. Is there any better way to the group of best combination?

luqiyia at 2007-7-13 19:15:16 > top of Java-index,Other Topics,Algorithms...
# 6

Yes, Levenshtein is not appropriate because it allows swaps. Hamming is appropriate. I don't know what you mean by size restriction.

I mocked up Hamming.

100K topics and 32 bits/int == 3K ints/ individual

the code below whacks through 10,000 distance matches about every 2 seconds.

so if you have a collection of 100K individuals it will fill about 300M of memory to hold their knowledge vectors and will take about 15 to 20 seconds to find the optimal match using Hamming distance.

Something that I will point out about Hamming that may not work for you is this. it counts ignorance the same way it counts knowledge. Thus if I kknow everything that you know, but I know a thousand things and you only know a hundred things our Hamming distance is 900 (the count of things that I know that you don't know) If Bob knows only a hundred things but they are a completely different hundred than the things you know, the distance between you and Bob is 200, much closer than your distance to me. Bob is not actually close in knowledge, rather he is close in ignorance.

You can of course weight those scores. i.e. you can independently count the score for knowledge match and the score for ignorance match and bias by some amount in favor of knowledge match.

As for finding a covering set, I think that you want to do something like this:

put the newcomers knowledge in a buffer

begin{

set a threshold

go through your individuals in random order

find the first individual whose knowledge matches the buffer within the threshold

subtract the matched knowledge from the buffer.

repeat

}

You will need to do some playing with the way you set the threshold. You want the people you add to the covering set to match on more than just one single knowledge point if you can get it. If you can't find a cover at one threshold you need to reduce it and try again.

Here is the code I used to measure speed.

public static void main(String args[]){

int[] foofoo = new int[3000];

int[] feefee = new int[3000];

for(int i = 1; i<1000000; i++){

int t = 0;

for(int j = 0; j<3000; j++){

t += bitCount(foofoo[j]^feefee[j]);

}

if (i%10000 == 0)System.out.println(i);

}

}

static int bitCount(int foo){

int m = 0x55555555;

int r = (foo&m) + ((foo>>>1)&m);

m = 0x33333333;

r = (r&m) + ((r>>>2)&m);

m = 0x0F0F0F0F;

r = (r&m) + ((r>>>4)&m);

m = 0x00FF00FF;

r = (r&m) + ((r>>>8)&m);

m = 0x0000FFFF;

return r = (r&m) + ((r>>>16)&m);

}

marlin314a at 2007-7-13 19:15:16 > top of Java-index,Other Topics,Algorithms...
# 7
Thank you so much, Marlin:)I will try it. Your help is deeply appreciated.
luqiyia at 2007-7-13 19:15:16 > top of Java-index,Other Topics,Algorithms...