Algorithm of finding out similar products

hi, i have a problem of how to find out similar products in a web store.

There is 3 tables in database: Product, Property and Product_Property for many-on-many relationship, each product can have one or more properties. for example:

Product A has: p1, p2, p5, p6, p9

B has: p2, p5, p6, p7, p9, p10

C has: p1, p2, p7

A and B has 4 same property but A and C has only 2 same property,

so that the product which is similar with A should be B but not C.

Could you give me some suggestions? I don't think sql can solve this problem.

Thanks in advance!

[601 byte] By [asklxfa] at [2007-10-3 4:18:18]
# 1

A lazy Java solution would be to use a Set for those product properties:Set propertiesA= new HashSet(); // populate it with A's properties

Set propertiesB= new HashSet(); // populate it with B's properties

...

Set commonProperties= propertiesA.retainAll(propertiesB);

int nofPropertiesInCommon= commonProperties.size();

kind regards,

Jos

JosAHa at 2007-7-14 22:19:57 > top of Java-index,Other Topics,Algorithms...
# 2

And a lazy SQL solution would be to use a prepared statement like

SELECT product.*, COUNT(*) AS common

FROM product_property pp

INNER JOIN product_property pp2 ON pp2.property_id=pp.property_id AND pp2.product_id <> pp.product_id

INNER JOIN product ON product.id=pp2.product_id

WHERE pp.product_id = ?

GROUP BY product.id

ORDER BY common DESC, product.name ASC

where you just insert the product_id you are interested in and then you get a list of more or less similar products ranked by similarity.

perkelea at 2007-7-14 22:19:57 > top of Java-index,Other Topics,Algorithms...
# 3
Thank a lot!it works! but i'm not sure about the performance if there are over 100,000 products in product table and each product may have 5-10 properties.
asklxfa at 2007-7-14 22:19:57 > top of Java-index,Other Topics,Algorithms...
# 4

Let's time it! As always when concerned about performance, run with "java -server":

public class t

{

public static void main(String args[])

{

System.out.println("Ignore the first few timings.");

System.out.println("They may include Hotspot compilation time.");

System.out.println("I hope you are running me with \"java -server\"!");

for (int n = 0; n < 5; n++)

doit();

}

public static void doit()

{

int total = 0;

long start = System.currentTimeMillis();

for (int n = 0; n < 1000000; n++) {

Set human = new HashSet();

human.add("legs");

human.add("arms");

human.add("head");

human.add("eyes");

Set fish = new HashSet();

fish.add("gills");

fish.add("tail");

fish.add("head");

fish.add("eyes");

human.retainAll(fish);

int count = human.size();

total += count; // Prevent optimizer from removing entire loop

}

long end = System.currentTimeMillis();

System.out.println("time " + (end - start) + " ms, total " + total);

}

}

My PC executes that 1,000,000 iteration loop in a second. So 100,000 rows would be a tenth of a second. A hundred thousand anything is usually not very much when you consider that current computers execute one or two billion operations per second.

There are also solutions using bit masks if you really need to do that in less than 1/10th a second.

Try out the SQL solution too.

sjasjaa at 2007-7-14 22:19:57 > top of Java-index,Other Topics,Algorithms...