Algorithm of finding out similar products
hi, i have a problem of how to find out similar products in a web store.
There is 3 tables in database: Product, Property and Product_Property for many-on-many relationship, each product can have one or more properties. for example:
Product A has: p1, p2, p5, p6, p9
B has: p2, p5, p6, p7, p9, p10
C has: p1, p2, p7
A and B has 4 same property but A and C has only 2 same property,
so that the product which is similar with A should be B but not C.
Could you give me some suggestions? I don't think sql can solve this problem.
Thanks in advance!
[601 byte] By [
asklxfa] at [2007-10-3 4:18:18]

A lazy Java solution would be to use a Set for those product properties:Set propertiesA= new HashSet(); // populate it with A's properties
Set propertiesB= new HashSet(); // populate it with B's properties
...
Set commonProperties= propertiesA.retainAll(propertiesB);
int nofPropertiesInCommon= commonProperties.size();
kind regards,
Jos
JosAHa at 2007-7-14 22:19:57 >

And a lazy SQL solution would be to use a prepared statement like
SELECT product.*, COUNT(*) AS common
FROM product_property pp
INNER JOIN product_property pp2 ON pp2.property_id=pp.property_id AND pp2.product_id <> pp.product_id
INNER JOIN product ON product.id=pp2.product_id
WHERE pp.product_id = ?
GROUP BY product.id
ORDER BY common DESC, product.name ASC
where you just insert the product_id you are interested in and then you get a list of more or less similar products ranked by similarity.
Let's time it! As always when concerned about performance, run with "java -server":
public class t
{
public static void main(String args[])
{
System.out.println("Ignore the first few timings.");
System.out.println("They may include Hotspot compilation time.");
System.out.println("I hope you are running me with \"java -server\"!");
for (int n = 0; n < 5; n++)
doit();
}
public static void doit()
{
int total = 0;
long start = System.currentTimeMillis();
for (int n = 0; n < 1000000; n++) {
Set human = new HashSet();
human.add("legs");
human.add("arms");
human.add("head");
human.add("eyes");
Set fish = new HashSet();
fish.add("gills");
fish.add("tail");
fish.add("head");
fish.add("eyes");
human.retainAll(fish);
int count = human.size();
total += count; // Prevent optimizer from removing entire loop
}
long end = System.currentTimeMillis();
System.out.println("time " + (end - start) + " ms, total " + total);
}
}
My PC executes that 1,000,000 iteration loop in a second. So 100,000 rows would be a tenth of a second. A hundred thousand anything is usually not very much when you consider that current computers execute one or two billion operations per second.
There are also solutions using bit masks if you really need to do that in less than 1/10th a second.
Try out the SQL solution too.