choosing the right categorie

I have a serie of "categories" every categorie is a set of N collections.

for example (a categorie consists of a range of years and a range of zip-codes)

categorie1 : 1910-1960 , 10000-12060

categorie2 : 1920-1955, 11050-11090

categorie3 : 1956-2005, 77000-77001

as you can see some categories fall completely into others, other categories share

some years or ZIP's with other categories. But no 2 categories can be exactly the same.

When a new "value" is given, the algorithm should give the categorie the "value" fits in the best.

for example:

value1: 1977 , 77000 -> categorie3

value2: 1944 , 11060 -> categorie2. Categorie1 and 2 are both good, but categorie2 is

better because it is narrower than categorie1

I tried something with a hashmap where a plcad the categories on but that didn't work

[887 byte] By [jochima] at [2007-10-2 8:19:29]
# 1

Far from perfect, but maybe it's of use to you:

public class Category {

int lowerBoundYear;

int upperBoundYear;

int lowerBoundZip;

int upperBoundZip;

/** */

public Category(int lby, int uby, int lbz, int ubz) {

lowerBoundYear = lby;

upperBoundYear = uby;

lowerBoundZip = lbz;

upperBoundZip = ubz;

}

/** */

public boolean equals(Object obj) {

if(obj == null) return false;

if(obj == this) return true;

if(!(obj instanceof Category)) return false;

Category other = (Category)obj;

return lowerBoundYear == other.lowerBoundYear &&

upperBoundYear == other.upperBoundYear &&

lowerBoundZip == other.lowerBoundZip &&

upperBoundZip == other.upperBoundZip;

}

/** */

public boolean valueFits(int year, int zip) {

return lowerBoundYear <= year &&

upperBoundYear >= year &&

lowerBoundZip <= zip &&

upperBoundZip >= zip;

}

/** */

public int getYearDifference() {

return upperBoundYear - lowerBoundYear;

}

/** */

public String toString() {

StringBuffer strb = new StringBuffer();

strb.append("year = "+lowerBoundYear+" - "+upperBoundYear+", ");

strb.append("zip = "+lowerBoundZip+" - "+upperBoundZip);

return strb.toString();

}

}

// -

import java.util.List;

import java.util.ArrayList;

public class CategoryList {

List<Category> categories;

/** */

public CategoryList() {

categories = new ArrayList<Category>();

}

/** */

public void add(Category c) {

if(!categories.contains(c)) {

categories.add(c);

}

}

/** */

public Category getBestFitCategory(int year, int zip) {

Category bestCat = null;

int bestYearDiff = Integer.MAX_VALUE;

for(int i = 0; i < categories.size(); i++) {

Category tempCat = categories.get(i);

int tempYearDiff = tempCat.getYearDifference();

if(tempCat.valueFits(year, zip) && tempYearDiff < bestYearDiff) {

bestCat = tempCat;

bestYearDiff = tempYearDiff;

}

}

return bestCat;

}

/** */

public String toString() {

StringBuffer strb = new StringBuffer("Categories:\n");

for(int i = 0; i < categories.size(); i++) {

strb.append("> "+(i+1)+" - "+categories.get(i)+"\n");

}

return strb.toString();

}

}

// -

public class TestCategories {

/** */

public static void main(String[] args) {

CategoryList list = new CategoryList();

Category c1 = new Category(1910, 1960, 10000, 12060);

Category c2 = new Category(1920, 1955, 11050, 11090);

Category c3 = new Category(1956, 2005, 77000, 77001);

list.add(c1); list.add(c2); list.add(c3);

System.out.println(list);

int y, z;

y = 1977; z = 77000;

System.out.println("Values: y="+y+", z="+z+"; best fit"+

" --> "+list.getBestFitCategory(y, z));

y = 1944; z = 11060;

System.out.println("Values: y="+y+", z="+z+"; best fit"+

" --> "+list.getBestFitCategory(y, z));

}

}

prometheuzza at 2007-7-16 22:18:24 > top of Java-index,Other Topics,Algorithms...
# 2
hi,I'll try it out ASAP.I think it will work, I'm just not sure how much proectime it wil take.I need something fast because it has to deal with a few million new values. That's why I start thinking with a mjashmap because retrieving an element takes O(1) there
jochima at 2007-7-16 22:18:24 > top of Java-index,Other Topics,Algorithms...
# 3
Your categories only seem to have a partial order. What do you do if there are two matching but incomparable categories?
YAT_Archivista at 2007-7-16 22:18:24 > top of Java-index,Other Topics,Algorithms...
# 4

I have rewritten your code for more generic categories, but did'nt had the time to test it allready.

package src.categories;

import java.util.Date;

/**

* holder for a category

* @author Jochim

*

*/

public class Category {

//************

//*Attributen*

//************

/**

* number of attributes for a category

*/

private int aantal;

/**

* classes of the categories

*/

private Class[] classes;

/**

* lower- and upperbound of each attribute

*/

private Object[][] values;

/**

* hashcode

*/

private final int hashcode;

//*************

//*Constructor*

//*************

/**

* construct a new categorie

*/

public Category(int aantal, Class[] classes, Object[][] values){

this.aantal = aantal;

this.classes = classes;

this.values = values;

//calculate the hashcode

int tempHashCode = this.aantal;

for(int i = 0 ; i < classes.length ; i++)

tempHashCode += classes[i].hashCode();

for(int i = 0; i < values.length ; i++)

for (int j = 0 ; j < values[i].length ; j++)

tempHashCode+= values[i][j].hashCode();

hashcode = tempHashCode;

}

//***********

//*Accesoren*

//***********

/**

* get the number of attributes

*/

public int getAantal(){

return aantal;

}

/**

* get the classes of the attributes

* @return classes

*/

public Class[] getClasses(){

return classes;

}

/**

* get the lower- and upperbounds of each attribute.

* @return values

*/

public Object[][] getValues(){

return values;

}

//*****************

//*Andere methodes*

//*****************

public boolean equals(Object obj) {

if(obj == null) return false;

if(obj == this) return true;

if(!(obj instanceof Category)) return false;

Category other = (Category)obj;

//number of attributes must be the same

if(aantal != other.getAantal())

return false;

//classes must be the same

Class[] otherClasses = other.getClasses();

for(int i = 0 ; i < classes.length ; i++)

if(!classes[i].equals(otherClasses[i]))

return false;

//values must be the same

Object[][] otherValues = other.getValues();

for(int i = 0; i < otherValues.length ; i++)

for (int j = 0 ; j < otherValues[i].length ; j++)

if (!values[i][j].equals(otherValues[i][j]))

return false;

return true;

}

/**

* @return the hashcode of this Category

*/

public int hashCode() {

return hashcode;

}

/**

* check if a values fits the category

* @return

*/

public boolean valueFits(Class[] objectClasses, Object[] objectValues) {

if (objectClasses.length != objectValues.length) return false;

if(objectClasses.length != classes.length) return false;

//classes must be the same

for(int i = 0 ; i < classes.length ; i++)

if(!classes[i].equals(objectClasses[i]))

return false;

//compare the values

.................................;;

}

}

jochima at 2007-7-16 22:18:24 > top of Java-index,Other Topics,Algorithms...
# 5

instead of using a categorylist for retrieving the most right category i came up with something like this:

* devide each attribute in some intervals. ex. birthyears: 1909-1950, 1951 - 1980, 1981 - 2005

* give each attribute a number, this will be used as an order for examination

* we will now build a hashmap:

key: a combination of 1 interval for each attribute

value: a categorylist with all the categorys that fall (partly or completely) in the 3 intervals

We form a table wich contains all the categories. We give each category a level. We start initially with level 0.

If a category falls in an interval the level is increased by one.

example: which categories fall into 60000- 65000, 1910-1920, M? we get the table

| categorie1 | level 3 |

| categorie2 | level 2 |

| categorie3 | level 3 |

| categorie4 | level 1 |

So the categorylist will conatin category1 & 3

The next step is finding the categories who fall into 60000- 65000, 1910-1920, F

For that we only need to bring the categories who have level 3 to 2 again. Now we can do test test for the third attribute on all the categories who have level2.

Using this backtracking will decrease the number of operations,...

* finding the best category for a new tuple.

* find for each attribute the interval which the tuple's values falls in.

* the combination of the above will form the key for finding the categorylist on tha map.

* Use the categorylist for finding the best category.

remarks:

--

* When retrieving a categorylist from the map, it must be checked if there is a category which can hold the tuple. Because it isn't said that all the points in the threedimensional space can be mapped on 1 of the categories.

* using a list for all the categories would make a "find"-operation O(n?. Now only the find-operation on the little categorylist will take O(n?.

* There must be an optimal rate between the number of intervals per attribute, and the number of categories.

* the more intervals at the attributes --> the less categories on 1 list

* the more intervals at the attributes --> the more work to map the categories to the right key on the map

Question: Does somebody now a good rate between the number of intervals per attribute, and the number of categories?

jochima at 2007-7-16 22:18:24 > top of Java-index,Other Topics,Algorithms...
# 6
> Your categories only seem to have a partial order.> What do you do if there are two matching but> incomparable categories?I don't understand your question ...
jochima at 2007-7-16 22:18:24 > top of Java-index,Other Topics,Algorithms...
# 7

> I don't understand your question ...

I'll give an example:

category1: 1910 - 1960, 11050 - 12060

category2: 1920 - 1970, 11050 - 12060

category3: 1930 - 1950, 10000 - 77001

value: 1940, 11950

All the categories are good, but none is narrower than the others. Categories 1 and 2 have incomparable date spans, and category 3 has a narrower date span than 1 and 2, but a wider zip span.

YAT_Archivista at 2007-7-16 22:18:24 > top of Java-index,Other Topics,Algorithms...
# 8

there has to be some "weight" per attribute that chooses the one above the other because its more important. So if 2 intervals of different attributes are the same, it chooses the one with the biggest weight.

If 2 intevrals of the same attribute are *** narrow, it has to choose one at random. But that's not possible in my database

jochima at 2007-7-16 22:18:24 > top of Java-index,Other Topics,Algorithms...