choosing the right categorie
I have a serie of "categories" every categorie is a set of N collections.
for example (a categorie consists of a range of years and a range of zip-codes)
categorie1 : 1910-1960 , 10000-12060
categorie2 : 1920-1955, 11050-11090
categorie3 : 1956-2005, 77000-77001
as you can see some categories fall completely into others, other categories share
some years or ZIP's with other categories. But no 2 categories can be exactly the same.
When a new "value" is given, the algorithm should give the categorie the "value" fits in the best.
for example:
value1: 1977 , 77000 -> categorie3
value2: 1944 , 11060 -> categorie2. Categorie1 and 2 are both good, but categorie2 is
better because it is narrower than categorie1
I tried something with a hashmap where a plcad the categories on but that didn't work
[887 byte] By [
jochima] at [2007-10-2 8:19:29]

Far from perfect, but maybe it's of use to you:
public class Category {
int lowerBoundYear;
int upperBoundYear;
int lowerBoundZip;
int upperBoundZip;
/** */
public Category(int lby, int uby, int lbz, int ubz) {
lowerBoundYear = lby;
upperBoundYear = uby;
lowerBoundZip = lbz;
upperBoundZip = ubz;
}
/** */
public boolean equals(Object obj) {
if(obj == null) return false;
if(obj == this) return true;
if(!(obj instanceof Category)) return false;
Category other = (Category)obj;
return lowerBoundYear == other.lowerBoundYear &&
upperBoundYear == other.upperBoundYear &&
lowerBoundZip == other.lowerBoundZip &&
upperBoundZip == other.upperBoundZip;
}
/** */
public boolean valueFits(int year, int zip) {
return lowerBoundYear <= year &&
upperBoundYear >= year &&
lowerBoundZip <= zip &&
upperBoundZip >= zip;
}
/** */
public int getYearDifference() {
return upperBoundYear - lowerBoundYear;
}
/** */
public String toString() {
StringBuffer strb = new StringBuffer();
strb.append("year = "+lowerBoundYear+" - "+upperBoundYear+", ");
strb.append("zip = "+lowerBoundZip+" - "+upperBoundZip);
return strb.toString();
}
}
// -
import java.util.List;
import java.util.ArrayList;
public class CategoryList {
List<Category> categories;
/** */
public CategoryList() {
categories = new ArrayList<Category>();
}
/** */
public void add(Category c) {
if(!categories.contains(c)) {
categories.add(c);
}
}
/** */
public Category getBestFitCategory(int year, int zip) {
Category bestCat = null;
int bestYearDiff = Integer.MAX_VALUE;
for(int i = 0; i < categories.size(); i++) {
Category tempCat = categories.get(i);
int tempYearDiff = tempCat.getYearDifference();
if(tempCat.valueFits(year, zip) && tempYearDiff < bestYearDiff) {
bestCat = tempCat;
bestYearDiff = tempYearDiff;
}
}
return bestCat;
}
/** */
public String toString() {
StringBuffer strb = new StringBuffer("Categories:\n");
for(int i = 0; i < categories.size(); i++) {
strb.append("> "+(i+1)+" - "+categories.get(i)+"\n");
}
return strb.toString();
}
}
// -
public class TestCategories {
/** */
public static void main(String[] args) {
CategoryList list = new CategoryList();
Category c1 = new Category(1910, 1960, 10000, 12060);
Category c2 = new Category(1920, 1955, 11050, 11090);
Category c3 = new Category(1956, 2005, 77000, 77001);
list.add(c1); list.add(c2); list.add(c3);
System.out.println(list);
int y, z;
y = 1977; z = 77000;
System.out.println("Values: y="+y+", z="+z+"; best fit"+
" --> "+list.getBestFitCategory(y, z));
y = 1944; z = 11060;
System.out.println("Values: y="+y+", z="+z+"; best fit"+
" --> "+list.getBestFitCategory(y, z));
}
}
hi,I'll try it out ASAP.I think it will work, I'm just not sure how much proectime it wil take.I need something fast because it has to deal with a few million new values. That's why I start thinking with a mjashmap because retrieving an element takes O(1) there
Your categories only seem to have a partial order. What do you do if there are two matching but incomparable categories?
I have rewritten your code for more generic categories, but did'nt had the time to test it allready.
package src.categories;
import java.util.Date;
/**
* holder for a category
* @author Jochim
*
*/
public class Category {
//************
//*Attributen*
//************
/**
* number of attributes for a category
*/
private int aantal;
/**
* classes of the categories
*/
private Class[] classes;
/**
* lower- and upperbound of each attribute
*/
private Object[][] values;
/**
* hashcode
*/
private final int hashcode;
//*************
//*Constructor*
//*************
/**
* construct a new categorie
*/
public Category(int aantal, Class[] classes, Object[][] values){
this.aantal = aantal;
this.classes = classes;
this.values = values;
//calculate the hashcode
int tempHashCode = this.aantal;
for(int i = 0 ; i < classes.length ; i++)
tempHashCode += classes[i].hashCode();
for(int i = 0; i < values.length ; i++)
for (int j = 0 ; j < values[i].length ; j++)
tempHashCode+= values[i][j].hashCode();
hashcode = tempHashCode;
}
//***********
//*Accesoren*
//***********
/**
* get the number of attributes
*/
public int getAantal(){
return aantal;
}
/**
* get the classes of the attributes
* @return classes
*/
public Class[] getClasses(){
return classes;
}
/**
* get the lower- and upperbounds of each attribute.
* @return values
*/
public Object[][] getValues(){
return values;
}
//*****************
//*Andere methodes*
//*****************
public boolean equals(Object obj) {
if(obj == null) return false;
if(obj == this) return true;
if(!(obj instanceof Category)) return false;
Category other = (Category)obj;
//number of attributes must be the same
if(aantal != other.getAantal())
return false;
//classes must be the same
Class[] otherClasses = other.getClasses();
for(int i = 0 ; i < classes.length ; i++)
if(!classes[i].equals(otherClasses[i]))
return false;
//values must be the same
Object[][] otherValues = other.getValues();
for(int i = 0; i < otherValues.length ; i++)
for (int j = 0 ; j < otherValues[i].length ; j++)
if (!values[i][j].equals(otherValues[i][j]))
return false;
return true;
}
/**
* @return the hashcode of this Category
*/
public int hashCode() {
return hashcode;
}
/**
* check if a values fits the category
* @return
*/
public boolean valueFits(Class[] objectClasses, Object[] objectValues) {
if (objectClasses.length != objectValues.length) return false;
if(objectClasses.length != classes.length) return false;
//classes must be the same
for(int i = 0 ; i < classes.length ; i++)
if(!classes[i].equals(objectClasses[i]))
return false;
//compare the values
.................................;;
}
}
instead of using a categorylist for retrieving the most right category i came up with something like this:
* devide each attribute in some intervals. ex. birthyears: 1909-1950, 1951 - 1980, 1981 - 2005
* give each attribute a number, this will be used as an order for examination
* we will now build a hashmap:
key: a combination of 1 interval for each attribute
value: a categorylist with all the categorys that fall (partly or completely) in the 3 intervals
We form a table wich contains all the categories. We give each category a level. We start initially with level 0.
If a category falls in an interval the level is increased by one.
example: which categories fall into 60000- 65000, 1910-1920, M? we get the table
| categorie1 | level 3 |
| categorie2 | level 2 |
| categorie3 | level 3 |
| categorie4 | level 1 |
So the categorylist will conatin category1 & 3
The next step is finding the categories who fall into 60000- 65000, 1910-1920, F
For that we only need to bring the categories who have level 3 to 2 again. Now we can do test test for the third attribute on all the categories who have level2.
Using this backtracking will decrease the number of operations,...
* finding the best category for a new tuple.
* find for each attribute the interval which the tuple's values falls in.
* the combination of the above will form the key for finding the categorylist on tha map.
* Use the categorylist for finding the best category.
remarks:
--
* When retrieving a categorylist from the map, it must be checked if there is a category which can hold the tuple. Because it isn't said that all the points in the threedimensional space can be mapped on 1 of the categories.
* using a list for all the categories would make a "find"-operation O(n?. Now only the find-operation on the little categorylist will take O(n?.
* There must be an optimal rate between the number of intervals per attribute, and the number of categories.
* the more intervals at the attributes --> the less categories on 1 list
* the more intervals at the attributes --> the more work to map the categories to the right key on the map
Question: Does somebody now a good rate between the number of intervals per attribute, and the number of categories?
> Your categories only seem to have a partial order.> What do you do if there are two matching but> incomparable categories?I don't understand your question ...
> I don't understand your question ...
I'll give an example:
category1: 1910 - 1960, 11050 - 12060
category2: 1920 - 1970, 11050 - 12060
category3: 1930 - 1950, 10000 - 77001
value: 1940, 11950
All the categories are good, but none is narrower than the others. Categories 1 and 2 have incomparable date spans, and category 3 has a narrower date span than 1 and 2, but a wider zip span.
there has to be some "weight" per attribute that chooses the one above the other because its more important. So if 2 intervals of different attributes are the same, it chooses the one with the biggest weight.
If 2 intevrals of the same attribute are *** narrow, it has to choose one at random. But that's not possible in my database