search identical rows in a String[][]
input:
-
a Table: String[][] inout
output:
a Table: String[][] output
What do I have to do whith it?
In the input some rows will be identical to eachother.
ex.
for (int i = 0 ; i < input[0].length ; i++)
input[x][i].equals(input[y][i])
what I do now is:
start at row 0;
for( all the input rows)
take the next row from input;
if(no row in result is identical)
add the row to the result table;
end for
it works, but the calculation time is very high. Due to the nested for
if(no row in result is identical) -> all the result rows has to be iterated in here.
Is there a way to make it faster?
[782 byte] By [
jochima] at [2007-10-2 5:15:06]

> it works, but the calculation time is very high. Due
> to the nested for
> if(no row in result is identical) -> all the result
> rows has to be iterated in here.
>
> Is there a way to make it faster?
If I understand your current implementation, it's an O(n? process. One approach is to create a class called Row, and implement the equals and hashCode methods. Then you can add the rows to a HashSet and you will get only the unique rows and it will have an O(n) time complexity (if I'm not mistaken).
idd,the calculation time is O(n?, and I was looking for the O(n log n) variant, but an O(n) is even better
> idd,
> the calculation time is O(n?, and I was looking for
> the O(n log n) variant, but an O(n) is even better
If you do a binary sort and then go through the list comparing each row the previous value, you get O(n + nlogn). I'm going to take a wild guess and say that hashing will actually perform faster.
I implemented the hash, and I used a HashMap, 'cause I also need to remember the original rownumber, and the number of rows in the original table that are labeled on it.
As keys I use the Strings from the input that have to be the same, and as values the number of lines that are labeled on it an their numbers.
in the FrequencyRowKey I implemented:
public boolean equals(FrequencyRow that){
System.out.println("equals opgeroepen");
int aant = 0;
for (int i = 0 ; i < cells.length ; i++){
if(cells[i].getText().equals(that.getCell(i).getText()))
aant++;
}
if(aant == cells.length)
return true;
return false;
}
public int hashCode() {
int result = HashCodeUtil.SEED;
for (int i = 0 ; i < cells.length ; i++){
result = HashCodeUtil.hash(result, cells[i].getText());
}
return result;
}
My next ide was as follows:
if map contains key
update the value
else insert the key and value
The problem is I never go into that If, and always in the else.
So I tested the hashcode equalities by using an iterator
Collection<FrequencyRowKey> keys = map.keySet();
for (Iterator it = (Iterator) keys.iterator(); it.hasNext();){
if(it.next().hashCode() == current.hashCode())
System.out.println("same hascodes found");
With current the next row I want to put on the map.
What I see then is that the hashcodes are equal, but still when I use
if(map.containsKey(current)){
((FrequencyRowValue)map.get(current)).addtuples(currentValue);
}
else map.put(current,currentValue);
he keeps adding new key-value pairs instead of just updating the values from the corresponding key.
Is there something I looked over?
> public boolean equals(FrequencyRow that){
you should override equals(Object).
as soon as cells != that.cells you can return false.
No need to continue going through the whole array.
> he keeps adding new key-value pairs instead of just updating the
> values from the corresponding key.
> Is there something I looked over?
The contains(...) method is going to call the equals(Object) method,
not the equals(FrequencyRow) method.
Put a System.out.println in your equals method to confirm it's not being called.
> he keeps adding new key-value pairs instead of just
> updating the values from the corresponding key.
> Is there something I looked over?
I skimmed over the code so there may be other issues but here's one that will definitely cause this problem:
public boolean equals(FrequencyRow that)
This is a very common mistake that comes from a misunderstanding of how Java binds methods. Java uses compile-time or static binding for overloaded methods. Also, equals(FrequencyRow) doesn't override equals(Object), it overloads it.
Your class now has two equals methods:
equals(Object)
equals(FrequencyRow)
The compiler determines which one to call based on the type of the reference passed into it. For example: Try this with two instances that should be equals (but are not the same instance)
FrequencyRow fr1 = ...;
FrequencyRow fr2 = ...;
System.out.println(fr1.equals(fr2));
System.out.println(fr1.equals((Object) fr2));
What you should find is that the first call returns true and the second returns false.
In you code, HashMap doesn't have any 'knowledge' of the FrequencyRow type. So not only can it not bind the call to your overloaded equals, it doesn't even 'know' it exists!
There are languages that support dynamic binding for overloaded methods. This is often called 'double-dispatch'. Java just isn't one of them.
The other thing I would change is that you can quite comparing elements when you see two that are not the same:
public boolean equals(Object o)
{
RowFrequency that;
if (o instanceof RowFrequency) that = (RowFrequency) o;
else return false;
if (cells.length != that.cells.length) return false;
for (int i = 0 ; i < cells.length ; i++){
if(!cells[i].getText().equals(that.getCell(i).getText())) return false;
}
return true;
}
One more thing about the overloaded method binding. Java generics in 1.5 kind of middy the waters on this subject. When you define a class that extends HashMap<String, String> for example, the put(String, String) method in that derived class does override the parent's method (which erases to put(Object, Object)). There is some magic going on here that is new and only applies to generic methods. The rules did not change for non-generic methods.
I already noticed that my equals method wasn't called, but thaught that that came from the fact that the map thaught that the hashcodes were different.
I pasted tour method into my code and runned 2 testrounds with 500 en 1000 inputs and the procestime for the whole algorithm:
500 -> +/- 750 ms
1000 -> +/- 1600 ms
so that looks very good. Certianly when you start with an O(n? algorithm.
thanks guys!
>500 -> +/- 750 ms
>1000 -> +/- 1600 ms
How many columns are in each row?
These numbers seem big considering the
number of rows is so small.
There are a few improvements that could be made
assuming row is immutable:
1. compute hashCode in the constructor.
2. when comparing equals, first check hash code
if hash codes are equal, then check contents
In your code, you do:
if (contains(x)) {
get(x);
}
Instead, you can do:
Object y = get(x);
if (y != null) {
}
which should be roughly twice as fast.
> 1. compute hashCode in the constructor.[and store it in a final variable]This will make a big difference I think.
Dependent on how much entropy your data contains,
you can improve this algorithm even further. Assume
that your data is really random, it would be sufficient to
hash only a small portion of the row since the likelihood
of ending up with the same integer is really low.
You can optimize this by hashing every k number of cells
and then plot your runtime against k.
An even more cool solution would be a self optimizing
algorithm. My guess is that this T(k) function is convex
(see http://mathworld.wolfram.com/ConvexFunction.html )
which means that you can adjust k at runtime without
running the risk of an instable algorithm or ending up
in a local minima.
I think it would be interesting to see how this function
would differ dependent on the data.
parza at 2007-7-16 1:17:33 >

> which means that you can adjust k at runtime without
> running the risk of an instable algorithm or ending
> up in a local minima.
What do you mean adjust k at runtime? Do you mean changing how many rows you evaluate over the course of looping over rows? If that is the case, you cannot do that because the HashMap will not work properly.
According to the original post the input was String[][] and
the output String[][] and I agree that k must stay constant
during one request. The hashing could be something like
int result = HashCodeUtil.SEED;
for (int i = 0 ; i < cells.length ; i = i + k) {
result = HashCodeUtil.hash(result, cells[i].getText());
}
However, between requests it is possible to change k. If the
input varies in size the time must be adjusted accordingly
and it is probably a good idea to attach some drag to the
time value in order to stabilize the algorithm against
temporary variations in the input data. That is
T_{n+1}(k) = T_n(k) + 0.01*(T'(k)-T_n(k))
where T'(k) is the size adjusted time for a newly answered
request.
To determine how this function varies dependent on indata
is probably a good assignment for M.Sc. students, if any
teacher is listening. :)
parza at 2007-7-16 1:17:33 >

> However, between requests it is possible to change k.
No, it's not safe. Suppose I have two distinct rows that are equivalent i.e. row1.equals(row2) returns true.
The code calculates the hash for row1 with a k of 5. It gets a hashCode of 257. The value is modded against the number of buckets and stored n the hashtable.
Then we change k to 10. The hashcode for row2 is evaluated to be 381. The value is modded to the sie of the table, it looks in that bucket and finds it empty and inserts the value.
You cannot change the hasCode calculation during execution because equivalent Objects must always provide the same hashCode at all times of execution. Even if the hash is calculated minutes or hours apart.
>> 1. compute hashCode in the constructor.
>>[and store it in a final variable]
went from +/-1600 ms to +/-900ms
>2. when comparing equals, first check hash code
>if hash codes are equal, then check contents
Is equals not only called when 2 hashcodes are equal in a collection?
>In your code, you do:
>if (contains(x)) {
>get(x);
>}
>Instead, you can do:
>Object y = get(x);
>if (y != null) {
>}
went from +/- 900ms to +/- 850ms
>How many columns are in each row?
>These numbers seem big considering the
>number of rows is so small.
There are 3 columns in the example I use, but the procestime is of the whole algoritm, in that algorithm this piece of code is repeated mostly between 15 & 25 times, and there are also 4 other steps in the algorithm
>Assume that your data is really random
In my example I can't really assume that, because the 3 columns in my example are{sex,ZIP,date of birth}
This idea felt interesting enough to spend some time on. I
implemented it and at least for the parameters I chose it
improved it self through adaptation by more then 7 times.
As you can see below JVM and GC issues is taking up time
in the first 10 runs, but after overhead work has become
constant a request seams to take about 10 ms when k=1. The
time seams to become stable around k=11 and by then the
time is about 1.5 ms which is a significant improvement.
$ java RowUtils 100 10000 0.5 200
m=0k=1time=16.0 ms
m=5k=4time=12.53795 ms
m=10k=1time=9.4765640955 ms
m=15k=1time=10.271426332751796 ms
m=20k=1time=10.750784535226607 ms
m=25k=1time=11.091350760205959 ms
m=30k=1time=10.689441710394016 ms
m=35k=1time=10.882228435570562 ms
m=40k=1time=11.186067068920064 ms
m=45k=1time=10.762470743526608 ms
m=50k=1time=10.531151349345027 ms
m=55k=1time=10.823149560274745 ms
m=60k=1time=11.085571583846633 ms
m=65k=1time=10.776029164545598 ms
m=70k=1time=10.93335746137253 ms
m=75k=2time=9.883548247365864 ms
m=80k=5time=7.520786404587069 ms
m=85k=4time=6.120479164044618 ms
m=90k=5time=4.9326117415767055 ms
m=95k=10time=3.731677907283628 ms
m=100k=11time=2.87592848747191 ms
m=105k=10time=2.361617012567288 ms
m=110k=13time=2.066921229750858 ms
m=115k=13time=1.7300063169555842 ms
m=120k=12time=1.767671430099103 ms
m=125k=11time=1.6072023027592195 ms
m=130k=14time=1.5314468877562915 ms
m=135k=11time=1.4867140727512127 ms
m=140k=11time=1.5583997928188635 ms
m=145k=13time=1.5682394936616109 ms
m=150k=16time=1.3355397386122445 ms
m=155k=13time=1.4095428602531443 ms
m=160k=11time=1.322830963550879 ms
m=165k=11time=1.2562384556671584 ms
m=170k=11time=1.5608162456869004 ms
m=175k=9time=1.5777663849156578 ms
m=180k=9time=1.5606752726288466 ms
m=185k=9time=1.5776831417346078 ms
m=190k=9time=1.4140161183628686 ms
m=195k=10time=1.5729823777320904 ms
> No, it's not safe.
Following code which produced the print out above seams
safe to me.
import java.util.*;
public class RowUtils {
private int _k = 1;
private double _time = -1;
public int getK() {
return _k;
}
public double getTime() {
return _time;
}
public String[][] findUnique(String[][] rows) {
int k = _k;
if (Math.random()>0.5 && k>1) {
k--;
} else {
k++;
}
Set uniqueRows = new HashSet();
long t0 = System.currentTimeMillis();
for (int i=0;i<rows.length;i++) {
uniqueRows.add(new Row(rows[i], _k));
}
long t1 = System.currentTimeMillis();
double time = t1-t0;
if (_time>0) {
time = _time + 0.1*(time-_time);
}
if (time<_time) {
_k = k;
}
_time = time;
String[][] ret = new String[uniqueRows.size()][rows[0].length];
int m = 0;
for (Iterator i=uniqueRows.iterator();i.hasNext();) {
ret[m] = ((Row)i.next()).getCols();
m++;
}
return ret;
}
public class Row {
private String[] _cols;
private int _hash;
public Row(String[] cols, int k) {
_cols = cols;
_k = k;
for (int i=0;i<cols.length;i=i+k){
_hash = _hash + i*cols[i].hashCode();
}
}
public String[] getCols() {
return _cols;
}
public int hashCode() {
return _hash;
}
public boolean equals(Object row) {
String[] cols = ((Row)row)._cols;
if (_cols.length!=cols.length) {
return false;
}
for (int i=0;i<_cols.length;i++) {
if (!_cols[i].equals(cols[i])) {
return false;
}
}
return true;
}
}
public static void main(String[] args) {
if (args.length!=4) {
System.out.println("Usage: RowUtils [# rows] [# columns] "+
"[randomness] [# requests]");
return;
}
int nRows = Integer.parseInt(args[0]);
int nCols = Integer.parseInt(args[1]);
double rand = Double.parseDouble(args[2]);
int req = Integer.parseInt(args[3]);
RowUtils util = new RowUtils();
for (int m=0;m<req;m++) {
String[][] rows = new String[nRows][nCols];
for (int i=0;i<nRows;i++) {
for (int j=0;j<nCols;j++) {
if (Math.random()>rand && Math.random()>0.5) {
rows[i][j] = "A";
} else {
rows[i][j] = "B";
}
}
}
util.findUnique(rows);
if (m % 5==0) {
System.out.println("m="+m+"\t k="+util.getK()+
"\t time="+util.getTime()+" ms");
}
}
}
}
parza at 2007-7-20 18:32:02 >

After removing a bug it performed even better. _k=k must beremoved in the Row constructor. I realized that when the time is down to 1.5 ms it can not adapt any better becausethe clock is not accurate enough. Is there a way to measure time more precise in java?
parza at 2007-7-20 18:32:02 >

> > No, it's not safe.
>
> Following code which produced the print out above
> seams
> safe to me.
If you restrict k to a single value for all value added to a hashtable, it works (although you never check to see that it did indeed find the duplicates.) But as a general soluion, it is not safe. Essentially you've added k as a part of the identity:
public static void main(String[] args) throws Exception
{
String[] strings = new String[]{"A", "B", "C", "D", "E", "F"};
Row rowA = new Row(strings, 2);
Row rowB = new Row(strings, 3);
HashSet set = new HashSet();
set.add(rowA);
set.add(rowB);
System.out.println("size: " + set.size());
}
The output is "size: 2". However, both rows are identical per their equals method. Your hashCode method is not consistent with equals.
I know what you mean and in general I agree, but I made the
Row class a sub class in order to restrict access. Now it is
up to the programmer to make sure to use the Row class with
causion within the RowUtils class.
I think we discussed this before, and if I remember right we
agreed on that it would be better to have an Equality interface
like the Comparator to feed to the Map in order to make the
equality test independent of Row class.
I found another bug
uniqueRows.add(new Row(rows, _k));
should be
uniqueRows.add(new Row(rows, k));
and makes the adaptation faster.
parza at 2007-7-20 18:32:02 >

> I know what you mean and in general I agree, but I
> made the
> Row class a sub class in order to restrict access.
> Now it is
> up to the programmer to make sure to use the Row
> class with
> causion within the RowUtils class.
If Row were private, I'd be fine with it.
> I think we discussed this before, and if I remember
> right we
> agreed on that it would be better to have an Equality
> interface
> like the Comparator to feed to the Map in order to
> make the
> equality test independent of Row class.
Yes and actually, if there were such a class in the Collections framework, we wouldn't need a Row class at all. We could define the Equalator to work with String[][] directly.
> If Row were private, I'd be fine with it.
My misstake.
> Yes and actually, if there were such a class in the Collections
> framework, we wouldn't need a Row class at all. We could define
> the Equalator to work with String[][] directly.
Even better. I wonder how long it will take before they add it.
parza at 2007-7-20 18:32:02 >

> Even better. I wonder how long it will take before> they add it.I have doubts that they ever will. They deny there is a problem.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4771660Has one vote - mine. There are two related bugs that have a total of 7 votes.
I posted a question close to this issue at http://forum.java.sun.com/thread.jspa?threadID=685408
parza at 2007-7-20 18:32:02 >

> I posted a question close to this issue at> http://forum.java.sun.com/thread.jspa?threadID=685408I posted a reply in that thread.
> They deny there is a problem.
From the bugparade, it seems they considered it.
Consider the OPs question
Case 1: no Equal{insert favoured suffix}
a) define custom class that overrides hashCode() and equals(Object) based on data
b) create hash container
c) for each custom class, add it to the hash table
Case 2: use Equal{ibid}
d) define custom class that implements Equal{ibid} (requires implementing equals(Object, Object) and hashValue(Object).
(hashValue optimization would be difficult to do in this case)
e)create hash container passing in custom class
f) pass container of objects to hash container method
Case 2 actually seems like it is more work than case 1.
> (hashValue optimization would be difficult to do in
> this case)
How do you figure?
> Case 2 actually seems like it is more work than case
> 1.
You leave out the rest of the usage in case 1.
get the custom class. Extract the 'real' contents.
An Equalator can be an anonymous inner class. The Custom wrapper cannot. Also, you cannot pass your Collection of custom wrappers to other APIs with no knowledge of the Wrappers.
So sure if you take a simple myopic view of the problem and solution, it doesn't seem helpful.
If you think this isn't needed, you would agree that Comparator is not needed either, right?
> How do you figure?
The hashValues would have to be stored external of the actual object, inside the Equal{ibid}. That would require some structure, most likely hashtable/map, to store Integer objects. It seems not as nice.
>An Equalator can be an anonymous inner class. The Custom wrapper cannot.
Somewhat of an apples to oranges comparison. The argument would apply to all interfaces that don't exist.
>Also, you cannot pass your Collection of custom wrappers to other
> APIs with no knowledge of the Wrappers.
Assuming there was an Equal{ibid} interface, what other APIs would exist? What other things would I want to do with a class that implemented Equal{single} or a collection that contains an Equal{ibid} and objects?
> If you think this isn't needed, you would agree that Comparator
> is not needed either, right?
Obviously no. Comparator is very useful for objects that can't be extended and don't implemented Comparable. There are a variety of
data structures based on sorting that are non-trivial to program.
On the other hand, the OPs question involves a fairly simple algorithm.
Based solely on the OPs question, would it be worth it to account for Equal{ibid} for the entire Collections API? Would you want
to pass an Equal{ibid} to an SortedTree? Or just to a HashMap?
I could see hashtable/maps/set supporting it.
> The hashValues would have to be stored external of the ...
Usually you will have some kind of wrapper to store this value
in. For all hash maps today you have the Map.Entry.
> ... could it be worth it to account for Equal{ibid} for the
> entire Collections API?
As default the Equal{ibid} would use existing equals(...) and
hashCode() as default, so adding this functionality would only
increase flexibility, and some complexity but only in the form
of some additional methods,constructors and the Equal{ibid}
interface.
parza at 2007-7-20 18:32:02 >

> > How do you figure?
>
> The hashValues would have to be stored external of
> the actual object, inside the Equal{ibid}. That
> would require some structure, most likely
> hashtable/map, to store Integer objects. It seems
> not as nice.
But clearly not difficult and a poor argument for not having the interface.
> >An Equalator can be an anonymous inner class. The
> Custom wrapper cannot.
>
> Somewhat of an apples to oranges comparison. The
> argument would apply to all interfaces that don't
> exist.
I have no idea what you mean here but the point is that if you use a custom wrapper, you must make it a non-anonymous class or you can't retrieve the data from the wrappers. A Equalator can be anonymous because it isn't needed by anything but the Collection.
> >Also, you cannot pass your Collection of custom
> wrappers to other
> > APIs with no knowledge of the Wrappers.
>
> Assuming there was an Equal{ibid} interface, what
> other APIs would exist?
There are thousands of APIs that use Maps, Lists and Sets. Those APIs know nothing of my custom wrapper classes. They don't need to know about the Equalator to use the Collection.
> What other things would I
> want to do with a class that implemented
> Equal{single} or a collection that contains an
> Equal{ibid} and objects?
Put things in the Collections based on your criteria, not the that of the Objects in the Collection.
> > If you think this isn't needed, you would agree
> that Comparator
> > is not needed either, right?
>
> Obviously no. Comparator is very useful for objects
> that can't be extended and don't implemented
> Comparable.
It's also useful for those that do.
There are a variety of
> data structures based on sorting that are non-trivial
> to program.
> On the other hand, the OPs question involves a fairly
> simple algorithm.
>
> Based solely on the OPs question, would it be worth
> it to account for Equal{ibid} for the entire
> Collections API?
Soley on the OPs question? No. Considering only the next nanosecond, do you need water to live?
> Would you want
> to pass an Equal{ibid} to an SortedTree? Or just to
> a HashMap?
> I could see hashtable/maps/set supporting it.
It probably wouldn't make sense for SortedMaps because they generally don't use equals or hashCode.
I do not think my RFE on this issue helped, but at least it seams like they admit that there is a problem. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6355410
parza at 2007-7-20 18:32:06 >
