Efficient Search on Large Strings
The problem I am trying to solve right now is how to optimize String comparisons under the following circumstances:
I have a list of relatively large strings (between 50 and 200 characters, on average). The list can be as large as, say, 50 strings. MOST of those 50 strings are identical, but do not necessarily have the same identity (hence I can't optimize with == without using .intern()). However, *sometimes* there will be a few that are different. I need to do a search to find the *first* string that does not match the top of the list. Furthremore, I'm doing this activity in a loop of several lists, so my complexity is O(n^3) (when you consider that the complexity of matching strings is O(n)). Also, I do not create the list, so I can't make the entries be interned so I can use ==.
If I understand correctly, determining that strings are unequal is approximately constant time, because we evaluate the hash function first. And if the strings had the same identity then comparison would be constant time. But identical strings with different identities, the comparison is O(n) on the length of the String. Does anybody know of anyway to optimize this in any manner without losing a LOT of accuracy? I.e. can any guranatees be made about the hashcode such that if two strings have different hashcodes, they are likely different with some reasonably high percent confidence?
Two identical strings must always have the same hash code.
Two identical hashcodes can be produced from different strings.
Thus, if the hash codes are different, then you know the strings are different. If the has codes are the same, then the strings might be identical.
> determining that strings are unequal is approximately constant time,
The string hashCode is an O(n) operation. If the length of the strings are bounded by a constant (e.g. 200) then you can only say for your specific problem the hashCode is an O(1) operation.
> if two strings have different hashcodes, they are likely different
> with some reasonably high percent confidence
If the hashcodes are different, then the strings are different.
Given a list of strings, the Knuth, Morris Pratt (KMP) algorithm will allow you to process each character of each string just once. The first string in your list will be used to prime a search object and then all the rest are searched using that search object.
If you have N lists that need processing and cross processing then you will have approx. N^2 searches. You will only need one search object per list.
> Given a list of strings, the Knuth, Morris Pratt
> (KMP) algorithm will allow you to process each
> character of each string just once.
Since you require the whole string to match, the KMP algorithm is an overkill. You just need to check first for equality of length and then check for equality.
Message was edited by:
sabre150
You might also fill a new list with String.intern(). This would add just a linear complexity:+ O(n)Well, just a thought. Don't know about the complexity of String.intern(), though.-Puce
Pucea at 2007-7-11 15:38:57 >

> Two identical strings must always have the same hash
> code.
>
> Two identical hashcodes can be produced from
> different strings.
>
> Thus, if the hash codes are different, then you know
> the strings are different. If the has codes are the
> same, then the strings might be identical.
>
> > determining that strings are unequal is
> approximately constant time,
>
> The string hashCode is an O(n) operation. If the
> length of the strings are bounded by a constant (e.g.
> 200) then you can only say for your specific problem
> the hashCode is an O(1) operation.
>
> > if two strings have different hashcodes, they are
> likely different
> > with some reasonably high percent confidence
>
> If the hashcodes are different, then the strings are
> different.
If I'm not mistaken, hashCode is only O(n) the first time it is called, since it caches the hashCode thereafter. Although, granted, since I'm not the one creating the list, I have no idea whether the hash code has already been cached or not.
> Given a list of strings, the Knuth, Morris Pratt
> (KMP) algorithm will allow you to process each
> character of each string just once. The first string
> in your list will be used to prime a search object
> and then all the rest are searched using that search
> object.
>
> If you have N lists that need processing and cross
> processing then you will have approx. N^2 searches.
> You will only need one search object per list.
Hmm... I'm not familiar with that algorithm. I'll have to look into that. Thank you.
> > Given a list of strings, the Knuth, Morris Pratt
> > (KMP) algorithm will allow you to process each
> > character of each string just once.
>
> Since you require the whole string to match, the KMP
> algorithm is an overkill. You just need to check
> first for equality of length and then check for
> equality.
>
> Message was edited by:
> sabre150
Yes, that sounds like a reasonable optimization, but I'm not sure that it solves the problem. In this case, I expect the strings to be identical 99% of the time, and I need to optimize a check to find the 1% of the time when they are NOT identical. So, your strategy would make the 1% case faster, while making the 99% case marginally slower (not significantly, but the point is, I need to optimize the 99% case, not the 1% case).
> You might also fill a new list with String.intern().
> This would add just a linear complexity:
> + O(n)
>
> Well, just a thought. Don't know about the complexity
> of String.intern(), though.
>
> -Puce
I don't know about the complexity of intern, either, for sure, but I suspect that it is at least O(n), because it needs to check to see if the string has already been interned. Realistically, calling String.intern() on every string to be evaluated immediately before evaluating them is basically useless, because the complexity of String.intern is very probably the same as the complexity of String.equals (because the contract for intern requires that equal strings return the same identity through the intern method).
What is the reason that you don't call intern?
kajbja at 2007-7-11 15:38:57 >

> > You might also fill a new list with
> String.intern().
> > This would add just a linear complexity:
> > + O(n)
> >
> > Well, just a thought. Don't know about the
> complexity
> > of String.intern(), though.
> >
> > -Puce
>
> I don't know about the complexity of intern, either,
> for sure, but I suspect that it is at least O(n),
> because it needs to check to see if the string has
> already been interned. Realistically, calling
> String.intern() on every string to be evaluated
> immediately before evaluating them is basically
> useless, because the complexity of String.intern is
> very probably the same as the complexity of
> String.equals (because the contract for intern
> requires that equal strings return the same identity
> through the intern method).
Ah, so the lists of strings changes frequently? Otherwise you only have to call intern once, and once only. (You call intern when you place the string in the list)
Kaj
Message was edited by:
kajbj
kajbja at 2007-7-11 15:38:57 >

> > Given a list of strings, the Knuth, Morris Pratt
> > (KMP) algorithm will allow you to process each
> > character of each string just once.
>
> Since you require the whole string to match, the KMP
> algorithm is an overkill.
Thanks for the suggestion sabre, but I agree, KMP is not what I'm looking for. The absolute BEST case for KMP is still O(n) on the length of the "key" string, and string equality comparison is already O(n).
> > > You might also fill a new list with
> > String.intern().
> > > This would add just a linear complexity:
> > > + O(n)
> > >
> > > Well, just a thought. Don't know about the
> > complexity
> > > of String.intern(), though.
> > >
> > > -Puce
> >
> > I don't know about the complexity of intern,
> either,
> > for sure, but I suspect that it is at least O(n),
> > because it needs to check to see if the string has
> > already been interned. Realistically, calling
> > String.intern() on every string to be evaluated
> > immediately before evaluating them is basically
> > useless, because the complexity of String.intern
> is
> > very probably the same as the complexity of
> > String.equals (because the contract for intern
> > requires that equal strings return the same
> identity
> > through the intern method).
>
> Ah, so the lists of strings changes frequently?
> Otherwise you only have to call intern once, and once
> only. (You call intern when you place the string in
> the list)
>
> Kaj
>
> Message was edited by:
> kajbj
Yes, I suppose that would work if I was the one placing the Strings in the list, but I'm not. The list is generated by another library.
- Adam
> Yes, I suppose that would work if I was the one
> placing the Strings in the list, but I'm not.
> The list is generated by another library.
But do you get the list frequently? You still only have to create a duplicate list with interned strings when you get the list.
Kaj
kajbja at 2007-7-11 15:38:57 >

I would imagine pretty much any way to search through 50 Strings of 50-200 characters in length would be plenty fast enough. Unless of course your requirement is to do this millions of times per minute.
Is the speed of this operation really an issue in your application? I.e. have you profiled the application and concluded that significant time is spent on precisely this operation?
Otherwise, this seems like an exercise in futility.
> > Yes, I suppose that would work if I was the one
> > placing the Strings in the list, but I'm not.
> > The list is generated by another library.
>
> But do you get the list frequently? You still only
> have to create a duplicate list with interned strings
> when you get the list.
>
> Kaj
Once I get the list, I search it once and discard it, so interning it before searching it doesn't gain me anything.
- Adam
> I would imagine pretty much any way to search through
> 50 Strings of 50-200 characters in length would be
> plenty fast enough. Unless of course your requirement
> is to do this millions of times per minute.
>
> Is the speed of this operation really an issue in
> your application? I.e. have you profiled the
> application and concluded that significant time is
> spent on precisely this operation?
>
> Otherwise, this seems like an exercise in futility.
I see where you're coming from. I have indeed profiled it. The time spent inside the String.equals method alone is over 11 seconds for my test case, which is large, but not even my worst case. And while that might not seem like a long time, it is occuring in a GUI where we want response times to events to be closer to about 15% of that, if possible.
The problem is, while searching a list of size 50, performing string comparison of size 200 seems pretty trivial, doing so inside a loop of size roughly 100, inside another loop of size roughly 15 (in my test case, although it can get as large as 40 in extreme cases), then it adds up.
- Adam
>
> I see where you're coming from. I have indeed
> profiled it. The time spent inside the String.equals
> method alone is over 11 seconds for my test case,
> which is large, but not even my worst case. And
> while that might not seem like a long time, it is
> occuring in a GUI where we want response times to
> events to be closer to about 15% of that, if
> possible.
Also, note that the 11 seconds occurs on my development machine, where it isn't competing with other threads and processes for processing time (at least, not with as many other threads or processes).
- Adam
I've just tried a simple test case with a list of 50 Strings of ~200 characters (created from StringBuilders so they wouldn't get interned). The non-matching String was then inserted at index 45 of 50.
Ran the loop 4000 times in 0.3 of a second (while creating the array *inside* the loop). 40000 times in 1.9 seconds. At 400000 times I got to 18.2 seconds which is in your ballpark. Granted, this is on a pretty fast, but by no means "far out" E6600. Could it be you were a few orders of magnitude off in your previous posts?
If not, my guess is still that your problems are elsewhere. Are you sure all those String comparisons you see in profiling happen in this part of your code?
Here's my test-code, perhaps you can point out obvious deviations from your actual case:import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
public class StringComparisonTest {
private List<String> testList;
private static SimpleDateFormat FORMAT = new SimpleDateFormat(
"HH:mm:ss,SSS");
public StringComparisonTest() {
}
private void createList() {
testList = new ArrayList<String>();
for (int index = 0; index < 50; index++) {
testList
.add(new StringBuilder(
"The Quick brown Fox jumps over the lazy dog and Cozy lummox gives smart squid who asks for job pen. The Quick brown Fox jumps over the lazy dog and Cozy lummox gives smart squid who asks for job pen.")
.toString());
}
testList
.set(
45,
"The Quick brown Fox jumps over the lazy dog and Cozy lummox gives smart squire who asks for job pen. The Quick brown Fox jumps over the lazy dog and Cozy lummox gives smart squire who asks for job pen.");
}
public String findNonMatching() {
String result = null;
String first = testList.get(0);
for (int index = 1; index < testList.size(); index++) {
String test = testList.get(index);
if (!test.equals(first)) {
result = test;
break;
}
}
return result;
}
public static void main(String[] args) {
System.out.println(FORMAT.format(new Date()));
StringComparisonTest test = new StringComparisonTest();
for (int outer = 0; outer < 40; outer++) {
for (int inner = 0; inner < 1000; inner++) {
test.createList();
test.findNonMatching();
}
}
System.out.println(FORMAT.format(new Date()));
}
}
BTW, if the List you're getting is a LinkedList instead of an ArrayList, double those times if using list.get(<n>). Obviously, I should have used an iterator.
> Could it be you were a few orders of magnitude off in your
> previous posts?
>
> If not, my guess is still that your problems are
> elsewhere. Are you sure all those String comparisons
> you see in profiling happen in this part of your
> code?
No, I'm fairly certain that the numbers I provided are correct, and that the String comparisons are happening in that part of my code. I'm not sure why I'm my numbers seem to be off by an order of magnitude to yours. That does seem odd, even considering we're running on different machines.
I'm still left with the question of optimizing a string comparison with a bias towards Strings that ARE equal, since the method used in String.equals is optimized for Strings that are not equal (which is indeed the best algorithm in most applications, but not mine).
But how long does it take to run Herko_ter_Horst's code on your machine?Mine took 30 seconds creating the list inside the loop. Creating the list once it took 4 seconds. 1GHz PC laptop.
ejpa at 2007-7-21 19:33:59 >

I have found that trying to outperform the implementations used in most of the core classes is usually very very hard to do. Significant time has already been spent on making those methods as fast as they can be.
The problem in this case is that there is no "other" way to see if two Strings are in fact equal, other than using equals(). hashCode() doesn't help because it can only determine whether two Strings are NOT equal, because while unequal hashCodes guarantee unequal Strings, equal hashCodes do not guaranteed equal Strings. Of course, you could just assume that equal hashCodes come from equal Strings (and in most cases, you'd be right). However, calculating hashCodes isn't exactly cheap either.
In fact, I just tried the hashCode shortcut in the code above:public String findNonMatching() {
String result = null;
String first = null;
for (String test:testList) {
if(first == null) {
first = test;
}
else {
if(test.length() == first.length() && test.hashCode() == first.hashCode()) {
continue;
}
else {
result = test;
break;
}
}
}
return result;
While this gained me approx. 10% in execution time, I doubt the loss of certainty about the results is worth it.
As another suggestion, is there perhaps some way to take the comparison out of the (inner) loop and/or somehow cache the results?
PS, please note that the code I posted previously runs the loop 40000 times, not 4000.
> But how long does it take to run Herko_ter_Horst's> code on your machine?About 3 seconds in both cases (64 bit Windows, two dual core 2Ghz processors, 4 GB of ram):)
kajbja at 2007-7-21 19:33:59 >

> Of course, you could just assume that> equal hashCodes come from equal Strings (and in most> cases, you'd be right)I'm not sure it is particularly difficult to find unequal strings that have the same hash codes. "ABCDEa123abc" and "ABCDFB123abc" for
YoGeea at 2007-7-21 19:33:59 >

> I'm not sure it is particularly difficult to find> unequal strings that have the same hash codes.> "ABCDEa123abc" and "ABCDFB123abc" for example.How did you come up with those two values?
It's actually pretty easy. The hashcode works by multiplying each character by 31 to the power of its position, and summing.So if you add one to the character on the left, you subtract 31 from the character on the right, and the hash codes are equal.- Adam
> I have found that trying to outperform the
> implementations used in most of the core classes is
> usually very very hard to do. Significant time has
> already been spent on making those methods as fast as
> they can be.
I recognize that. However, I also recognize that in the vast majority of string comparisons, you expect strings to be unequal more often than they are equal, to some large degree. So it makes sense to optimize string comparison for quickly determining that strings are not equal. Which is exactly what the String.equals method has done. My problem is exactly the opposite.
> The problem in this case is that there is no "other"
> way to see if two Strings are in fact equal, other
> than using equals().
You are probably correct. However, I had thought that perhaps somebody might know of a more efficient way that leveraged some of the other data members of a String to make certain guarantees about that String. Its not likely, but its possible.
> I recognize that. However, I also recognize that in
> the vast majority of string comparisons, you expect
> strings to be unequal more often than they are equal,
> to some large degree. So it makes sense to optimize
> string comparison for quickly determining that
> strings are not equal. Which is exactly what the
> String.equals method has done. My problem is exactly
> the opposite.
As others have said. You can't determine if two strings are equal (I'm not talking about the String class implementation) unless you compare character by character or if you have something else to compare. That something else has to be able to uniquely identify the contens of a String and that identifier would in that case have to be about as large as the string (unless you can use some compression)
Kaj
kajbja at 2007-7-21 19:33:59 >

> The problem I am trying to solve right now is how to
> optimize String comparisons under the following
> circumstances:
>
> I have a list of relatively large strings (between 50
> and 200 characters, on average). The list can be as
> large as, say, 50 strings. MOST of those 50 strings
> are identical, but do not necessarily have the same
> identity (hence I can't optimize with == without
> using .intern()). However, *sometimes* there will be
> a few that are different. I need to do a search to
> find the *first* string that does not match the top
> of the list. Furthremore, I'm doing this activity in
> a loop of several lists, so my complexity is O(n^3)
> (when you consider that the complexity of matching
> strings is O(n)). Also, I do not create the list, so
> I can't make the entries be interned so I can use ==.
>
Presumably you are doing a lot of these or are simply suggesting this as a theorectical example because the above numbers are trivially small.
Optimizations always require that one address the data first and not just the algorithm.
So simply this array is created at some point. At that time there is an insertion order. The insertion order defines the first element. Every element after tha it compare to it. If a mismatch is found then a flag is set. The array, the flag, and the insertion method are put into a class.
Thus during creation the problem is solved.
