Best string matching algorithms

Hi,

I have implemented the Levenshtien's algorithm for string matching to find the insertions, deletions and substitutions for converting one string to another. But, this works fine for small strings about 256 characters.

Can anyone give me a suggestion as to which algorithm will work best for comparing strings(data) of more than 256 characters( about 2GB if it is a clob field)?

Thanks in advance

[425 byte] By [seemap123a] at [2007-10-3 8:53:07]
# 1

I do not think that there is any particular alternative to the standard dynamic programming alogrithm for the calculation of the Levenshtien metric between two strings.

However if you know something about the nature of the strings that you are comparing you can shortcut the process by making assumptions about what the strings represent and how they were created.

The primary example of this is the diff utility that is used to see what lines have been added, removed, moved around or changed between an old version of a file and a new version.

The assumption that is made is that lines are significant in many text files (particularly programming source code) and thus if you consider each file as a sequence of lines instead of being a sequence of characters you can process the file at a higher scale than the character scale. Typically what is done is that you stuff each line from one file into a hash table so that you can then hash each line in the second file and look for matches. This allows you to quickly detect large portions that remain unchanged between the two files. This method also deals with moving large chunks of lines (paragraphs or modules) into a different order without otherwise changing them.

Once you have done alignment at the grand scale, you can then take individual lines that appear to be similar (for instance they substantially match on character histograms across the line) and run the full Levenshtien metric on those individual lines.

Basically if your data has some bulk statistics at some higher levels of organization you might find that you can do this same sort chunking that was done in diff.

So what are you comparing that is 2G long - base pairs for DNA sequences?

marlin314a at 2007-7-15 4:02:54 > top of Java-index,Other Topics,Algorithms...
# 2

Thanks a lot for the prompt reply.

I'll answer the questions you have asked me.

Q : So what are you comparing that is 2G long - base pairs for DNA sequences?

A: No, it is not a particular kind of data. It is for part of an application where data from two files are compared line by line. When the user selects one line from the test data and another from source, I want to compare each character.

Q: However if you know something about the nature of the strings that you are comparing you can shortcut the process by making assumptions about what the strings represent and how they were created.

A: The kind of data is unknown. It can be a string field or a integer field or a clob field in the database.

Q: Once you have done alignment at the grand scale, you can then take individual lines that appear to be similar (for instance they substantially match on character histograms across the line) and run the full Levenshtien metric on those individual lines.

A: Yes, I'm applying the Levenshtien metric on those individual lines. But, the problem is when the data is too big, it takes more time to compute and slows down. Hence, I thought depending on the size of data, i.e if

the size is within 256 characters, I can apply the Levenshtien algorithm. For size more than 256 I need an alternative that can be approximate but fast.

Any further help is appeciated.

seemap123a at 2007-7-15 4:02:54 > top of Java-index,Other Topics,Algorithms...
# 3
A suffix tree can be used to do fast matching. It takes in a single word and creates a tree containing paths that represent all the substrings of the word. A suffix tree can be created in O(N) time.
rkippena at 2007-7-15 4:02:55 > top of Java-index,Other Topics,Algorithms...
# 4

> A suffix tree can be used to do fast matching. It

> takes in a single word and creates a tree containing

> paths that represent all the substrings of the word.

> A suffix tree can be created in O(N) time.

That might be good enough but it won't handle the kinds of things that levenstein can, right? I mean, an insertion in the middle will throw things off, will it not?

dubwaia at 2007-7-15 4:02:55 > top of Java-index,Other Topics,Algorithms...
# 5

> if the size is within 256 characters, I can apply the

> Levenshtien algorithm. For size more than 256 I need

> an alternative that can be approximate but fast.

Kind of off the cuff here but I have a little familiarity with the algorithm and the size issue is related to the matrix that is created, right? Could you modify the algorithm to take sections that are exactly the same and replace them with a single character marker? I'm thinking something outside of the character set like 0x0000. You could do this by taking sections and running levelstein to find duplicate areas. If you hit too many differences in a section, you could just move on to another section. If the two strings are very different, you could just mark them as substantially changed.

Another idea would be to take multiple character sets and create a string of the hashes for each of them. It would be a very rough approximation and would give false negatives but could eliminate positive matching sections quickly allowing a more detailed look at the negative matching areas.

dubwaia at 2007-7-15 4:02:55 > top of Java-index,Other Topics,Algorithms...
# 6

Let me perhaps restate what I said before and ask it as a direct questions. What are you trying to do?

There is nothing magic aboug Levenschtien. It is a metric, a minimum edit metric composed of a few atomic elements like insertion and deletion of character elements.

When you are doing a diff between two program files, you do NOT use Levenschtien because in the way that people do actual edits, we think of grabing a large paragraph from the front and moving it to the back as a single step. Not a whole bunch of character deletes followed by a whole bunch of character insertions.

You get better information (in terms of knowing how a source file got shuffled around) by looking at the output of diff than you get by computing a Levenschtien metric between two files.

My point is that Levenschtien is merly one of many possible edit distance metrics that one could create. It was NOT optimal for computing the differences between two source files. It happens to be the one edit metric that has a name attached to it for historical reasons. That does not mean that it is optimal for your purpose, or in fact optimal for any purpose other than teaching dynamic programming.

When you say that you have a clob that you are comparing to another clob (random clobs pulled from a random database with no a priori reason to suppose them to be even in the same language) and your application needs to return the Levenschtien distance between them? I ask WHY?

If that really is what you need, I stick by what I said which is that I don't know of any method that will shortcut the monster sized N squared standard dynamic programming solution for Levenschtien.

If you say an approximation will do - now you're talking.

But a presuposition for having an acceptable approximation is a definition of acceptable. What is acceptable? What are you trying to do?

marlin314a at 2007-7-15 4:02:55 > top of Java-index,Other Topics,Algorithms...
# 7

Thanks for the replies.

Q: Let me perhaps restate what I said before and ask it as a direct questions. What are you trying to do?

A: I would like to find the character element difference between two datas. I also would like to know the insertions , deletions and substitutions. So when I compute the changes, I color the respective characters in the GUI. The data can be a string, integer or a clob field.

eg : Base string = "Hello world"

Test string = "Hello world Application"

I have done the above said by implementing the Levenshtien's algorithm. It works fine.

The problem I am having is when the string is too big for example 2K, then Levenshtien's is not the right choice as it takes too much of time to compute and also memory because of the matrix. Hence I would like to know what would be a better alternative.

Q: When you say that you have a clob that you are comparing to another clob (random clobs pulled from a random database with no a priori reason to suppose them to be even in the same language) and your application needs to return the Levenschtien distance between them? I ask WHY?

A: I'm not saying that when two clobs are compared I have to use Levenschtien distance. But I'm asking what would be an other alternative other than Levenschtien distance.

Q: If you say an approximation will do - now you're talking.

But a presuposition for having an acceptable approximation is a definition of acceptable. What is acceptable? What are you trying to do?

A: Yes, an approximation will do. By approximation I mean to say now I can just compare the two strings and whenever a difference is found, I make a note of the position at which the difference is found. I do not need to know if it is an insertion/deletion/substitution.

seemap123a at 2007-7-15 4:02:55 > top of Java-index,Other Topics,Algorithms...
# 8

> A: Yes, an approximation will do. By approximation I

> mean to say now I can just compare the two strings

> and whenever a difference is found, I make a note of

> the position at which the difference is found. I do

> not need to know if it is an

> insertion/deletion/substitution.

Why can't you just do a simple char by char compare? I'm guessing this isn't acceptable but I think answering this question will help you define what it is you want to do.

dubwaia at 2007-7-15 4:02:55 > top of Java-index,Other Topics,Algorithms...
# 9

try block matching

attempt to match the first x bytes.

if they match then ignore them

if the do not match then recursively split the blok until you find out exactly were they start to differ.

when you find out where they start to differ you start searching though both files until you find out where they start to match again. This can be done fairly efficiently using running hashcodes over blocks of data (takes a lot of memory to store the hash for each set of x data blocks)

when you find two hashes that match then you check those blocks of data or equality.

This is failry efficient if you add huristics to limit the maximum difference it will handle (the max block size you are interested in). It is limited to noticing changes that at least the size of the block used in the unning hash algorithm.

matfuda at 2007-7-15 4:02:55 > top of Java-index,Other Topics,Algorithms...
# 10

Check out a sourceforge project called SecondString, it provides an API for String Matching with many algos like JaroWinkler, and TFIDF, and other stuff like machine learning and training, etc. Although the documentation is a little poor, it may suit your needs.

Message was edited by:

mping

mpinga at 2007-7-15 4:02:55 > top of Java-index,Other Topics,Algorithms...