Hash function that returns the same value to an 'almost similar' objects

Hi all.

I'm looking for a hash function, that will return the same value to two strings, that are almost similar to each other.

what do I mean?

for example, if I have a source code (as a string) of a that function:

'void f() {a += 1;}'

and that function:

'void f() {b += 1;}'

the Hash value of that strings will be the same.

someone know something that can help me with that?

Thanks.

[449 byte] By [meirrotsteina] at [2007-11-27 8:53:15]
# 1
You could calculate the Levenshtein distance between the two Strings and base their hash code on the distance. http://en.wikipedia.org/wiki/Levenshtein_distance
prometheuzza at 2007-7-12 21:10:13 > top of Java-index,Java Essentials,Java Programming...
# 2
When it all comes down to it, there is no 'magic' going on with computers or computer programming.Object matching in hashtables is all encapsulated using equals() and hashcode(). modify these two methods to suit your purposes and you can use any type of map
tjacobs01a at 2007-7-12 21:10:13 > top of Java-index,Java Essentials,Java Programming...
# 3
> You could calculate the Levenshtein distance betweenWouldn't you have to then compare all elements in your Map? wouldn't this then effectively make the map a list?
tjacobs01a at 2007-7-12 21:10:13 > top of Java-index,Java Essentials,Java Programming...
# 4

> ...

> Wouldn't you have to then compare all elements in

> your Map? wouldn't this then effectively make the map

> a list?

Yes, you're right, the comparison based on the Levenstein distance is more suitable for Comparable/Comparator stuff.

I don't see how else a comparison between Strings can be made.

prometheuzza at 2007-7-12 21:10:13 > top of Java-index,Java Essentials,Java Programming...
# 5

You really can't use hash values for fuzzy matching. Suppose you have three strings, you can have two which are "similar" to the third, whereas the first two are not "similiar" to one another. (e.g. they differ from the 3rd string in different ways).

Hashtables depend on the properties of equals and hashCode being consistent, and, by implication if A equals B and B equals C then A equals C. This rule is violated for a fuzzy match and hash tables just aren't applicable.

malcolmmca at 2007-7-12 21:10:13 > top of Java-index,Java Essentials,Java Programming...
# 6
Thank u all!Levenstein distance sounds good, i'll check it.
meirrotsteina at 2007-7-12 21:10:13 > top of Java-index,Java Essentials,Java Programming...
# 7

> 'void f() {a += 1;}'

>

> and that function:

>

> 'void f() {b += 1;}'

>

> the Hash value of that strings will be the same.

> someone know something that can help me with that?

To make this work, you need to replace the actual text by symbolic values. For example (symbols in caps):

"void FUNCTION_NAME() { VARIABLE += 1;}"

What you translate will depend on what you're trying to match ... for example, you could also use

"TYPE_NAME FUNCTION_NAME() { VARIABLE EXPRESSION CONSTANT ; }"

To do this, you're going to need to write a parser to process the initial strings. Once you've written this, and can produce a string with actual values replaced by symbols, it's just a matter of calling hashCode() on that result.

Mind you, this is not particularly efficient.

The real question here is: what are you trying to do? It looks like something to detect copy-and-paste code. In which case you might want to look into a tool like PMD: http://en.wikipedia.org/wiki/PMD_%28software%29

kdgregorya at 2007-7-12 21:10:13 > top of Java-index,Java Essentials,Java Programming...