Soundex

Hey Folks,

I am trying to find a better implementation of the Soundex algorithm than org.apache.commons.codec.language.RefinedSoundex or org.apache.commons.codec.language.DoubleMetaphone. Does anyone know whether other implementations are available? We are getting lots of false positives.

[318 byte] By [rhajikhani1a] at [2007-10-2 23:05:57]
# 1

Define "better".

Soundex is supposed to give false positives. Its purpose is to take anglicized names that may have come from different language groups, and thus have no canonical spelling, and mash them into the same group because they might sound the same and perhaps should be considered the same.

If your soundex is not mashing Tchaikovski, Chucoughsky, Tchkovsky, chewkafski all together it is not working.

marlin314a at 2007-7-14 6:19:55
# 2

Well, better means not confusing "CA" with "KKK" or "Hell" with "A.. Hole". Those are the kinds of false positives that we are trying to avoid.

Better means that if a non-English/Western name is provided, the functionality would not mistake it for some curse word. After all, there are consumers in the US with non-Western names.

Thanks in advance.

-r

rhajikhani1a at 2007-7-14 6:19:55
# 3

> Well, better means not confusing "CA" with "KKK" or "Hell" with "A.. Hole". That's the kind of false positives that we are trying to avoid.

> Better means that if a non-english/western name is provided, the functionality would not mistake it with some curse word. After all, there are consumers in the US with non-western names.

> Thanks in advance.

The basic algorithm for Soundex is pretty simple. If you are having problems with the implementations you can find, I would either look at the source and try to expand upon it, or write your own.
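To give an idea of how small the basic algorithm is, here is a minimal sketch of the classic American Soundex rules (keep the first letter, code the remaining consonants as digits, collapse repeats, pad to four characters). The class name SimpleSoundex and the method name encode are made up for this illustration; this is not the Commons Codec source, and it assumes plain A-Z input, silently skipping everything else.

```java
final class SimpleSoundex {
    // Digit for each letter A..Z; '0' marks letters that carry no code
    // (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    // Hypothetical helper: returns a 4-character code, or "" if the
    // input contains no A-Z letters.
    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue; // assumption: ignore anything that is not A-Z
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);   // the first letter is kept verbatim
                last = code;     // but its code still suppresses a repeat
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate equal codes
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets 'last', so a repeat after it counts
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

For instance, encode("Robert") and encode("Rupert") both give R163. Because this classic variant keeps the first letter verbatim, "CA" encodes to C000 and "KKK" to K000; other encoders may not keep the first letter, so which inputs collide depends on the encoder you pick.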

As far as curse words go, why are you looking up curse words? Can't you filter those out of the search entirely? Well, everything but rick with a 'd' instead of an 'r', of course. I mean, if you are looking for rick, you should find rick, right? With a 'd', of course.

dubwaia at 2007-7-14 6:19:55
# 4

Sure. I can find rick spelled with a "d". But how about "ri*ck" or "_rick" spelled with a "d"? If I did a simple lookup, I would miss the "*" or "any other character in the middle of the name/word" issue!

As to your first point, I was hoping to avoid reinventing the wheel.

-r

rhajikhani1a at 2007-7-14 6:19:55
# 5

> Sure. I can find rick spelled with a "d". But how about "ri*ck" or "_rick" spelled with a "d"? If I did a simple look up, then I would miss the "*" or "any other character in the middle of the name/word" issue!

I think you missed my point. I was asking why you are not filtering curse words out of the search if you don't want them.

> As to your first point, I was hoping of avoiding to re-inventing the wheel.

You aren't really reinventing it. You are making a new wheel based on an existing design, which is something people do all the time.

dubwaia at 2007-7-14 6:19:55
# 6

>> Sure. I can find rick spelled with a "d". But how about "ri*ck" or "_rick" spelled with a "d"? If I did a simple look up, then I would miss the "*" or "any other character in the middle of the name/word" issue!

> I think you missed my point. I was asking why you are not filtering curse words out of the search if you don't want them.

I am sorry but I still don't get it! What do you mean exactly?

-r

rhajikhani1a at 2007-7-14 6:19:55
# 7

>> I think you missed my point. I was asking why you are not filtering curse words out of the search if you don't want them.

> I am sorry but I still don't get it! What do you mean exactly?

> -r

From what you said before, it sounds to me like you don't want to match curse words when the name 'sounds like' one. The simplest approach is to filter out curse words before even running them through the Soundex algorithm. It doesn't make sense to make the Soundex algorithm not match things that should match per its algorithm.
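One possible reading of that, as a rough sketch: check the input against an exact block list first, after stripping punctuation so that "ri*ck"-style spellings are still caught, and only hand whatever survives to the phonetic matcher. The class name BlockListFilter, its methods, and the placeholder entries are all invented for illustration; a real list and normalization rule would depend on your data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class BlockListFilter {
    // Hypothetical placeholder entries; substitute your own list.
    private static final Set<String> BLOCKED =
            new HashSet<>(Arrays.asList("darn", "heck"));

    // Lower-case the input and drop everything that is not a-z, so
    // "d*arn" and "_Darn" both normalize to "darn".
    static String normalize(String input) {
        return input.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    // Exact match on the normalized form; no phonetic coding involved.
    static boolean isBlocked(String input) {
        return BLOCKED.contains(normalize(input));
    }
}
```

With this in front, isBlocked("d*arn") is true while isBlocked("CA") is false, so a short legitimate value never reaches the sound-alike comparison against the list.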

dubwaia at 2007-7-14 6:19:55
# 8

But wouldn't that defeat the very purpose of using Soundex?!

I mean, I want to catch curse words in the "user provided" information. At the same time, I would like to make sure "CA" is not mistaken for "KKK". If I were to remove "A..hole" from my spoof table, which is what I am using in conjunction with Soundex, then when a user does put the word "A..hole" in a text box, I wouldn't be able to catch it.

I am of course just ranting at this point.

Thanks in advance.

-r

rhajikhani1a at 2007-7-14 6:19:55
# 9

> But wouldn't that defeat the very purpose of using Soundex?!

> I mean I want to catch curse words in the "user provided" information. At the same time, I would like to make sure "CA" is not mistaken for "KKK". If I were to remove "A..hole" from my spoof table, which is what I am using in conjuction with Soundex, then when a user does put the word "A..hole" in a text box, I wouldn't be able to catch it.

I don't really have enough context to answer your question here, but I wonder if you really want to use Soundex. Maybe you want something a little more sophisticated, like a string distance algorithm (e.g. Levenshtein).
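For reference, a minimal sketch of Levenshtein distance using the usual two-row dynamic programming table. The class and method names are made up here, and any cutoff you would compare the result against is something you would have to tune yourself.

```java
final class EditDistance {
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Here levenshtein("kitten", "sitting") is 3, and levenshtein("CA", "KKK") is also 3, which is the maximum possible for strings that short, so a distance cutoff would keep those two apart.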

dubwaia at 2007-7-14 6:19:55
# 10

Apparently you are NOT looking for a better Soundex algorithm.

As I said, the purpose of the Soundex algorithm as originally developed was approximate name matching in immigration and census-type data. It works well for that purpose. Soundex is supposed to and does in fact match KKK to CA because they do in fact match in sound. You don't appear to want that.

You are trying to do some other thing, which you have not yet specified, that either allows users to type KKK into some state field, or else validates states against a list of swear words, or something. You have not yet told us what you are trying to do.

You apparently have grabbed some random component (Soundex) off the shelf, have recruited it to do something for which it was never intended, and are now looking for something else. But all you have told us is that Soundex doesn't work for you.

Now if you want help, you need to explain what you are trying to do.

marlin314a at 2007-7-14 6:19:55