Any better algorithms for regular pattern match search?

Hi, guys

Can you tell me what is the best way to do regular expression pattern serach. I am developping a message filter. Given all word patterns(patterns are all single word), I need find if each messages matchs one of the patterns. Once find one pattern match, program will stop the process for the message and move to next meeage. The complexity is O(nm). n - number of patterns, m - number of words of messages. It is based on linear search for each message. Any better way?

Thanks

[504 byte] By [rcd27a] at [2007-10-2 23:51:51]
# 1
This does not look to me like a candidate for a regex pattern search since your description implies your patterns are fixed length words. I suspect the Knuth-Morris-Pratt algorithm coud be usefull for this.
sabre150a at 2007-7-14 16:37:29 > top of Java-index,Other Topics,Algorithms...
# 2

There are a number of ways of doing this.

If your "patterns" are just words (i.e. no character classes, dot matches, character repitition, grouping etc.) then you can put all of your match words into a hash map. Tokenize each mesasge and then for each token see if it exists in the map. This takes O(m) time where m is the number of words in the messages.

If you need to support real regular expressions then you can OR ("|") the regular expressions together. running this regexp with take O(c) where c is the number of characters in your message. Unfortuantely for large numbers of patterns it can take an unreasonably large time to generate the regexp state machine. And you can't trivially figure out which pattern was actually matched.

If you want something more specific then this then you will need to descibe your problem in more detail.

matfud

matfuda at 2007-7-14 16:37:29 > top of Java-index,Other Topics,Algorithms...
# 3
Patricia trees are the other standard way of matching a word against a dictionary.
marlin314a at 2007-7-14 16:37:29 > top of Java-index,Other Topics,Algorithms...
# 4

Thanks guys.

The problem is: I have number of key words (just word, only contains a-z and A-Z), around 200 key words, I need find if each message contains such words or not. If find any one, will return and mark the message to next process by manually.

matfud's way takes O(m) time where m is the number of words in the messages.

How about other suggestion: Patricia trees and Knuth-Morris-Pratt algorithm? what is the time complexity?

rcd27a at 2007-7-14 16:37:29 > top of Java-index,Other Topics,Algorithms...
# 5

Patricia has the same basic characteristics as hashing. It may be viewed as a hash algorithm that has been tuned to discriminate only those words in the dictionary. Being a binary tree since 200 is about 2^8 you would expect to find a hit or a miss in about 10 operations on the average per word discrimination.

There is not much way around the O(m) since you have already said you must test each word in the message.

Don't worry about time complexity, in any real problem you're going to be IO bound anyway. Just hash and be done with it.

marlin314a at 2007-7-14 16:37:29 > top of Java-index,Other Topics,Algorithms...