preprocessing for string matching algorithm
hi i have a problem in which i have to match an array of strings with a given text(which might be very large). i have used both knuth morris pratt and boyor moore algorithm and am happy with the outcome.
but before checking the string i am supposed to do some preprocessing like
if i am given 2 words say:computer andcomputing.
oncecomputer string has been checked, in the usual algorithms to check the wordcomputing i have to start from the start of the text. but i want to do it such that i check only relevent words forcomputing, since i have alrerady checked forcomputer i know that only words which were similar to computer will work forcomputing
i need to do this kind of preprocessing. any ideas? or does anyone know if someone has worked on this before
please help and thanks in advance
Message was edited by:
av86
[928 byte] By [
av86a] at [2007-10-3 2:21:46]

Just rebuild the KPM state machine so that it works on a dictionary.
You want to build a FSA, a state graph, i.e. transition diagram where when you are in a given state and you get the next letter from the large text you transition to the next state. Some states are accepting states meaning that you have completed a match.
Every prefix for a string in your dictionary will essentially be a state, thus if the word "and" is in your dictionary of search strings there will be at least three states in your graph: "a" "an" "and"
Clearly, any prefix that is the entire string is an accepting state.
If you are in state "an" and you see the character "d" you transition to the state "and"
i.e. if the letter concatinates to the current state and yields another prefix state, then that is the state that you transition too.
If the character you just saw does not lead to a prefix like say you adjoin "o" and "ano" is not one of your prefix states, then you need to find the longest substring of your new string that is a prefix state. i.e. you looked at "ano" and it was not a prefix so you next look at "no" to see if that is in your prefix set and if not, you look at "o" to see if that is in your prefix set.
This building of the state graph has nothing to do with the long input string. It is depends only on the dictionary. Thus if you dictionary is known in advance the FSA can be built in preprocess.
To find your matches just march across your long text one character at a time and follow your state transitions.