Need sample code to do text search using boolean operators AND OR + -
I'm looking for an algorithm doing text searches in files
I need it to support AND OR + - keywords (for example "ejb AND peristence")
Does anyone knows where I cand find this kind of algorithm with the full source ?
Of course I can adapt C,C++ to Java.
In fact my target language is Serverside javascript (sorry) so I prefere rather low level solutions !
Any help will be grealy appreciated and the resulting code will be posted
here and on my website : http://www.tips4dev.com
[522 byte] By [
caenevetc] at [2007-9-27 18:44:35]

Firstly, a little note to the technical solution: what you probably need the most is speed. I may sound strange, but personally I am convinced that if you could use system tools a naive algorithm:for i:=1 to m do
grep (word[i])
od;
whose complexity is O(m.n), where m is the number of words to be processed and n somehow represents the cardinality of the text to-be-sought-through, so this naive algorithm would actually be in 99% of cases much faster than any implementation of the algorithm below, whose complexity is O(m+n), because the implementation of the grep routine (O(n)) would be optimized and m will be low (who queries 153 words at once?)
Anyway, you asked for an algorithm and you'll have it. It is quite elegant.
Aho, A.V. - Corasick, M.V.: Efficient String Matching - An Aid to Bibliographic Search, Communication of the ACM, 1975 (vol. 18), No. 6, pg. 333-340
The task: let's have an alphabet X and a string x = d1d2...dn (d's are characters from X) and a set K = {y1, ... ym} of words, where yj = tj,1 ... tj,l(j) (t's are again characters from X).
Now we search for all <i, yp> where yp is the suffix of d1...di (occurences of the word yp in x)
(note: if you want to search for the whole words tj,1 and tj,l(j) must be blanks)
The idea of the algorithm is that we first somehow process words yp to construct a search machine and with this machine we will loop through X to search for occurrences of all the words at once.
Example:
K = {he, she, his, hers}
X = ushers
search machine M(Q - set of states, g - "step forward" function, f - "step back" function, out - reporting function):
(function g)
0 (initial state) h-> state 1 e-> state 2 r-> state 7 s-> state 8 ... for {he, hers}
state 1 i-> state 6 s-> state 7 ... for {his}
state 0 s-> state 3 h-> state 4 e-> state 5 e ... for {she, he}
And for all the characters is defined 0 x -> 0
Now, in
(function out)
state 2: report {he}
state 5: report {she, he}
state 7: report {his}
state 9: report {hers}
"Step back" function f for this particular set of word would be:
9 -> 3, 7 -> 3, 5 -> 2, 4 -> 1 otherwise the machine would return to the initial state 0
Processing of ushers will look like:
<0,0> u-stay in the state 0 <1,0> s-move to state 3 <2,3>, <3,4>, <4,5> state 5-report (he, she}, cannot move forward -> must step back (like if "he" was received) <4,2> r-move to state 8, <5,8>, <6,9>
Before we show how to construct the searching machine M (Q,g,f,out) let抯 consider the algorithm how to use it:
Alg 1.
begin
state:= 0;
for i = 1 to n do
//if cannot move forward, move back
while g(state, di) not defined do state:=f(state) od;
//move forward to a new state
state:=g(state, di);
//report all the words represented by the new state
for all y from out(state) do Report(i,y) od;
od
end.
Alg 2. ?build of the 剆tep forward?function g and an auxiliary function o that will be later used for the construction of outvar q:integer;
procedure Enter(T1匱m);
begin
s:=0; j:=1;
//processing a prefix of a new word that is a prefix of an already processed word too
while j<m and g(s,Tj) defined do
s:=g(s,Tj); j:=j+1;
od;
while j<m do
q:=q+1; //a new state ?global variable
define g(s,Tj) = q; //definition of a single step forward
s:=q;
j:=j+1;
od;
//the last state must be a state when at least the processed word is reported
define o(s) = [T1, ?Tm];
end;
begin
q:=0; //initial state
for p:= 1 to k do Enter(yp) od;
for all d from the alphabet X do
if g(0,d) not defined then define g(0,d) = 0 fi
od
end.
Alg 3. ?build of the 剆tep back?function f and the reporting function outcreate an empty queue
define f(0) = 0; out(0) = {} //an empty set ?we expect words of the length 1 at least
for all d from X do
//process children of the initial state
s:=g(0,d);
if s!=0 then
define f(s) = 0; //1-character states, if we throw away the first character we return to the initial
define out(s):=o(s); //report 1-character words, if any
move s at the end of the queue
fi
od
while queue not empty do
r:= the first member of the queue; remove r from the queue;
for all d from X do //process all the children of r
if g(r,d) defined then
s:= g(r,d); //get a child of r
t:= f(r); //f(r) has already been defined
while g(t,d) not defined do t:=f(t) od;
//we found a state from which g(t,d) has sense
define f(s) = g(t,d);
define out(s) = o(s) UNION with out(f(s));
move s at the end of the queue;
fi
od
od
Processing of a query ?normal forms
Until now we have solved the problem how to search for multiple words in a text at once. The algorithm returns not only not only whether a word was found or not, but also where exactly a word can be found ?all the occurrences and their locations.
However, the initial task was slightly different: procession of a query like 揦 contains (y1 AND/OR y2 ?yn)?In order to decide a question like that it might not be necessary to find all the occurrences of given words, actually not even an occurrence of all the words (e.g. word1 OR word2 is fulfilled as soon as either word1, or word2 is found).
Let抯 suppose that a searching query is given in its disjunctive normal form (DNF):
A1 OR A2 OR ...Ak where each of Ax = B1 AND B2 AND ...Bkx and Byz is a statement 揦 contains yp?br>
Now, the query is successful whenever any of Ax is fulfilled.
(I don抰 know how much you know about transformation of a logical formula to its disjunctive form ?it is quite a famous algorithm and can be found in any textbook of logic or NP-completeness. I hope that evaluation of the formula, which is what happens in the procedure Report of the algorithm Alg. 1, is trivial.)