Need sample code to do text search using boolean operators AND OR + -

I'm looking for an algorithm doing text searches in files

I need it to support AND OR + - keywords (for example "ejb AND peristence")

Does anyone knows where I cand find this kind of algorithm with the full source ?

Of course I can adapt C,C++ to Java.

In fact my target language is Serverside javascript (sorry) so I prefere rather low level solutions !

Any help will be grealy appreciated and the resulting code will be posted

here and on my website : http://www.tips4dev.com

[522 byte] By [caenevetc] at [2007-9-27 18:44:35]
# 1
There is a very good text search algorithm by Aho, Corasick. I'll try to search my uni notes to find it and I'll get back to this the next week. Just don't expect a low level solution.
JirMac at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...
# 2
I'm looking forward to hearing from you next week !By low level I meant something that doesn't need other tools/libraries as my target language is a server side javascript
caenevetc at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...
# 3
There is a search engine called Lucene - http://jakarta.apache.org/lucene/docs/index.html
euxx at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...
# 4
write a algorithm?why dont you look at the java.util.regex.Matcher and java.util.regex.Pattern classes?
Poop at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...
# 5

Firstly, a little note to the technical solution: what you probably need the most is speed. I may sound strange, but personally I am convinced that if you could use system tools a naive algorithm:for i:=1 to m do

grep (word[i])

od;

whose complexity is O(m.n), where m is the number of words to be processed and n somehow represents the cardinality of the text to-be-sought-through, so this naive algorithm would actually be in 99% of cases much faster than any implementation of the algorithm below, whose complexity is O(m+n), because the implementation of the grep routine (O(n)) would be optimized and m will be low (who queries 153 words at once?)

Anyway, you asked for an algorithm and you'll have it. It is quite elegant.

Aho, A.V. - Corasick, M.V.: Efficient String Matching - An Aid to Bibliographic Search, Communication of the ACM, 1975 (vol. 18), No. 6, pg. 333-340

The task: let's have an alphabet X and a string x = d1d2...dn (d's are characters from X) and a set K = {y1, ... ym} of words, where yj = tj,1 ... tj,l(j) (t's are again characters from X).

Now we search for all <i, yp> where yp is the suffix of d1...di (occurences of the word yp in x)

(note: if you want to search for the whole words tj,1 and tj,l(j) must be blanks)

The idea of the algorithm is that we first somehow process words yp to construct a search machine and with this machine we will loop through X to search for occurrences of all the words at once.

Example:

K = {he, she, his, hers}

X = ushers

search machine M(Q - set of states, g - "step forward" function, f - "step back" function, out - reporting function):

(function g)

0 (initial state) h-> state 1 e-> state 2 r-> state 7 s-> state 8 ... for {he, hers}

state 1 i-> state 6 s-> state 7 ... for {his}

state 0 s-> state 3 h-> state 4 e-> state 5 e ... for {she, he}

And for all the characters is defined 0 x -> 0

Now, in

(function out)

state 2: report {he}

state 5: report {she, he}

state 7: report {his}

state 9: report {hers}

"Step back" function f for this particular set of word would be:

9 -> 3, 7 -> 3, 5 -> 2, 4 -> 1 otherwise the machine would return to the initial state 0

Processing of ushers will look like:

<0,0> u-stay in the state 0 <1,0> s-move to state 3 <2,3>, <3,4>, <4,5> state 5-report (he, she}, cannot move forward -> must step back (like if "he" was received) <4,2> r-move to state 8, <5,8>, <6,9>

Before we show how to construct the searching machine M (Q,g,f,out) let抯 consider the algorithm how to use it:

Alg 1.

begin

state:= 0;

for i = 1 to n do

//if cannot move forward, move back

while g(state, di) not defined do state:=f(state) od;

//move forward to a new state

state:=g(state, di);

//report all the words represented by the new state

for all y from out(state) do Report(i,y) od;

od

end.

Alg 2. ?build of the 剆tep forward?function g and an auxiliary function o that will be later used for the construction of outvar q:integer;

procedure Enter(T1匱m);

begin

s:=0; j:=1;

//processing a prefix of a new word that is a prefix of an already processed word too

while j&ltm and g(s,Tj) defined do

s:=g(s,Tj); j:=j+1;

od;

while j&ltm do

q:=q+1; //a new state ?global variable

define g(s,Tj) = q; //definition of a single step forward

s:=q;

j:=j+1;

od;

//the last state must be a state when at least the processed word is reported

define o(s) = [T1, ?Tm];

end;

begin

q:=0; //initial state

for p:= 1 to k do Enter(yp) od;

for all d from the alphabet X do

if g(0,d) not defined then define g(0,d) = 0 fi

od

end.

Alg 3. ?build of the 剆tep back?function f and the reporting function outcreate an empty queue

define f(0) = 0; out(0) = {} //an empty set ?we expect words of the length 1 at least

for all d from X do

//process children of the initial state

s:=g(0,d);

if s!=0 then

define f(s) = 0; //1-character states, if we throw away the first character we return to the initial

define out(s):=o(s); //report 1-character words, if any

move s at the end of the queue

fi

od

while queue not empty do

r:= the first member of the queue; remove r from the queue;

for all d from X do //process all the children of r

if g(r,d) defined then

s:= g(r,d); //get a child of r

t:= f(r); //f(r) has already been defined

while g(t,d) not defined do t:=f(t) od;

//we found a state from which g(t,d) has sense

define f(s) = g(t,d);

define out(s) = o(s) UNION with out(f(s));

move s at the end of the queue;

fi

od

od

Processing of a query ?normal forms

Until now we have solved the problem how to search for multiple words in a text at once. The algorithm returns not only not only whether a word was found or not, but also where exactly a word can be found ?all the occurrences and their locations.

However, the initial task was slightly different: procession of a query like 揦 contains (y1 AND/OR y2 ?yn)?In order to decide a question like that it might not be necessary to find all the occurrences of given words, actually not even an occurrence of all the words (e.g. word1 OR word2 is fulfilled as soon as either word1, or word2 is found).

Let抯 suppose that a searching query is given in its disjunctive normal form (DNF):

A1 OR A2 OR ...Ak where each of Ax = B1 AND B2 AND ...Bkx and Byz is a statement 揦 contains yp?br>

Now, the query is successful whenever any of Ax is fulfilled.

(I don抰 know how much you know about transformation of a logical formula to its disjunctive form ?it is quite a famous algorithm and can be found in any textbook of logic or NP-completeness. I hope that evaluation of the formula, which is what happens in the procedure Report of the algorithm Alg. 1, is trivial.)

JirMac at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...
# 6
I think you owe that guy some dukes ;-)
macrules2 at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...
# 7
...at your service :-)
JirMac at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...
# 8
Cool! Sometimes it seem to people not reading anything: books, documentation, manuals, specifications but think of themselves as of great programmers (just look at Java Programmingforum)... ;-) But now I see it's wrong, somebody still knows who's Alfred Aho...
i_am_neuron at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...
# 9
Thanks for the algorithm.I'm working on it
caenevetc at 2007-7-6 19:58:29 > top of Java-index,Other Topics,Algorithms...