search algorithms for searching a body of text
Im looking for an efficent search algorithm which searches a body of text (under 3,000 words) looking for keywords.
What would be the best algorithm to employ.
The body of text may be a XML document (in which case id like to be able to search the XML elements eg search for 'Alan Turing' in the element tag author)
Thanks in advance for any pointers
To clarify a few points.The document will be one ive never seen before.I'd prefer to be able to search the XML b4 I parse it (if the keyowrds dont match, then XML doc will not be parsed)Ill be using DOM parsing.Thanks
Would the following be of any interest to you? http://www.ics.uci.edu/~eppstein/161/960227.htmlIt's the good ol' Knuth-Morris-Pratt algorithm; this on realy pays off when you're looking for smallpatterns in big text strings/files.kind regards,Jos
> What would be the best algorithm to employ.
> Ill be using DOM parsing.
DOM is definitely the wrong way if you wan't to use the "best" algorithm, since DOM takes alot of resources. AND you cannot "cancel" a DOM-parsing before the structure is entirely read, parsed and put into the DOM structure. A Sax-parsing allows you at any point terminate the parsing when you have gotten the xml-tags you needed to read to determine "author" or whatever you wanted to determine. And if the conditions to read the entire xml-document are true then you just continue to read til the end of the document.
Gil
> Would the following be of any interest to you?
> http://www.ics.uci.edu/~eppstein/161/960227.html
> It's the good ol' Knuth-Morris-Pratt algorithm; this
> on realy pays off when you're looking for small
> patterns in big text strings/files.
>
Wasn't very impressed by the performance O(n) in worst case and in average of the algorithm above. Here is some code that performs O(n) (n=length of data, m=length of sought pattern) in worst case and O(n/m+m*h) (h=found hits of m) in statistical average. That is: if looking for "Pattern" in a large file, we only look at every 7th byte in the file unless there is a hit. This can be tuned further. One cool thing is that this algorithm very easily be adapted to look for multiple sought patterns with pretty much the same performance.
/**
* @return positions in _data the sought _find was found at
*/
public static int[] match(byte[] _data, int _offset, int _length, byte[] _find) {
if(_offset+_length>_data.length){
throw new Exception("Illegal offset and length for given data");
}
//index for all this is the byte-code
int[] count=new int[256];//count occurances in find
int[][] positions=new int[256][]; //points out positions in find
//build count and positions
// - must be done in reverse order to get the results in right order
for (int i = _find.length-1; i >=0; i--) {
int b=_find[i]+128;
int[] pos=positions[b];
if(pos==null){
pos=positions[b]=new int[1];
}else if(pos.length<=count[b]){
pos=positions[b]=doubleVector(positions[b]);
}
pos[count[b]++]=i;
}
//result
int[] result=new int[1];
int index=0;
//start parsing
int max=_offset+_length;
int jump=_find.length;
//Starts looking from first position. Always checks the last postion
for(int i = _offset; i < max; i=((i+jump>=max && i!=max-1) ? max-1 : i+jump)) {
if(count[_data[i]+128]>0){
int d=_data[i]+128;
for(int j=0; j<count[d]; j++){//try to match from this position on
boolean match=true;
int startData=i-positions[d][j];
int findPos=0;
if(startData><0){
startData=0;
findPos= startData-(i-positions[d][j]);
}
for(int k=startData; match && k<_data.length && findPos<_find.length; k++){
match=_data[k]==_find[findPos++];
}
if(match){
if(index>=result.length){
result=doubleVector(result);
}
result[index++]=i-positions[d][j];
}
}
}
}
if(index!=result.length){
result=resizeVector(result,index);
}
return result;
}
public static int[] resizeVector(int[] _vector, int _newSize){
int[] newVector=new int[_newSize];
System.arraycopy(_vector,0,newVector,0,(_vector.length<_newSize ? _vector.length : _newSize));
return newVector;
}
public static int[] doubleVector(int[] _vector){
return resizeVector(_vector, _vector.length*2);
}
Gil
Important note about above code: Will give partial results that occurs at the beginning or the end of data. Eg if data is "Hello World!" and we search for "But Hello", it will return position -4 as a hit. This can easily be filtered or used when reading chunks from a large
Ive finally decided on the Boyer-moore string matching algorithm
Thanks gilroitto for yr helpful responses, btw Im using DOM as I will need to manipulate the XML document elements and this is difficult with SAX.
Thanks JosHorsmeier for suggesting Knuth-Morris-Pratt algorithm.
mellowmoose
Hmm, never seen Boyer-Moore before. But it was really good. I think the only time my previous algorithm is better is for searching multiple patterns. I couldn't resist implementing the Boyer-Moore algorithm based on the suggestion here:
http://www.blarg.com/~doyle/bmi.html
public int[] bmh(byte[] data, byte[] pattern){
int [] shift = makeShift(pattern);
return bmh(data, pattern, shift);
}
public int[] makeShift(byte[] pattern){
int m=pattern.length;
int [] shift = new int[ 256 ];
//Preprocess. Fill the shift array.
for ( int k = 0; k < 256; k++ )
shift[ k ] = m;
for ( int k = 0; k < m - 1; k++ ) {
int ndx = pattern[ k ] + 128;
shift[ ndx ] = m - 1 - k;
}
return shift;
}
/**
* @return positions in data of found patterns
*/
public static int[] bmh(byte[] data, byte[] pattern, int[] shift){
intm = pattern.length,
n = data.length;
//result
int[] result=new int[1];
int resultIndex=0;
// Search for pattern in data.
int ndx=0;
for(int i = 0; i <= n - m; i += shift[ ndx ] ){
for(int j= m - 1; data[ i + j ] == pattern[ j ] ; j--){
if ( j <= 0 ){
if(resultIndex>=result.length){
result=doubleVector(result);
}
result[resultIndex++]=i;
break;
}
}
// Shift right using the current last letter in data.
ndx = data[ i + m - 1 ] + 128;
}
if(resultIndex!=result.length){
result=resizeVector(result,resultIndex);
}
return result;
}
public static int[] resizeVector(int[] _vector, int _newSize){
int[] newVector=new int[_newSize];
System.arraycopy(_vector,0,newVector,0,(_vector.length<_newSize ? _vector.length : _newSize));
return newVector;
}
public static int[] doubleVector(int[] _vector){
return resizeVector(_vector, _vector.length<<1);
}
Gil
