Search for string within files

I have thousands of XML files stored on a Unix machine. I need to extract only those XML files that contain certain patterns of numbers. Right now I have 300 such distinct patterns to search for across those files.

I am looking for an efficient way to do this so that the search time is minimized.

I used the Unix "find" command as shown below, but this single command takes a very long time!

find . -exec grep -q "1-1Y3I93L" '{}' \; -print

where 1-1Y3I93L is the pattern I need to search for in the XML files.

Can anyone help me with this? Please tell me how to do it efficiently.

[709 byte] By [rak78a] at [2007-10-3 9:54:56]
# 1

Do you think some Java code (an algorithm implemented in Java) can outperform Linux' find command? I'm not saying it's impossible, just highly unlikely.

This sounds an awful lot like an earlier thread of yours (http://forum.java.sun.com/thread.jspa?threadID=703695). The answers given then are probably suitable for this one, I guess.

How did you tackle that problem, btw? By using a database?

prometheuzza at 2007-7-15 5:12:45 > top of Java-index,Other Topics,Algorithms...
# 2

I am already trying grep and egrep, but these tools sometimes fail because of the huge number of files. I get the error "too many parameters"!

So again we would need to write a shell script on the Unix machines to do that. My problem here is not the searching but fast searching. That's why I am asking for opinions.

rak78a at 2007-7-15 5:12:45
# 3
> ...
> Thats why I m asking opinions.

Ok, how about a DATABASE?
prometheuzza at 2007-7-15 5:12:45
# 4

Perhaps I misunderstand the question, but if you have thousands of files and three hundred patterns and you are using a unix find utility which looks for a single pattern, you must read each file 300 times to see if it contains any of the 300 patterns.

If that is the case you can minimize your reading of files by writing a routine that will look for any of the 300 patterns in a single read of a file. This will certainly be quicker than doing 300 reads of each single file.
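As a rough illustration of that idea, here is a minimal Java sketch. It assumes the patterns are plain ASCII substrings (like 1-1Y3I93L) rather than regular expressions, and that each file fits in memory. The class and method names (SinglePassSearch, containsAny, matchingFiles) are made up for this example. Each file is read from disk once; the 300 checks then run against the in-memory copy.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: names and structure are illustrative only.
class SinglePassSearch {

    // True if the text contains at least one of the patterns (plain substrings, not regexes).
    static boolean containsAny(String text, List<String> patterns) {
        for (String pattern : patterns) {
            if (text.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    // Reads every regular file under root exactly once and returns those containing any pattern.
    static List<Path> matchingFiles(Path root, List<String> patterns) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Path> hits = new ArrayList<>();
        for (Path file : files) {
            // ISO-8859-1 maps every byte to one char, so decoding cannot fail;
            // this assumes the patterns themselves are plain ASCII.
            String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
            if (containsAny(text, patterns)) {
                hits.add(file);
            }
        }
        return hits;
    }
}
```

Calling SinglePassSearch.matchingFiles(root, patterns) with the directory holding the XML files and the list of 300 patterns would return the matching paths. The disk is touched once per file, although the in-memory scan is still repeated per pattern.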

Of course, the other thing you can do, which may or may not be of use in your particular situation, is to build an index. IF your XML files are not changing all the time and IF your search patterns are not changing all the time, you can keep track of which patterns were found in which files. Then, to find the files containing any given pattern, you just look in the index.

The index only makes sense if the data is either static or slowly changing, and you clearly must do all the work to build the index up front. But again, depending on your situation, an index may be useful.
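A minimal sketch of such an index in Java, assuming the pattern list is fixed up front and file contents are passed in as strings. PatternIndex, addFile and filesContaining are names invented for this example, and a real index would also have to be persisted and refreshed when files change.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory index: pattern -> names of the files that contain it.
class PatternIndex {
    private final List<String> patterns;
    private final Map<String, Set<String>> filesByPattern = new HashMap<>();

    PatternIndex(List<String> patterns) {
        this.patterns = patterns;
    }

    // Up-front work: record every known pattern that occurs in this file's content.
    void addFile(String fileName, String content) {
        for (String pattern : patterns) {
            if (content.contains(pattern)) {
                filesByPattern.computeIfAbsent(pattern, k -> new TreeSet<>()).add(fileName);
            }
        }
    }

    // Lookup: no file is read here, only the map is consulted.
    Set<String> filesContaining(String pattern) {
        return filesByPattern.getOrDefault(pattern, Collections.<String>emptySet());
    }
}
```

After the index is built, each query is a single map lookup. The cost has moved to build time, which is why this only pays off for static or slowly changing data.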

marlin314a at 2007-7-15 5:12:45
# 5

> I have thousands of xml files stored at some location
> of unix machine. I need to extract only those xml
> files which have some particular pattern of numbers
> stored. Right now, i have 300 such distinct patterns
> to be searched among those xml files.

This is very easy with the Knuth-Morris-Pratt algorithm. Each file can be read just once and the KMP algorithm applied once per pattern. The total number of character comparisons will be less than "total file size" + "total pattern size".

Correction - the total number of character comparisons is not easy to estimate, since it depends heavily on the content of the patterns, but it will be about "total file size" * "number of patterns".
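For reference, a textbook-style KMP search in Java might look like the sketch below. This is not the program that produced the timings in this thread; the class name Kmp and its methods are chosen for illustration. To apply it as described, load each file once and call indexOf once per pattern.

```java
// Illustrative Knuth-Morris-Pratt substring search (not the code used for the timings in this thread).
class Kmp {

    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] failureTable(String pattern) {
        int[] failure = new int[pattern.length()];
        int k = 0; // length of the currently matched prefix
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            failure[i] = k;
        }
        return failure;
    }

    // Returns the index of the first occurrence of pattern in text, or -1 if there is none.
    static int indexOf(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] failure = failureTable(pattern);
        int k = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            // On a mismatch, fall back in the pattern; the text index never moves backwards.
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                return i - k + 1;
            }
        }
        return -1;
    }
}
```

Because the text index only moves forward, one search is linear in the text length plus the pattern length. Repeating it for every pattern is what gives the "file size" * "number of patterns" estimate.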

Message was edited by: sabre150

sabre150a at 2007-7-15 5:12:45
# 6
Using the KMP approach with 300 random words, my program output is: 1148 hits from 2397 files totaling 9047393 bytes, taking 43 seconds.
sabre150a at 2007-7-15 5:12:45