Optimized searching from large pool of flat files

Hi,

I need to optimize a searching algorithm. We have a large pool of flat text files (unicode format) having various lines in each file. Every single file represents data about some microchip. Every piece of data in a line is separated with a space. Each line contains 11 fields and we mostly search on first three fields. Its worth nothing here that first 3 fieldss contain numbers only.

This large pool of files is increasing day by day (currently it reached to 1.2 Gb of size!!). I want some very efficient searching technique to search fields in smallest possible time period.

Note: Currently i have employed simple sequential searching (read one file, scan every line one by one etc).

[723 byte] By [rak78a] at [2007-10-2 11:01:35]
# 1
Sounds like a job for a...~!~DATABASE~!~And, honestly, writing the code to convert your files to a database will probably be a lot simpler than implementing your own DB on top of the files.~Cheers
Adeodatusa at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 2

Y es thats the reply I was expecting, but let me tell you that we can't employ database here for some critical reasons. Actually this data represents GSM sim card information that has been shipped with SIM cards from manufacturer. All this data resides on files (as produced by) circuitry when burnings chips.

So, we can't afford to hire a team of KPOs for data entry!! Rather it would be very suitable to write something efficient algorithm for searching from this file.

Secondly, database has its own techniques (like hashing, indexing etc) to perform search smoothly, so if DB was the solution, I don't need to bother guys here at this forum. HELP Please!!

rak78a at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 3

I'd still put the data in a database. Sounds like they are in a machine readable format so they can be inserted mechanically without an army of data entry people.

But if you insist on doing it a harder way: read the text in the files a line at a time. Compare fields. Print line if matches.

If your first thought is efficiency perform some measurements first. The bottleneck is likely to be your disk drive, not the search program. Any reasonable modern CPU can do plain old linear text search faster than all but multichannel multicontroller RAID arrays can deliver. Of course a properly indexed database is still magnitudes faster.

sjasjaa at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 4

> So, we can't afford to hire a team of KPOs for data

> entry!! Rather it would be very suitable to write

> something efficient algorithm for searching from this

> file.

A very efficient algorithm for searching in large numbers of files? I'm sorry, but it still seems like you're trying to implement a database. I mean, if you can parse the files you have already, you can put that information into a DB automatically. And really, using a DB is a lot easier than making one and then using it.

> Secondly, database has its own techniques (like

> hashing, indexing etc) to perform search smoothly, so

> if DB was the solution, I don't need to bother guys

> here at this forum. HELP Please!!

There's no reason you couldn't use hashing and indexing and etc. for your search, just that somebody's already invented the wheel for you.

If you are really set on implementing your own database, you can start (I guess) by storing field files: all the files with some property P get their name stored in a file called P. Thus when you want to find them, you go to file P instead of searching through all files for ones with the property P.

~Cheers

p.s. Don't re-invent the wheel.

Adeodatusa at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 5

Part of the reason to recommend a database is that you have used a very vague term. You want to do searching.

That one word can cover a whole lot of territory. Do you want to do boolean searching "all records haveing This and That but not whatever"

Do you need to do range searching? "all values greater than 4 and less than 137" Do you need to do partial value matching like "foo*.bat"

If you need to do a bunch of complicated matches, sometimes this and sometimes that, well, that is what databases are for.

If on the other hand you really want to do something simple, you need to tell us the really simple thing that you want to do and then, perhaps, we can suggest to you a simple kludge that will do what you want.

I am going to guess that if you do have a simple search and it is based primarily on three numeric fields, that you wil end up doing something like building an index, meaning a single file, that lists the 3 values that you like from each of your millions of files so that instead of opening and reading a million files you read a single one and search through that.

But unless what you want to do is really simple, I would have to agree with the advice that it sounds like a DB problem.

marlin314a at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 6

> This large pool of files is increasing day by day

> (currently it reached to 1.2 Gb of size!!).

You call that big? That's nothing. I suggest using grep or egrep for simple searching.

> I want

> some very efficient searching technique to search

> fields in smallest possible time period.

The smallest possible, huh? Where is that requirement coming from? So if you can do it in 1 nanosecond but it's possible to do it in half that time you want to do the latter even if it costs much more to develop? You need to figure out what is fast enough and work from there.

dubwaia at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 7

First of all, I also agree with the rest in favour of a DB, maybe the lucene search engine or so.

You still can optimize in the query part, when you have a sufficient number of parallel requests.

Say you have one query worker searching files F[0] .. F[n-1] for query Q[k].

It momentarily is inspecting F[f].

Now a query Q[k+1] is wanted.

The first query worker can add Q[k+1] to Q[k] for F[f]..F[n-1], severely reducing file operations. Only requiring a worker for F[0]..F[f].

So in general you have query workers reading files for a set of queries,yielding a result for a range of files.

joop_eggena at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 8

I don't know if this is what he is looking for in reference to "grep" but here is something I wrote. /**A simple grep too used to search text files.

*@author Matthew DeLong

**/

import java.io.*;

import java.util.StringTokenizer;

public class MyGrep{

public static void main(String[] args){

String inFile = null;

String inText = null;

if(args.length < 2){

System.out.println("You must select a file, and text to read!");

try{

InputStreamReader reader = new InputStreamReader(System.in);

BufferedReader console = new BufferedReader(reader);

System.out.print("Please type your input file: ");

inFile = console.readLine();

InputStreamReader reader2 = new InputStreamReader(System.in);

BufferedReader console2 = new BufferedReader(reader2);

System.out.print("Please type your input text to find: ");

inText = console2.readLine();

}

catch(IOException e){

System.err.println("Problem with input file: " + e);

System.exit(1);

}

}

else{

inFile = args[0];

inText = args[1];

}

//readers for input file

BufferedReader fileOne = null;

int count = 0;

try {

//checks to see if the inText matches soemthing from the line.

int lineNum = 0;

fileOne = new BufferedReader(new FileReader(inFile));

String line = fileOne.readLine();

while(line != null){

lineNum++;

int textSize = inText.length();//size of the inText

for(int i=0;i<line.length();i++){

//System.out.println(i+" "+ line + " " +lineNum);

if(i>=line.length()-textSize){

if(line.substring(i,line.length()).equals(inText)){

count++;

System.out.println(lineNum+": "+line);

}

}

if(i<line.length()-textSize){

if(line.substring(i,i+textSize).equals(inText)){

count++;

System.out.println(lineNum+": "+line);

}

}

}

line = fileOne.readLine();

}

System.out.println("Lines containing " +inText+": "+count);

}

catch(FileNotFoundException e){

System.err.println("File not found: " + e.toString());

}

catch(IOException e){

System.err.println("Error with file: " + e.toString());

}

catch(IndexOutOfBoundsException e){

System.err.println("Index is out of Bounds: " + e.toString());

e.printStackTrace();

}

catch(NullPointerException e){

System.err.println("Error with String(NullPointerException):" + e);

e.printStackTrace();

}

finally{

try {

fileOne.close();

}

catch(IOException e){

System.err.println("Error closing file" + e);

}

}

}

}

>

MattD_4a at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 9
> Note: Currently i have employed simple> sequential searching (read one file, scan every line> one by one etc).Sorry, I missed that little note.
MattD_4a at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...
# 10
> Sorry, I missed that little note.That's not all you missed. This thread is two months old.
Adeodatusa at 2007-7-13 3:32:37 > top of Java-index,Other Topics,Algorithms...