Questions regarding Disk I/O
Hey there, I have some questions regarding disk I/O, and I'm fairly new to Java.
I've got an organized 500MB file and a table-like structure (represented by an array) that gives me the byte offsets of sections within the file. With this I'm currently retrieving blocks of data using the following approach:
// Assume id is just some arbitrary int that represents an identifier.
String f = "/scratch/torum/collection.jdx";
int startByte = bytemap[id - 1];
int endByte = bytemap[id];
try {
    FileInputStream stream = new FileInputStream(f);
    DataInputStream in = new DataInputStream(stream);
    in.skipBytes(startByte);
    int position = collectionSize - in.available();
    // Keep looping until the end of the block.
    while (position <= endByte) {
        line = in.readLine();
        // some processing here
        String[] entry = line.split(" ");
        String docid = entry[1];
        int tf = Integer.parseInt(entry[2]);
        // update the current position within the file.
        position = collectionSize - in.available();
    }
} catch (IOException e) {
    e.printStackTrace();
}
This code does EXACTLY what I want it to do, but with one complication: it isn't fast enough. After reading the following article, I gather that BufferedReader is the recommended choice:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
I would love to use this class, but BufferedReader doesn't have the skipBytes() function, which is vital to what I'm trying to do. I'm also aware that I shouldn't really be using the readLine() function of the DataInputStream class.
So could anyone suggest improvements to this code?
Thanks
[2349 byte] By [toruma] at [2007-10-3 3:33:07]

What you actually want is RandomAccessFile.
Wow, thanks for the quick reply.
I've also done some reading on RandomAccessFile (though I don't understand the naming; from what I can see it should be RandomAccessChannel). Anyhow, from what I've read, RandomAccessFile isn't buffered, hence I haven't bothered using it.
Is it actually fast?
Well, anything faster than what I've got is great :D
toruma at 2007-7-14 21:27:27 >

I haven't tried it for performance, but if I need to jump to certain offsets in a file, I use a RAF. If I need to read it sequentially, I use a reader or stream. You might also want to consider specifying a certain buffer size for the underlying FileInputStream. Some say it matters. Try 8096 bytes or something.
FileInputStream doesn't have a buffer at all, and there is no underlying FileInputStream under a RandomAccessFile.
The lack of buffering in a RAF might be a disadvantage, but it is dwarfed into insignificance by the ability to seek() to an arbitrary position in your 500MB file, compared to skip() in a FileInputStream, which is a sequential operation. Try it.
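A minimal sketch of the seek-based approach, assuming the block is delimited by byte offsets [start, end) as in the bytemap from the question and is small enough to hold in memory. The names BlockReader and readBlock are hypothetical, and ISO-8859-1 is an assumed encoding chosen to mimic the byte-to-char mapping of DataInputStream.readLine(). Because a RAF is unbuffered, the sketch pulls the whole block in with one readFully() call and splits it into lines in memory instead of calling readLine() on the file for every line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the code posted in this thread.
final class BlockReader {
    /** Returns the lines stored in bytes [start, end) of the file. */
    static List<String> readBlock(String path, long start, long end) throws IOException {
        // Assumption: a single block fits in an int-sized array.
        byte[] block = new byte[Math.toIntExact(end - start)];
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);       // jump directly to the block
            raf.readFully(block);  // one bulk read; EOFException if the file is too short
        }
        List<String> lines = new ArrayList<>();
        // Split into lines in memory. ISO-8859-1 is an assumed encoding.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(block), StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

With the bytemap from the question, a call would look something like BlockReader.readBlock(f, bytemap[id - 1], bytemap[id]).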
ejpa at 2007-7-14 21:27:27 >

> FileInputStream doesn't have a buffer at all, and
> there is no underlying FileInputStream under a
> RandomAccessFile.
Brain ****, sorry. I was thinking about BufferedInputStream (and adding that). But I didn't mean "underlying" to the RAF, but to the OP's code.
Thanks a lot for the tips, guys. I'm actually just writing a test program of different approaches before I use it in my program. If I can get it done soon, I'll post the results here.
toruma at 2007-7-14 21:27:27 >

Okay, I've got some results, and it turns out DataInputStream is faster...
EDIT: I was wrong. RandomAccessFile becomes a bit faster according to my test code when the block size to read is large.
So I guess I could write two routines in my program: RAF for when the block size is larger than an arbitrary value, and FileInputStream for small blocks.
Here is the code:
public void useRandomAccess() {
    String line = "";
    long start = 1385592, end = 1489808;
    try {
        RandomAccessFile in = new RandomAccessFile(f, "r");
        in.seek(start);
        while (start <= end) {
            line = in.readLine();
            String[] entry = line.split(" ");
            String docid = entry[1];
            int tf = Integer.parseInt(entry[2]);
            start = in.getFilePointer();
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}
public void inputStream() {
    String line = "";
    int startByte = 1385592, endByte = 1489808;
    try {
        FileInputStream stream = new FileInputStream(f);
        DataInputStream in = new DataInputStream(stream);
        in.skipBytes(startByte);
        int position = collectionSize - in.available();
        while (position <= endByte) {
            line = in.readLine();
            String[] entry = line.split(" ");
            String docid = entry[1];
            int tf = Integer.parseInt(entry[2]);
            position = collectionSize - in.available();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
and the main looks like this:
public static void main(String[] args) {
    DiskTest dt = new DiskTest();
    long start = 0;
    long end = 0;
    start = System.currentTimeMillis();
    dt.useRandomAccess();
    end = System.currentTimeMillis();
    System.out.println("Random: " + (end - start) + "ms");
    start = System.currentTimeMillis();
    dt.inputStream();
    end = System.currentTimeMillis();
    System.out.println("Stream: " + (end - start) + "ms");
}
The result:
--
Random: 345ms
Stream: 235ms
--
Hmmm not the kind of result I was hoping for... or is it something I've done wrong?
toruma at 2007-7-14 21:27:27 >

This is only the time to read 100k bytes. I don't think that's enough to form a definitive view of performance.
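One way to get a steadier comparison is to repeat each method several times and average the timings, as in this rough sketch. The Timing class and averageMillis method are hypothetical names, not part of the posted test program. The untimed warm-up runs are there on the assumption that the first pass over the file may pay for disk reads and JIT compilation that later passes do not, which could also skew a test where one method always runs first.

```java
// Hypothetical timing helper, not part of the posted DiskTest class.
final class Timing {
    /**
     * Runs the task `warmups` times without timing it, then returns the
     * mean wall-clock time in milliseconds over `runs` timed executions.
     */
    static double averageMillis(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - t0;
        }
        return totalNanos / (runs * 1_000_000.0);
    }
}
```

In the posted main this could be called as Timing.averageMillis(() -> dt.useRandomAccess(), 3, 20), and likewise for dt.inputStream(); the counts 3 and 20 are arbitrary.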
Where is the 'block size to read'? I don't see it.
You could make your second example faster by interposing a BufferedInputStream between the FileInputStream and the DataInputStream.
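A minimal sketch of that wrapping, assuming the skip is done on the raw FileInputStream first so the buffer starts filling at the block. BufferedBlockStream and openAt are hypothetical names, and the 8192-byte buffer size is an arbitrary choice.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper, not part of the code posted in this thread.
final class BufferedBlockStream {
    /** Opens the file, skips to byte offset `start`, and returns a buffered stream positioned there. */
    static DataInputStream openAt(String path, long start) throws IOException {
        FileInputStream file = new FileInputStream(path);
        try {
            long remaining = start;
            while (remaining > 0) {
                // skip() may skip fewer bytes than requested, so loop until done.
                long skipped = file.skip(remaining);
                if (skipped <= 0) {
                    throw new EOFException("could not skip to offset " + start);
                }
                remaining -= skipped;
            }
        } catch (IOException e) {
            file.close();
            throw e;
        }
        // The BufferedInputStream sits between the file and the DataInputStream.
        return new DataInputStream(new BufferedInputStream(file, 8192));
    }
}
```

With the buffer in place, the position is probably best tracked by counting the bytes consumed (line length plus the line terminator) rather than by calling available() on every iteration, since available() on a file stream is likely to cost a system call each time.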
And you could make the whole thing faster in both examples by processing each line once instead of twice. At the moment you're reading ahead to a newline, then going back over the data to look for spaces. Possibly the use of a Scanner would do that for you in one pass, and you could certainly code it that way yourself.
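A hand-coded single pass might look like the sketch below. It assumes each line has the form "term docid tf", with fields separated by single spaces, a non-negative integer tf, and a '\n' at the end of every line; this is inferred from the entry[1] and entry[2] accesses in the posted code. OnePassParser and Posting are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser, not part of the code posted in this thread.
final class OnePassParser {
    /** One parsed line: the document id (field 1) and term frequency (field 2). */
    static final class Posting {
        final String docid;
        final int tf;
        Posting(String docid, int tf) { this.docid = docid; this.tf = tf; }
    }

    /** Parses up to `length` bytes from `in`, looking at each byte exactly once. */
    static List<Posting> parse(InputStream in, long length) throws IOException {
        List<Posting> out = new ArrayList<>();
        StringBuilder docid = new StringBuilder();
        int field = 0;  // index of the space-separated field being read
        int tf = 0;     // term frequency accumulated digit by digit
        for (long i = 0; i < length; i++) {
            int b = in.read();
            if (b < 0) {
                break;  // end of stream before the block was complete
            }
            if (b == '\n') {
                if (field >= 2) {
                    out.add(new Posting(docid.toString(), tf));
                }
                docid.setLength(0);
                field = 0;
                tf = 0;
            } else if (b == ' ') {
                field++;
            } else if (field == 1) {
                docid.append((char) b);
            } else if (field == 2 && b >= '0' && b <= '9') {
                tf = tf * 10 + (b - '0');  // non-digits such as '\r' are ignored
            }
        }
        return out;
    }
}
```

The input stream should be buffered (for example a BufferedInputStream over the file), otherwise the byte-at-a-time read() calls would each go to the operating system.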
ejpa at 2007-7-14 21:27:27 >
