fun question - minimizing IO

Assume you have "n" sorted files, each of size "B".

Each sorted file contains the same number of

fixed length records. Each record has length "r".

You want to merge these files into a single sorted

file. Assume you have an array of length "B" to use

in memory.

What is the function that describes the minimum

number of IO operations required to merge the files?

I'll post the full solution in a couple of weeks. Post

just your answer if you think you solved it. Good luck.

[534 byte] By [rkippena] at [2007-10-3 9:34:51]
# 1
I'll give a Duke Dollar for a correct answer.
rkippena at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 2
B == number of bytes (r * num records)?Or B == number of records?I assume the former, but the initial approach I'm thinking of cares more abut numRecords than numBytes, so I want to confirm.
jverda at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 3
Nice to see you're up for the challenge jverd. Yes the former:B == number of bytes (r * num records)In other words:number of records per file = B / r
rkippena at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 4

iff B>n*RECORD_SIZE then you could try the following (if you can open all n-files at once.

read first record from all files. Put them in array B (actaully a tree would probably be prefered so they are pre sorted)

sort them

write out the lowest record.

read the next record from the file corresponding to the record you last wrote.

wash rinse repeat.

Note that this does not minimise IO explicitly as you are potentially jumping around amoungst files alot (reading the next record)

I only suggest it because you might find that in this case the OS file caching works to your benifit.

matfud

matfuda at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 5

> iff B>n*RECORD_SIZE then you could try the following

The idea is to minimize the IO "calls". So the following two calls

are treated as equivalent:

in.read();

in.read(byte[] arr, int offset, int maxRead);

My understanding is that there is huge overhead associated with any IO call. Thus, doing a read call for every record is going to be slow. It is much more efficient to read in as much data per call and write out as much data per call as possible.

I've never heard of OS file caching.

rkippena at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 6

Reading from files can be expensive. Opening files is also expensive.

The reason I suggested the above approach is

a) its very simple to implement

b) The Operating system (OS) caches, on your behalf, the contents of files you are reading.

The algorithm opens all files once (you can not do better then this) and relys on the OS caching the content of files you are reading to reduce the actaul IO you have to perform.

I'm not in any way suggesting that it explicitly minimizese IO calls to the OS but it may, due to OS level caching (which is relatively cheap), significatly reduce physical IO access (which is relatively expensive).

It should at least be used as a reference point for comparison.

Note that this approach uses significantly more memory then specifed. However that memory is not allocated by the program. It is handled by the OS. It technically meets the needs of the requirement but probably not the spirit of the requirement :)

cheers

matfud

matfuda at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 7
In addition it can be improved by loading more then one record form each file at a time. That takes more in-program memory though.It may be feasible if n*RECORD_SIZE is significatly less then BCheersmatfud
matfuda at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 8

some additional optimiaztions:

chose a subset, g, of the "n" files and merge those recursively.

From each file you load B/g records. (sorted as before).

write out the lowest records.

When you no longer have records from a particular file then load the next batch from that file.

matfud

Note this will require more files to be created and more files to be opened. Then the previous ideas. It may be faster though.

I'm basing this on the idea that processing time is significantly less important than IO.

NOTE that your mileage may vary

matfuda at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 9
> I'm basing this on the idea that processing time is significantly> less important than IO.The point in minimizing IO is to get the fastest processing time.
rkippena at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 10

Consider the case where n is large and B is small. For example n is a million and B is 1. You have a million files each with a single byte (and presumably a single record). The difficulty is the interpretation of the restriction of having a read write buffer of size B.

If you can only read in one single byte you can't do any direct comparisons. So you are forced into some sort of digital sorting scheme like where you just coun't how many times each byte occurs and output that byte the proper number of time.

When I need to output 300 bytes that are all the value 17, am I forced to do 300 single byte io writes because my read write butter is only one byte long?

Another problem, am I allowed to keep an integer array of byte counts? If so I only need to open each file once and update the appropriate bytecount - without this little bit of auxilary storage I am forced to open and read each file 256 times, once for each byte that I am counting.

The thing that is absurd about this is that in a typical algorithm design no one quibbles over a litte bit of fixed storage, but in ths problem as it was stated, where your buffer space was being limited to but a single byte it seems odd to permit a couple K of auxilary storage elsewhere, but the impact on the anaylysis is huge. It just seems too wierd to be looking at a problem where I can somehow keep track of a million file names and have but a single byte of working memory to work with.

So my conclusion was that the problem as stated was ill framed and did not have a "reasonable" solution. I decided to just wait to see what you posted later, since you clearly had something in mind. I assumed that your "solution" that you would post later would clear up what unstated assumptions you were making.

Then you posted another thread where you restate your IO question and considerably changed the problem. Since my other comments relate to that problem I'll see you over there in that other thread.

marlin314a at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...
# 11

> The difficulty is the interpretation of the restriction of having a read

> write buffer of size B.

Typically B will be as large as possible, e.g., whatever the JVM can handle without throwing an OutOfMemoryError. I don't think it's

difficult to comprehend having a limited amount of memory.

> If you can only read in one single byte you can't do any direct comparisons.

If it helps, you can replace the variables with real numbers.

I think this approach works for middle school students who

don't yet grasp the concept of a variable.

>When I need to output 300 bytes that are all the value 17,

See above.

> So my conclusion was that the problem as stated was ill framed

> and did not have a "reasonable" solution

If you have any reasonable questions or statements, I'd be happy

to provide clarification.

rkippena at 2007-7-15 4:50:11 > top of Java-index,Other Topics,Algorithms...