Java library for large list sorts in small amount of memory

Hi All

Wondering whether anybody would know of a ready made java library for sorting large lists of objects. We will write ourselves if necessary but I'm hoping this is the sort of problem that has been solved a million times and may hopefully have been wrapped behind some nice interface

What we need to do is something along the line of:

Load a large list with objects (duh!)

If that list fits into an allowed amount of memory great. Sort it and end of story.

If it won't then have some mechanism for spilling over onto disk and perform the best sort algorithm possible given some spill over onto disk.

These sorts need to be used in scenarios such as a random selection of 10% of the records in a 100G unsorted file and delivered sorted, with the jobs doing the sorts being limited to as little as 64MB of memory (these are extremes but ones that happen quite regularly).

Some of you might think "just use a database". Been there. Done that. And for reasons beyond the scope of this post its a no goer.

Thanking you for all your help in advance.

[1106 byte] By [Tuckya] at [2007-10-3 2:46:54]
# 1
Don't know if there's a preecooked library out there, but I seem to remember John Bentley's Programming Pealrs book talks about the algorithms for this sort of thing - I guess you'd be using a merge sort.
Michael@Doney.coma at 2007-7-14 20:35:40 > top of Java-index,Java Essentials,Java Programming...
# 2

Of course this kind of sorting was common back in days of yore when we had mainframes with less than 1MB or ram.

The classic algorithm uses serveral files.

1) Load as much as will fit

2) sort it

3) Write it to a temporary file

4) Repeat from 1, altnerating output between two or three temp files.

Your temporary files then each contain a series of sorted sequences (try saying that three times fast).

You then merge these sequences into longer sequences, creating more temp files. Then merge the longer sequences into even longer ones, and so on until you're left with just the one sequence.

Merging doesn't require any significant memory use however long the sequences. You only need to hold one record from each temp file being merged.

malcolmmca at 2007-7-14 20:35:40 > top of Java-index,Java Essentials,Java Programming...
# 3
Thanks for your help. Looks like we'll be building our own using the old spill to temps and merge sort from there.
Tuckya at 2007-7-14 20:35:40 > top of Java-index,Java Essentials,Java Programming...
# 4
Here's one source of code, I believe it's c. But there's more, just google the web http://epaperpress.com/sortsearch/index.html
ChuckBinga at 2007-7-14 20:35:40 > top of Java-index,Java Essentials,Java Programming...
# 5

The classic algorithm is quite a bit more sophisticated that that: it uses a replacement-selection distribution phase which continually reads the input and outputs the lowest record, starting a new file when the next record to be output is smaller than the last, which happens on average every 2N records where N is the memory capacity: this is followed by a balanced or polyphase or cascade merge. It's all in Knuth volume III.

ejpa at 2007-7-14 20:35:40 > top of Java-index,Java Essentials,Java Programming...