Help - Sorting large number of Strings within a file.
Hello friends,
I believe I am in an interesting situation and I need your expertise.
I have to sort (efficiently) more than 100,000,000 words in a text file separated by new line "/n".
Initially, I wanted to read the words and store them in an array (or any other data structure). Unfortunately, I cannot even create an array of String in that size - I get the error "Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit" Obviously I'm running out of VM.
Thus I came to understanding that I cannot keep this many Strings in a data structure in the first place and never mind sorting them.
My immediate solution was to keep these words in the file and sort them within that file without requiring so much virtual memory.
However, I cannot think of any efficient way to sort these guys within the file with out indexing.
Do you have any suggestions or other ideas? Because, I do not want to use insertion or selection sort which I suspect to be a bad development choice.
Thanks for your time people.
[1093 byte] By [
nomadturka] at [2007-11-27 11:56:49]

could use regex or split pretty sure that would work well and who in there right mind puts 100,000,000 words in a document?
Some OS'es come with a "sort" utility into which more effort has been put than into a, say, Java prgogram written from scratch.
why not put them in a database and then select (using a sort or order by) them back into a second file
> why not put them in a database and then select (using
> a sort or order by) them back into a second file
Kill them all and let God sort them out.
Or use a database.
1. Break the file in to manageable chunks.
2. For each chunk of the file do a sort on those words and write it to a new temporary file.
3. Then from each pair of files take the smallest element and write it to a new file, then write the other element, repeating until both files are merged into one file.
4. Repeat step 3 until all of the files have been merged.
It may not be nice or even close to a great solution, but it should give you ideas on how to process a large file when you can't store it all in memory at once. Try to use large chunks of the file to minimize disk IO.
> 1. Break the file in to manageable chunks.
> 2. For each chunk of the file do a sort on those
> words and write it to a new temporary file.
> 3. Then from each pair of files take the smallest
> element and write it to a new file, then write the
> other element, repeating until both files are merged
> into one file.
> 4. Repeat step 3 until all of the files have been
> merged.
>
> It may not be nice or even close to a great solution,
> but it should give you ideas on how to process a
> large file when you can't store it all in memory at
> once. Try to use large chunks of the file to minimize
> disk IO.
Yikes! I think the database idea was better than this!
> I have to sort (efficiently) more than 100,000,000
> words in a text file separated by new line "/n".
Are you really sure you need to sort them ?
> Initially, I wanted to read the words and store them
> in an array (or any other data structure).
> Unfortunately, I cannot even create an array of
> String in that size - I get the error "Exception in
> thread "main" java.lang.OutOfMemoryError: Requested
> array size exceeds VM limit" Obviously I'm running
> out of VM.
You can increase the heap size available
-Xms<size>set initial Java heap size
-Xmx<size>set maximum Java heap size
-Xss<size>set java thread stack size
However with a data set of that size I pretty sure that still won't help.
> Thus I came to understanding that I cannot keep this
> many Strings in a data structure in the first place
> and never mind sorting them.
Assuming you have plenty of real memory in your machine you could probably get them into a dynamic data structure like a HashMap. http://java.sun.com/j2se/1.5.0/docs/api/java/util/HashMap.html
> However, I cannot think of any efficient way to sort
> these guys within the file with out indexing.
I not surprise, doing that would probably win you a Fields medal.
> Do you have any suggestions or other ideas? Because,
> I do not want to use insertion or selection sort
> which I suspect to be a bad development choice.
You need to choose an O(n log n) sort from list found here:
http://en.wikipedia.org/wiki/Sorting_algorithm
Or better still solve you problem without sorting the data set.
I wouldn't do any of that. I would go to the command line and simply use the "sort" command that's built into your operating system.