sorting large CSV file help

I have to sort 180MB csv file with java and I have no idea where to start, I can do it easily using ostermiller utility for reading and writing csv file and put this file into the JTable and then sort it but with large file like 180MB it doesn't work, does anyone have any idea what to do with file like this. I have to sort it once by surname and then by post code and produce two new csv files. Any help would be appreciated. Thx in advance!

[451 byte] By [kris_javaa] at [2007-10-2 8:54:54]
# 1

One approach is, to read in the CSV content into a List of Lists structure and use this:

/**

* Sort 2-dimensional data structure by a column. The elements of the input table need to be List objects. The first dimension

* (input List) has Lists as elements with each element List representing a row. The length of the (first) row

* is assumed to be the maximum column count. Each subsequent row needs to have at least the same number of elements

* as the first row. This table structure is then sorted by a column index. E.g. a 5x100 table (100 rows a 5 columns)

* can be sorted by the third column by setting 2 as the input columnIndex. The column values of this column are

* sorted via their comparators or their textual representation (toString()). After sorting this column, the rows

* are reordered depending on the new row indexes of the sorted column.

<b>Example:</b>

<table align="center" bgcolor="#E0E0E0" border=1 cellpadding="10" cellspacing="0"><tr><td><pre style="margin-top:0; margin-bottom:0">

Vector table = new Vector();

Vector row = new Vector();

row.add("1"); row.add("one"); row.add("eins");

table.add(row);

row = new Vector();

row.add("2"); row.add("two"); row.add("zwei");

table.add(row);

row = new Vector();

row.add("3"); row.add("three"); row.add("drei");

table.add(row);

List sorted = ICUtils.sortTable(table, 1, true); //sort table by second column

System.out.println("Sorted table: "+ICUtils.toString(sorted));

Class[] types = new Class[] {String.class, int.class};

Object[] values = new Object[] {"Hello", 42};

MyClass myObject = (MyClass) ICUtils.getNewObject("MyClass", types, values);

</pre></td></tr></table>

*

* @param table List of Lists to sort. The list elements of this list are considered to be the rows.

* @param columnIndex Index of inner Vector's element to consider for sorting. (0 = first column)

* @param ascending If true, sort in ascending order. If false, sort in descending order.

* @return Sorted table.

*/

@SuppressWarnings("unchecked")

static public List<List><Object>> sortTable(List<List><Object>> table, int columnIndex, boolean ascending) {

List<List><Object>> result = table;

if (table != null) {

int rowCount = table.size();

if (rowCount > 0) {

List<Object> row = table.get(0);

int columnCount = row.size();

if ((columnIndex >=0) && (columnIndex < columnCount)) { //valid column index?

//1. get column of values to sort:

Vector<Comparable> columnToSort = new Vector<Comparable>(rowCount); //Vector has full clone() implementation

for (int i = 0; i < rowCount; i++) {

row = table.get(i);

Comparable c = null;

Object o = row.get(columnIndex);

//each cell object must be unique, otherwise later comparison fails:

if (o instanceof String) {

o = new String((String) o); //unique strings

row.set(columnIndex, o);

}

if (o instanceof Comparable) {

c = (Comparable) o;

} else {

c = "" + o; //String implements comparable

}

columnToSort.add(c);

}//next row

//2. sort column:

List<? extends Comparable> sortedColumn = (List<? extends Comparable>) columnToSort.clone(); //original column required for later comparison

Collections.sort(sortedColumn); //this is the main sorting of values

//3. get sorted indices:

int[] sortedIndices = new int[rowCount];

int index = 0;

for (int i = 0; i < rowCount; i++) { //for each row of sorted column

Object o = sortedColumn.get(i);

for (int j = 0; j < rowCount; j++) { //for each row of original column

Object o2 = columnToSort.get(j);

if (o == o2) { //object found?

sortedIndices[index++] = j;

break;

}//next comparison

}//next row

}//next row

//4. rearrange table rows:

//append sorted to the end:

for (int i = 0; i < rowCount; i++) { //for each row

result.add(table.get(sortedIndices[i]));

}//next row

//remove old unsorted from the begin:

for (int i = 0; i < rowCount; i++) { //for each row

result.remove(0);

}//next row

//5. sort order:

if (!ascending) {

Collections.reverse(result);

}

}//else: input column index is invalid

}//else: no vetcor elements to sort

}//else: no input vector to sort

return result;

}//sortTable()

MartinHilperta at 2007-7-16 22:58:52 > top of Java-index,Other Topics,Algorithms...
# 2

180MB csv file is alot to just read into memory and sort. I suspect the previous solution won't work, even with -Xmx1500M jvm parameter.

But you could try to:

* Read lines with LineNumberReader

* Separate line by columns with StringTokenizer

* Encode strings into to UTF-8 byte[]'s - UTF-8 encoded bytes has same sort order as the chars in Strings

* Use Arrays.sort with your own Comparator that compare the right byte[] column

* Write sorted lines with PrintStream

It uses about 4 times memory of the source file size.

Gil

gilroittoa at 2007-7-16 22:58:52 > top of Java-index,Other Topics,Algorithms...
# 3
Can you use JDBC and an RDBMS?
YAT_Archivista at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 4
Exactly!!! the memory problem is the main issue. But I can use the jdbc as well!
kris_javaa at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 5

In that case the easiest way of doing it is

1. Create table TEMP_TABLE with suitable fields.

2. For each line of CSV do an INSERT into TEMP_TABLE.

3. SELECT * FROM TEMP_TABLE ORDER BY foo and write to foo.csv.

4. SELECT * FROM TEMP_TABLE ORDER BY bar and write to bar.csv.

5. DROP TABLE TEMP_TABLE.

YAT_Archivista at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 6
It seems to me that there are command line utilities that do this. Probably even ones that come with the OS. They often have a semi-obvious name like 'sort'.And if not then I would suppose there are thousands of utilities that one can google for.
jschella at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 7
/usr/bin/sort is too naive to handle the way CSV escapes work.
YAT_Archivista at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 8
Still haven't solved it?Email me at gil_roitto at hotmail.com for simple in-memory 100-line code for sorting csv file.Gil
gilroittoa at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 9

180 MB is not too much. Well, I am talking about server applications and today's servers at least 1 GB and much more of RAM. However, another solution is to use a TableModel and the TableSorter from here: http://java.sun.com/docs/books/tutorial/uiswing/components/table.html#sorting it's a nice plugin approach for JTable and lightning fast!

MartinHilperta at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 10

Already 180 MB is far too much, even for the "nice plugin". Just to load a table that size into Strings, that is 1 String / cell, will blast 1500MB (max 32 bit JVM mem). And to load it into a TableModel and have JTable ... just try for a 50 MB csv file and you'll see.

1 GB is another issue, than you need either 3rd party library/program or to program a disc-based sort algorithm:

1 For each chunks (e.g. 50 MB data):

Read, Sort and Write each chunk to a separate file for each chunk

2. Merge the sorted chunks files and write it to the result file

Gil

gilroittoa at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 11

Hello? Why should 180 MB blow up to 1.500 MB? We load multiple megabytes of CSV files into memory and the amount of memory for the resulting String[][] is about twice as high. That would result in about 360 MB. - not too much for a server. By optimizing your data (just load the data lines that really interest you, omit the columns you don't need, store null values instaed of empty strings, etc.) you will get a fraction of that amount. In addition, sorting and searching is very handy with such [][] (Arrays.binarysearch()). Of course JTable is much more memory hungry. So, first inspect what data of the 180 MB file you really need and decide what the real memory requirement is. I guess you don't really need each single string of the original CSV file ...?

MartinHilperta at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 12

Once you have loaded the required data in a table datatstructure like String[][] you can easily sort via

//myCsvData is a String[][] with 1st dimension being the columns and 2nd dimension rows for fast binary search!

Arrays.sort(myCsvData, new MyComparator(columnIndex));

/**

* Custom comparator for String[] objects.

*/

class MyComparator implements Comparator<Object> {

/** Index of column to sort by. */

private int colIndex = 0;

/**

* Constructor.

*

* @param columnIndex Index of column to sort (0 = first column).

*/

public MyComparator(int columnIndex) {

colIndex = columnIndex;

}//MyComparator()

/**

* Implements Comparator.

*/

public int compare(Object o1, Object o2) {

int result = 0;

if (o1 != null && o2 != null) {

String v1 = ((String[]) o1)[colIndex]; //column value of record1

String v2 = ((String[]) o2)[colIndex]; //column value of record2

if (v1 == null && v2 != null) {

result = -1; //v1 < v2

} else if (v1 != null && v2 == null) {

result = 1; //v1 > v2

} else if (v1 != null && v2 != null) {

result = v1.compareTo(v2);

}//else: v1 = v2 = null = equal

}//else: at least 1 record is null (which shouldn't occur ever)

return result;

}//compare()

}//MyComparator

With this data structure, you can search a data row via

int rowIndex = Arrays.binarySearch(data[columnIndex], searchValue); //requires not null values for searchValue

With this code we load roughly 100 MB of 50 CSV files into memory and traverse (searching, checking, updating) tenthousands of (also cached) database records in 20 seconds (incl. JDBC updates).

MartinHilperta at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 13

> Why should 180 MB blow up to 1.500 MB?

Of course depends on the data, but in my case I had an adress table:

firstname, lastname, adress, zip, city

All columns very short strings, no strings could be omitted.

For an average string length of 8, we count file size for each row:

5*8+5=45 bytes (1 byte/char, 5 ';' delimiters)

In Strings:

5 Strings, with 8 chars, 3 int variables:

5* (8+8*2 + 3*4) = 180 bytes (8 bytes: reference and length of char[], 2 bytes/char, 4 bytes/int)

Then some additional bytes for the String[][] structure.

1500MB was exaggerated, but still Strings are a memory killer.

Gil

gilroittoa at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 14

Loading and sorting 180MB of data in memory is the wrong approach

for this type of problem. Even if you manage to load the data, sorting

could take long.

The easiest way to sort a large CSV file would be to load the

CSV file as a table into a database like oracle.

See oracle external tables which are ideal...

http://www.adp-gmbh.ch/ora/misc/ext_table.html

Alternately if you dont want to use the database at all, you

could use the JDBC driver for CSV files. There are couple

of free implementations ...google for it.

example:

http://csvjdbc.sourceforge.net/

sudhisha at 2007-7-16 22:58:53 > top of Java-index,Other Topics,Algorithms...
# 15
Just to mention...csvjdbc drivers use java based sorting in memory. So you may not see any savings after all.
sudhisha at 2007-7-20 19:57:26 > top of Java-index,Other Topics,Algorithms...
# 16
> The easiest way to sort a large CSV file would be to> load the > CSV file as a table into a database like oracle.> If you have lots of time to wait for the indexing, thats an approach. Sorting in memory will be faster by magnitudes.Gil
gilroittoa at 2007-7-20 19:57:26 > top of Java-index,Other Topics,Algorithms...
# 17

> I have to sort 180MB csv file with java and I have no

> idea where to start, I can do it easily using

> ostermiller utility for reading and writing csv file

> and put this file into the JTable and then sort it

> but with large file like 180MB it doesn't work, does

> anyone have any idea what to do with file like this.

> I have to sort it once by surname and then by post

> code and produce two new csv files. Any help would be

> appreciated. Thx in advance!

I don't think anyone has offered this solution:

Create a List class that wraps a File instance. The list will return an isntance of an Object that represents the record in the file.

This a full pass through the file to build an index of the records.

Get pulls from the original file. When add is called, you update the index for the records that becomes the new logical position of the record while the phyisical position does not change.

Pass your List to the Collecions.sort() method. Then iterate over the List and write the sorted file.

dubwaia at 2007-7-20 19:57:26 > top of Java-index,Other Topics,Algorithms...
# 18

if you want to do the sorting purely in Java, I recommend using an algorithm called External Sort. It is widely used to sort huge file. It is basiclly divide and conquer and merge, breaking the file into lots of little chunks and sort then merge. It is quite efficient if implemented correctly.

Do a google search on External Sort on big file.

Here is a link that I used sometime ago to implement my big file external sort.

http://epaperpress.com/sortsearch/index.html

Go to "Very Large File" then External Sort, there is also algorithm B-tree for it, you can check it out. But external sort is better.

Good luck.

yue42a at 2007-7-20 19:57:26 > top of Java-index,Other Topics,Algorithms...