sorting large CSV file help
I have to sort 180MB csv file with java and I have no idea where to start, I can do it easily using ostermiller utility for reading and writing csv file and put this file into the JTable and then sort it but with large file like 180MB it doesn't work, does anyone have any idea what to do with file like this. I have to sort it once by surname and then by post code and produce two new csv files. Any help would be appreciated. Thx in advance!
One approach is, to read in the CSV content into a List of Lists structure and use this:
/**
* Sort 2-dimensional data structure by a column. The elements of the input table need to be List objects. The first dimension
* (input List) has Lists as elements with each element List representing a row. The length of the (first) row
* is assumed to be the maximum column count. Each subsequent row needs to have at least the same number of elements
* as the first row. This table structure is then sorted by a column index. E.g. a 5x100 table (100 rows a 5 columns)
* can be sorted by the third column by setting 2 as the input columnIndex. The column values of this column are
* sorted via their comparators or their textual representation (toString()). After sorting this column, the rows
* are reordered depending on the new row indexes of the sorted column.
<b>Example:</b>
<table align="center" bgcolor="#E0E0E0" border=1 cellpadding="10" cellspacing="0"><tr><td><pre style="margin-top:0; margin-bottom:0">
Vector table = new Vector();
Vector row = new Vector();
row.add("1"); row.add("one"); row.add("eins");
table.add(row);
row = new Vector();
row.add("2"); row.add("two"); row.add("zwei");
table.add(row);
row = new Vector();
row.add("3"); row.add("three"); row.add("drei");
table.add(row);
List sorted = ICUtils.sortTable(table, 1, true); //sort table by second column
System.out.println("Sorted table: "+ICUtils.toString(sorted));
Class[] types = new Class[] {String.class, int.class};
Object[] values = new Object[] {"Hello", 42};
MyClass myObject = (MyClass) ICUtils.getNewObject("MyClass", types, values);
</pre></td></tr></table>
*
* @param table List of Lists to sort. The list elements of this list are considered to be the rows.
* @param columnIndex Index of inner Vector's element to consider for sorting. (0 = first column)
* @param ascending If true, sort in ascending order. If false, sort in descending order.
* @return Sorted table.
*/
@SuppressWarnings("unchecked")
static public List<List><Object>> sortTable(List<List><Object>> table, int columnIndex, boolean ascending) {
List<List><Object>> result = table;
if (table != null) {
int rowCount = table.size();
if (rowCount > 0) {
List<Object> row = table.get(0);
int columnCount = row.size();
if ((columnIndex >=0) && (columnIndex < columnCount)) { //valid column index?
//1. get column of values to sort:
Vector<Comparable> columnToSort = new Vector<Comparable>(rowCount); //Vector has full clone() implementation
for (int i = 0; i < rowCount; i++) {
row = table.get(i);
Comparable c = null;
Object o = row.get(columnIndex);
//each cell object must be unique, otherwise later comparison fails:
if (o instanceof String) {
o = new String((String) o); //unique strings
row.set(columnIndex, o);
}
if (o instanceof Comparable) {
c = (Comparable) o;
} else {
c = "" + o; //String implements comparable
}
columnToSort.add(c);
}//next row
//2. sort column:
List<? extends Comparable> sortedColumn = (List<? extends Comparable>) columnToSort.clone(); //original column required for later comparison
Collections.sort(sortedColumn); //this is the main sorting of values
//3. get sorted indices:
int[] sortedIndices = new int[rowCount];
int index = 0;
for (int i = 0; i < rowCount; i++) { //for each row of sorted column
Object o = sortedColumn.get(i);
for (int j = 0; j < rowCount; j++) { //for each row of original column
Object o2 = columnToSort.get(j);
if (o == o2) { //object found?
sortedIndices[index++] = j;
break;
}//next comparison
}//next row
}//next row
//4. rearrange table rows:
//append sorted to the end:
for (int i = 0; i < rowCount; i++) { //for each row
result.add(table.get(sortedIndices[i]));
}//next row
//remove old unsorted from the begin:
for (int i = 0; i < rowCount; i++) { //for each row
result.remove(0);
}//next row
//5. sort order:
if (!ascending) {
Collections.reverse(result);
}
}//else: input column index is invalid
}//else: no vetcor elements to sort
}//else: no input vector to sort
return result;
}//sortTable()
180MB csv file is alot to just read into memory and sort. I suspect the previous solution won't work, even with -Xmx1500M jvm parameter.
But you could try to:
* Read lines with LineNumberReader
* Separate line by columns with StringTokenizer
* Encode strings into to UTF-8 byte[]'s - UTF-8 encoded bytes has same sort order as the chars in Strings
* Use Arrays.sort with your own Comparator that compare the right byte[] column
* Write sorted lines with PrintStream
It uses about 4 times memory of the source file size.
Gil
Can you use JDBC and an RDBMS?
Exactly!!! the memory problem is the main issue. But I can use the jdbc as well!
In that case the easiest way of doing it is
1. Create table TEMP_TABLE with suitable fields.
2. For each line of CSV do an INSERT into TEMP_TABLE.
3. SELECT * FROM TEMP_TABLE ORDER BY foo and write to foo.csv.
4. SELECT * FROM TEMP_TABLE ORDER BY bar and write to bar.csv.
5. DROP TABLE TEMP_TABLE.
It seems to me that there are command line utilities that do this. Probably even ones that come with the OS. They often have a semi-obvious name like 'sort'.And if not then I would suppose there are thousands of utilities that one can google for.
/usr/bin/sort is too naive to handle the way CSV escapes work.
Still haven't solved it?Email me at gil_roitto at hotmail.com for simple in-memory 100-line code for sorting csv file.Gil
180 MB is not too much. Well, I am talking about server applications and today's servers at least 1 GB and much more of RAM. However, another solution is to use a TableModel and the TableSorter from here: http://java.sun.com/docs/books/tutorial/uiswing/components/table.html#sorting it's a nice plugin approach for JTable and lightning fast!
Already 180 MB is far too much, even for the "nice plugin". Just to load a table that size into Strings, that is 1 String / cell, will blast 1500MB (max 32 bit JVM mem). And to load it into a TableModel and have JTable ... just try for a 50 MB csv file and you'll see.
1 GB is another issue, than you need either 3rd party library/program or to program a disc-based sort algorithm:
1 For each chunks (e.g. 50 MB data):
Read, Sort and Write each chunk to a separate file for each chunk
2. Merge the sorted chunks files and write it to the result file
Gil
Hello? Why should 180 MB blow up to 1.500 MB? We load multiple megabytes of CSV files into memory and the amount of memory for the resulting String[][] is about twice as high. That would result in about 360 MB. - not too much for a server. By optimizing your data (just load the data lines that really interest you, omit the columns you don't need, store null values instaed of empty strings, etc.) you will get a fraction of that amount. In addition, sorting and searching is very handy with such [][] (Arrays.binarysearch()). Of course JTable is much more memory hungry. So, first inspect what data of the 180 MB file you really need and decide what the real memory requirement is. I guess you don't really need each single string of the original CSV file ...?
Once you have loaded the required data in a table datatstructure like String[][] you can easily sort via
//myCsvData is a String[][] with 1st dimension being the columns and 2nd dimension rows for fast binary search!
Arrays.sort(myCsvData, new MyComparator(columnIndex));
/**
* Custom comparator for String[] objects.
*/
class MyComparator implements Comparator<Object> {
/** Index of column to sort by. */
private int colIndex = 0;
/**
* Constructor.
*
* @param columnIndex Index of column to sort (0 = first column).
*/
public MyComparator(int columnIndex) {
colIndex = columnIndex;
}//MyComparator()
/**
* Implements Comparator.
*/
public int compare(Object o1, Object o2) {
int result = 0;
if (o1 != null && o2 != null) {
String v1 = ((String[]) o1)[colIndex]; //column value of record1
String v2 = ((String[]) o2)[colIndex]; //column value of record2
if (v1 == null && v2 != null) {
result = -1; //v1 < v2
} else if (v1 != null && v2 == null) {
result = 1; //v1 > v2
} else if (v1 != null && v2 != null) {
result = v1.compareTo(v2);
}//else: v1 = v2 = null = equal
}//else: at least 1 record is null (which shouldn't occur ever)
return result;
}//compare()
}//MyComparator
With this data structure, you can search a data row via
int rowIndex = Arrays.binarySearch(data[columnIndex], searchValue); //requires not null values for searchValue
With this code we load roughly 100 MB of 50 CSV files into memory and traverse (searching, checking, updating) tenthousands of (also cached) database records in 20 seconds (incl. JDBC updates).
> Why should 180 MB blow up to 1.500 MB?
Of course depends on the data, but in my case I had an adress table:
firstname, lastname, adress, zip, city
All columns very short strings, no strings could be omitted.
For an average string length of 8, we count file size for each row:
5*8+5=45 bytes (1 byte/char, 5 ';' delimiters)
In Strings:
5 Strings, with 8 chars, 3 int variables:
5* (8+8*2 + 3*4) = 180 bytes (8 bytes: reference and length of char[], 2 bytes/char, 4 bytes/int)
Then some additional bytes for the String[][] structure.
1500MB was exaggerated, but still Strings are a memory killer.
Gil
Loading and sorting 180MB of data in memory is the wrong approach
for this type of problem. Even if you manage to load the data, sorting
could take long.
The easiest way to sort a large CSV file would be to load the
CSV file as a table into a database like oracle.
See oracle external tables which are ideal...
http://www.adp-gmbh.ch/ora/misc/ext_table.html
Alternately if you dont want to use the database at all, you
could use the JDBC driver for CSV files. There are couple
of free implementations ...google for it.
example:
http://csvjdbc.sourceforge.net/
Just to mention...csvjdbc drivers use java based sorting in memory. So you may not see any savings after all.
> The easiest way to sort a large CSV file would be to> load the > CSV file as a table into a database like oracle.> If you have lots of time to wait for the indexing, thats an approach. Sorting in memory will be faster by magnitudes.Gil
> I have to sort 180MB csv file with java and I have no
> idea where to start, I can do it easily using
> ostermiller utility for reading and writing csv file
> and put this file into the JTable and then sort it
> but with large file like 180MB it doesn't work, does
> anyone have any idea what to do with file like this.
> I have to sort it once by surname and then by post
> code and produce two new csv files. Any help would be
> appreciated. Thx in advance!
I don't think anyone has offered this solution:
Create a List class that wraps a File instance. The list will return an isntance of an Object that represents the record in the file.
This a full pass through the file to build an index of the records.
Get pulls from the original file. When add is called, you update the index for the records that becomes the new logical position of the record while the phyisical position does not change.
Pass your List to the Collecions.sort() method. Then iterate over the List and write the sorted file.
if you want to do the sorting purely in Java, I recommend using an algorithm called External Sort. It is widely used to sort huge file. It is basiclly divide and conquer and merge, breaking the file into lots of little chunks and sort then merge. It is quite efficient if implemented correctly.
Do a google search on External Sort on big file.
Here is a link that I used sometime ago to implement my big file external sort.
http://epaperpress.com/sortsearch/index.html
Go to "Very Large File" then External Sort, there is also algorithm B-tree for it, you can check it out. But external sort is better.
Good luck.
yue42a at 2007-7-20 19:57:26 >
