zip distributes over aggregation?

hi,

for the zip algorithm in java.util.zip, does size(zip(A) + zip(B)) ≈ size(zip(A+B))?

I've run the code below on a few directories of a couple of thousand files each, and the results more or less say yes. This seems surprising, because I'd expected zip(A+B) to compress more heavily whenever A and B share some characteristic that the zip algorithm picks up on (e.g. if they were identical, you'd need only just over the space to zip one of them). That might not matter much for just two files, but in the case where you're saying

size(zip(f1...fn)) ≈ size(zip(f1) + ... + zip(fn))

and n is fairly large, I'd expected the effect to become significant.

import java.io.*;
import java.util.*;
import java.util.zip.*;

class ZipOutputStreamTest {

    public static void main(String[] arg) throws IOException {
        File base = new File(arg[0]); // dir to zip
        Set<File> allfiles = getAllFiles(base);

        // zip all files into a single archive
        zipfiles(allfiles, new File("c:/allfiles.zip"));

        // zip all files individually, one archive per file
        File newbase = new File("c:/allfiles");
        newbase.mkdirs();
        int count = 0;
        for (File f : allfiles) {
            // a running counter gives unique names; f.hashCode() could collide
            zipfiles(Collections.singleton(f), new File(newbase, (count++) + ".zip"));
        }
    }

    // recursively collects every regular file below base
    static Set<File> getAllFiles(File base) throws IOException {
        Set<File> result = new HashSet<File>();
        if (base.isDirectory()) {
            File[] child = base.listFiles();
            for (int i = 0; child != null && i < child.length; i++) {
                result.addAll(getAllFiles(child[i]));
            }
        } else {
            result.add(base);
        }
        return result;
    }

    // writes the given files into one zip archive at output
    public static void zipfiles(Set<File> files, File output) throws IOException {
        byte[] buffer = new byte[1024 * 1024]; // allocated once, reused for every file
        // try-with-resources closes the streams even if an exception is thrown
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(output))) {
            for (File f : files) {
                try (FileInputStream fis = new FileInputStream(f)) {
                    zos.putNextEntry(new ZipEntry(f.getPath()));
                    for (int n; (n = fis.read(buffer)) != -1; ) {
                        zos.write(buffer, 0, n);
                    }
                    zos.closeEntry();
                }
            }
        }
    }
}
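The expectation described above (two identical inputs should cost barely more than one) can also be checked in memory, without touching the file system. The following is only a sketch: it calls java.util.zip.Deflater directly rather than going through ZipOutputStream, and the class name DeflateWindowSketch, the helper names, the seeded pseudo-random test data and the two input sizes are all invented for illustration.

```java
import java.util.Random;
import java.util.zip.Deflater;

class DeflateWindowSketch {

    // Returns the number of bytes Deflater produces for the given input.
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Returns the input followed by a second copy of itself (A+A).
    static byte[] twice(byte[] a) {
        byte[] result = new byte[a.length * 2];
        System.arraycopy(a, 0, result, 0, a.length);
        System.arraycopy(a, 0, result, a.length, a.length);
        return result;
    }

    // Pseudo-random bytes (hypothetical test data): incompressible on their
    // own, so any saving can only come from the repeated copy.
    static byte[] randomBytes(int length) {
        byte[] data = new byte[length];
        new Random(42).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        for (int length : new int[] {1000, 100000}) {
            byte[] a = randomBytes(length);
            System.out.println(length + " bytes: deflate(A) = " + deflatedSize(a)
                    + ", deflate(A+A) = " + deflatedSize(twice(a)));
        }
    }
}
```

For the small input the second copy should be nearly free, because it falls inside deflate's 32 KB look-back window. For the 100,000-byte input the repeat lies too far back to be found, so the output should come out close to twice the size.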

are there other compression libraries that would have this property?

thanks,

asjf

[4232 byte] By [asjfa] at [2007-9-29 23:35:58]
# 1

You probably want "solid" compression. Many compression formats and utilities support it (RarSoft's RAR, for example), but the zip file format does not support it directly, because each entry is compressed independently. However, you can work around this by creating a zip archive of all your files without compression (in "store" mode) and then compressing the resulting single file. This should give you some improvement in size over a plain zip of all your files, especially when you have a large number of small files. For large files it will not help much, because deflate, the compression method zip normally uses, only looks back over a 32 KB window anyway.
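A minimal in-memory sketch of that workaround, under some assumptions: the class name SolidZipSketch, its helper methods and the generated sample files are hypothetical, and the inner archive is written with setLevel(Deflater.NO_COMPRESSION) rather than true STORED entries, because STORED entries need their size and CRC set before putNextEntry. The outer compression step here is gzip, but any single-stream compressor would do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class SolidZipSketch {

    // Builds a zip archive in memory, compressing each entry at the given level.
    static byte[] zip(Map<String, byte[]> entries, int level) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.setLevel(level);
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                zos.putNextEntry(new ZipEntry(e.getKey()));
                zos.write(e.getValue(), 0, e.getValue().length);
                zos.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    // Compresses a byte array as one gzip stream.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data, 0, data.length);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    // "Solid" variant: uncompressed zip first, then one compression pass over the whole archive.
    static byte[] solid(Map<String, byte[]> entries) {
        return gzip(zip(entries, Deflater.NO_COMPRESSION));
    }

    // Hypothetical sample data: many small files that share most of their content.
    static Map<String, byte[]> sampleFiles(int count) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String text = "common header line shared by every file\nrecord " + i + "\n";
            files.put("file" + i + ".txt", text.getBytes(StandardCharsets.UTF_8));
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = sampleFiles(200);
        System.out.println("plain zip: " + zip(files, Deflater.DEFAULT_COMPRESSION).length + " bytes");
        System.out.println("solid:     " + solid(files).length + " bytes");
    }
}
```

With many small, similar files the solid variant should come out clearly smaller, since the per-entry headers and the shared content are compressed together. How much real data gains depends entirely on how similar the files are.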

elizarova at 2007-7-16 4:02:04 > top of Java-index,Other Topics,Algorithms...
# 2

You can "sort" your files using some similarity measure, so that similar data ends up close together. As far as I know, the 7-Zip compression utility does something along these lines in solid mode: it groups files of similar type together and then compresses them as one stream, by default with LZMA (not LZW).

To check whether two files are exactly equal, compare their MD5 hashes instead of comparing every file against every other. If two files have the same hash, the probability that they are different is very, very small, even with thousands of files in the same directory. (The exact probability comes from the "birthday paradox" calculation; see Wolfram MathWorld.) If two files are equal, simply store a marker pointing to the first copy instead of storing the duplicate in full.
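The hash-based duplicate check could look roughly like this. It is a sketch only: the class name DuplicateFinderSketch and its methods are made up for illustration, and it works on byte arrays already in memory, whereas real code would read the files from disk and, if certainty matters, confirm equal hashes with a byte-by-byte comparison.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DuplicateFinderSketch {

    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("MD5 not available", ex);
        }
    }

    // Groups file names by the hash of their content; a group with more
    // than one name is a set of (almost certainly) identical files.
    static Map<String, List<String>> groupByHash(Map<String, byte[]> files) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            groups.computeIfAbsent(md5Hex(e.getValue()), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: a.txt and b.txt have the same content.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "same content".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "same content".getBytes(StandardCharsets.UTF_8));
        files.put("c.txt", "other content".getBytes(StandardCharsets.UTF_8));
        for (List<String> names : groupByHash(files).values()) {
            if (names.size() > 1) {
                System.out.println("store once, reference the rest: " + names);
            }
        }
    }
}
```

Each file is hashed once, so the cost grows linearly with the number of files instead of quadratically as it would with pairwise comparison.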

edsonwa at 2007-7-16 4:02:04
# 3

What data do you have? Is the issue that you generate lots of files in which some information is the same (descriptions, etc.) but the data is different? Perhaps try generating the descriptions only once, then generate only raw data files, and add the description to each set of data files you send out. I have no idea whether this is appropriate for you, though.

silk.ma at 2007-7-16 4:02:04
# 4

Thanks for the replies. I've experimented with zipping the files in STORE mode and then zipping the result, and it turns out that the data I'm dealing with doesn't compress much better that way (~1MB gain off a total of ~150MB). I was worried that there might be some big gains that were going to be missed :)

asjf

asjfa at 2007-7-16 4:02:04