How unique is CRC32 checksum
I want to be able to uniquely identify a file by it's contents so I read the file and use java.utils.zip.CRC32 to compute the checksum value. I store the checksum value and compare any new file to the saved checksums to prevent the same file from being processed more than once. What is the possibility the I could have two files with different contents that return the same checksum? Here is my code to computes the checksum.
public static long computeChecksumForFile(File file) throws IOException
{
byte[] buf = new byte[1024];
int bytesRead;
FileInputStream fis = new FileInputStream(file);
CRC32 crc = new CRC32();
while (-1 != (bytesRead = fis.read(buf))) {
crc.update(buf, 0, bytesRead);
}
fis.close();
return crc.getValue();
}
[822 byte] By [
johnR] at [2007-9-27 19:45:07]

Well, CRC32 generates a 32-bit checksum. Since there are only 2^32 possible values, if you have more than 2^32 files you're guaranteed to have a collision. My understanding is that CRC32 makes collisions unlikely, but that a determined attacker can deliberately generate two files with the same checksum pretty easily.
Here's more info than you ever wanted about checksums:
http://www.freesoft.org/CIE/RFC/bynum.cgi?1071
Enjoy!
Does anyone know of a better algorithm to uniqually identify a file by it's contents?
johnR at 2007-7-6 23:08:00 >

A coworker suggested MD5 hash. Any opinions or information on which algorithm better suits my purpose?Here is the code for MD5.
public static String computeHashForFile(File file) throws IOException,
NoSuchAlgorithmException, NoSuchProviderException
{
byte[] buf = new byte[1024];
int bytesRead;
FileInputStream fis = new FileInputStream(file);
MessageDigest md5 = MessageDigest.getInstance("MD5","SUN");
while (-1 != (bytesRead = fis.read(buf)))
{
md5.update(buf, 0, bytesRead);
}
fis.close();
return toHexString(md5.digest());
}
/**
* This method just assembles bytes into a hex represented string.
*/
public static String toHexString(byte[] data)
{
StringBuffer result = new StringBuffer();
String hexMap[] = {"0","1","2","3","4","5","6","7","8","9",
"a","b","c","d","e","f"};
for (int i = 0; i < data.length; i++)
{
int l = ((((int)data)) & 0xF0) >> 4;
int r = ((int)data) & 0x0F;
result.append(hexMap[l] + hexMap[r]);
}
return result.toString();
}
johnR at 2007-7-6 23:08:00 >

MD5 is designed to address the weaknesses in CRC32. Here's a good discussion to check out:
http://deesse.univ-lemans.fr:8003/Connected/RFC/1510/69.html
The relative statement from RFC1510:
" The RSA-MD5 checksum calculates a checksum using the RSA MD5 algorithm [16]. The algorithm takes as input an input message of arbitrary length and produces as output a 128-bit (16 octet) checksum. RSA-MD5 is believed to be collision-proof."
Sounds like a good bet to me...
I believe MD5 has now been broken, although I'm not sure where I got that from. One technique which would work moderately well is to pick any block cipher you like and break your file into blocks. Then denoting C(N, K) as encrypting N under key K, you compute (assuming for the purposes of example, although it should work without subject to appropriate modifications, that the key length is the same as the block length)
E1 = C(D1, D1) where D1 is the first block of the file, etc.
E2 = C(D2, E1)
E3 = C(D3, E2)
...
En = C(Dn, En-1).
En is then your hash. If you're not worried about someone trying to deliberately mess you about, this should be fine. If you are, it'll need proper cryptanalysis for the cipher chosen. But if you are, most certainly don't use CRC32, because that's linear.
Alternatively, rather than use D1 as the initial key, use a randomly generated value, and use the same randomly generated value for all the files. By repeating with different randomly generated values, you can get the probability of collision extremely low. I venture to hope that if the block length is B, the probability of collision between two given, different, files is 1 in 2^B per key and keys are independent of each other. Thus the probability of collision between two given, different, files would be 1 in 2^(Bn) where n is the number of keys you use.
Thanks for the additional information. My purpose was not security. I just wanted to prevent processing the same file more than one. In this case someone else creates and names the files so if they mistakenly named two different files the same on different days I would know they are different and process them. The other case would be they mistakenly sent the same file to my application multiple times either with the same name or different names I would know that the file(s) had previously been processed and I could skip any further processing.
johnR at 2007-7-6 23:08:00 >

There is no way that you can create a 'signature' for a file where it can't theoritically map to the same value for a different file. At least not if the signature is shorter than the original file.
However, the real question is how realistic it is for the given problem domain.
> In this case someone else creates and names the files so if they
> mistakenly named two different files the same on different days
So evaluate your problem domain. How many files are created in a single day? How likely is it that they will rename the same file on the same day? How likely is it that they will send the same file from a different day (yesterday, a week ago, 15 years ago.)
Say you have 100 files a day. And for any given day it they might only duplicate 1 file (1%) and 0.5% for the preceding day. And that is it.
So with 2^32 combinations there is a 1/2^32 chance that any file will have the same signature. And there is 1.5% chance that it could occur in any day. So in a single year that gives a probability of
(365 * 0.015)/2^32
And that (presuming my rusty skills at probability and my calculations are right) ends up as a one billionth (slightly more) chance for a whole year that a signature might be duplicated.
A "total" solution could be to
- first compare the size of the files
- if some files have the same size
then compare some checksum [CRC -cheap?- or MD5 -expensive?-]
- if some files have also the same checksum
then you have to compare their content
This way, you have exactly zero chance to make a mistake.
> - if some files have also the same checksum>then you have to compare their contentOf course then you do have to keep the old files around.
Can you not just store a flag somewhere in the file content saying "I have been processed"? Regards,Tim
hianybody have any idea on implementing MD5 in JAva 1.1.8. thank youpria
> hi> anybody have any idea on implementing MD5 in JAva> 1.1.8. I would suppose that one would first figure out the algorithm and then write code to do it.
Why not just usefile.lastModified()
> Why not just use> file.lastModified()What happens if the file content comes through on a socket and the last mod time is not part of the delivery?