Two files are the same if they have the same content.
Finding out whether two files are the same would entail actually comparing the contents of the two files. If there is some library function I can use, or some code that I can borrow, it would save me time and effort. Also, what about non-text files? How does one go about testing them for equality?
I'd take a look at the DataInputStream class.
If two files have different checksums, they are different. If two files have the same checksum, they may or may not have the same contents. To be sure, you will have to compare byte by byte.
You also may look for a way to compare the lengths first; if they have different sizes, they can't have the same contents.
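For what it's worth, newer JDKs do ship a library call for this. Below is a minimal sketch, assuming Java 12 or later, that checks the sizes first and then lets `java.nio.file.Files.mismatch` do the byte comparison. The class and method names (`LibraryCompare`, `sameContent`) are made up for illustration. It works on raw bytes, so text and non-text files are treated alike.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class LibraryCompare {
    // Hypothetical helper: true when both files hold exactly the same bytes.
    static boolean sameContent(Path a, Path b) throws IOException {
        // Cheap check first: files of different sizes cannot have the same contents.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        // Files.mismatch returns the position of the first differing byte,
        // or -1 when no mismatch is found.
        return Files.mismatch(a, b) == -1L;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LibraryCompare fileA fileB");
            return;
        }
        System.out.println(sameContent(Path.of(args[0]), Path.of(args[1])) ? "same" : "diff");
    }
}
```

On older JDKs this call is not available, so a hand-written byte comparison is still needed there.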
> I'd take a look at the DataInputStream class.
>
> If two files have different checksums, they are
> different. If two files have the same checksum, they
> may or may not have the same contents. To be sure,
> you will have to compare byte by byte.
>
> You also may look for a way to compare the lengths
> first; if they have different sizes, they can't have
> the same contents.
If it were really that easy, then quite a few fools wasted their time writing all these algorithms: http://en.wikipedia.org/wiki/Check_sum
Here's how I'd do it:
A and B are files.
open A and B
if (A.length != B.length) return false
read files A and B into a couple of byte arrays AA and BB
if (AA.checksum != BB.checksum) return false
if (mode == SLOW) {
    if (AA.contents != BB.contents) return false
}
return true
notes:
* Here's a fast way to slurp a file into a byte array.
long start = System.currentTimeMillis();
byte[] bufferFS = new byte[(int)file.length()];
InputStream fis = new FileInputStream(file);
//ByteArrayOutputStream baos = new ByteArrayOutputStream();
// note: a single read() is not guaranteed to fill the whole array
int size = fis.read(bufferFS);
//baos.write(bufferFS, 0, size);
fis.close();
//baos.close();
System.out.println("size fis read: " + (System.currentTimeMillis()-start) );
* I just googled "java checksum" and found an abundance. This was top of the list http://www.rgagnon.com/javadetails/java-0416.html and it looks pretty good to me
> CHECKSUM buddy..
You keep interjecting with this with obviously no clue of what doing this involves.
It's a dumb idea.
Here's why.
- to create a checksum for the file you need to read ALL the bytes of the file, i.e. this method gains nothing over comparing each byte anyway.
- as mentioned, there is a possibility, however remote, that different files could have the same checksum. While this is unlikely, it does mean that this method is less accurate than comparing each byte.
So really if you have a suggestion great. But don't keep harping on things especially when you don't know what you are talking about.
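To make the collision point above concrete, here is a tiny sketch. It uses `String.hashCode` rather than a real file checksum (that substitution is mine, chosen because the collision is easy to verify by hand), and the class name `HashCollision` is hypothetical. The principle is the same for any fixed-size checksum: there are more possible inputs than possible checksum values, so equal checksums cannot prove equal contents.

```java
class HashCollision {
    // Hypothetical helper: true when two different strings share a hash code.
    static boolean collides(String first, String second) {
        return !first.equals(second) && first.hashCode() == second.hashCode();
    }

    public static void main(String[] args) {
        // "Aa" hashes to 65*31 + 97 = 2112 and "BB" hashes to 66*31 + 66 = 2112.
        System.out.println("Aa".hashCode());      // prints 2112
        System.out.println("BB".hashCode());      // prints 2112
        System.out.println(collides("Aa", "BB")); // prints true
    }
}
```

Stronger hashes such as MD5 make accidental collisions far less likely, but they do not change the basic argument.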
> Why re invent the wheel...
> checking two files say a 100 MB in size byte by byte
> will be a major hog on the system.
Again with this.
You should learn about what you are talking about before you start making negative comments about other code.
Your method is also much worse for performance.
Given two files.
1) you compare lengths first, if different then they are different
2) compare byte by byte in both files. as soon as you hit a different byte you are done
With your method you MUST read all of both files.
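The two steps above can be sketched roughly as follows. This is only an illustration, assuming Java 7 or later for try-with-resources; the names `EarlyExitCompare` and `sameContent` are made up. The point is that the loop returns at the first differing byte, so two large files that differ early are never read in full.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class EarlyExitCompare {
    // Hypothetical helper: length check first, then byte by byte with early exit.
    static boolean sameContent(File a, File b) throws IOException {
        // 1) different lengths means different files, without reading anything
        if (a.length() != b.length()) {
            return false;
        }
        // 2) compare byte by byte; the buffered streams keep single-byte reads cheap
        try (InputStream inA = new BufferedInputStream(new FileInputStream(a));
             InputStream inB = new BufferedInputStream(new FileInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false; // first difference found, stop reading
                }
            }
            // A ended; the files match only if B has ended as well
            return inB.read() == -1;
        }
    }
}
```

A checksum-based comparison has no such shortcut: both files have to be read to the end before the checksums can even be compared.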
> But then wats the real use of check sum... there
> seems to be quite a lot of algorithms here ?
Yes checksums and other hashes have their uses but just not in this application.
If you already had a checksum for the files then comparing checksums would obviously be quicker.
For example I have a server with a file which I have already computed the checksum for. You download the file from my server and then get the checksum I got as well. Then you compute the checksum on your side. Now you compare my checksum and yours, if they are the same then you can be relatively certain the file was not corrupted during the download process.
In that model getting the checksum is better than having to compare byte by byte over the network.
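The download scenario might look something like the sketch below on the client side. It assumes the server publishes the checksum as a hex string; the names `DownloadCheck`, `hexDigest` and `matchesPublished` are hypothetical, and the algorithm name is passed straight to `MessageDigest.getInstance` (for example "MD5" or "SHA-256").

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DownloadCheck {
    // Computes the digest of a file and returns it as lowercase hex.
    static String hexDigest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[4096];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n); // feed only the bytes actually read
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Compares the local digest with the one published by the server (assumed hex).
    static boolean matchesPublished(Path downloaded, String publishedHex, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        return hexDigest(downloaded, algorithm).equalsIgnoreCase(publishedHex.trim());
    }
}
```

Only the short hex string crosses the network a second time, which is why this beats a remote byte-by-byte comparison.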
Another use for file checksums is for intrusion detection systems on servers. In this model checksums for all (or some) files that are not expected to change on the system are calculated and stored. At scheduled intervals the checksums for these files are recalculated and compared with the stored values. If the checksums are different then the files have changed (which may in this case indicate that an intruder has broken into the system or that the system has been compromised by a virus or trojan).
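A toy version of that intrusion-detection model could look like the sketch below. Everything in it is an assumption made for illustration: the class `IntegrityMonitor`, its methods, and the choice of `java.util.zip.CRC32` (picked only because it is short to write; a real system would want a cryptographic hash, since CRC32 is easy to forge on purpose).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

class IntegrityMonitor {
    // Stored checksums for files that are not expected to change.
    private final Map<Path, Long> baseline = new HashMap<>();

    static long checksum(Path file) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(file)); // fine for small files; stream large ones
        return crc.getValue();
    }

    // Calculate and store the checksum of a watched file.
    void record(Path file) throws IOException {
        baseline.put(file, checksum(file));
    }

    // Recalculate every checksum and report the files whose value has changed.
    List<Path> changedFiles() throws IOException {
        List<Path> changed = new ArrayList<>();
        for (Map.Entry<Path, Long> entry : baseline.entrySet()) {
            if (checksum(entry.getKey()) != entry.getValue().longValue()) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

Here the stored checksum stands in for an old copy of the file that no longer exists, so there is nothing to compare byte by byte against.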
Anyway, there are other uses for file checksums and other hashing type algorithms for sure, just in this case it's redundant.
cotton (shock horror) was right. It's faster just to compare the two files byte by byte, even if you slurp the whole of both files. It would be quicker again to compare the two files buffer by buffer... but I'll leave version 2 up to you.
package forums;

import java.io.File;
import java.util.Arrays;
import krc.io.Md5Utilz;
import krc.io.FileUtilz;

class CompareFiles
{
    public static void main(String[] args) {
        if (args.length != 2) System.exit(2);
        long start = System.currentTimeMillis();
        int retval = 1;
        try {
            File a = new File(args[0]);
            File b = new File(args[1]);
            // File.equals only compares path strings, so resolve to canonical files first
            if ( a.getCanonicalFile().equals(b.getCanonicalFile()) ) {
                System.out.println("a and b are the same physical file "+a.getCanonicalPath());
                retval = 0; // same file, so the contents are trivially the same
            } else if ( a.length() != b.length() ) {
                System.out.println("a and b are different sizes.");
            //} else if ( !Md5Utilz.getChecksum(args[0]).equals(Md5Utilz.getChecksum(args[1])) ) {
            //    System.out.println("a and b have different checksums.");
            } else if ( !Arrays.equals(FileUtilz.readBytes(args[0]), FileUtilz.readBytes(args[1])) ) {
                System.out.println("a and b have different contents.");
            } else {
                System.out.println("a and b are the same.");
                retval = 0;
            }
        } catch (Exception e) {
            e.printStackTrace();
            retval = 2;
        }
        System.out.println("took "+(System.currentTimeMillis()-start));
        System.exit(retval);
    }
}
package krc.io;

import java.io.InputStream;
import java.io.FileInputStream;
import java.security.MessageDigest;

//http://www.rgagnon.com/javadetails/java-0416.html
//http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/Checksum.html
public abstract class Md5Utilz {

    private static byte[] createMd5Checksum(String filename)
        throws Exception
    {
        byte[] bfr = new byte[4096];
        InputStream fis = new FileInputStream(filename);
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        int size;
        do {
            size = fis.read(bfr);
            if(size > 0) md5.update(bfr, 0, size);
        } while (size != -1);
        fis.close();
        return md5.digest();
    }

    public static String getChecksum(String filename)
        throws Exception
    {
        byte[] b = createMd5Checksum(filename);
        StringBuffer sb = new StringBuffer(b.length);
        for (int i=0; i<b.length; i++) {
            // adding 0x100 forces three hex digits; dropping the first leaves two, zero-padded
            sb.append(Integer.toString( (b[i] & 0xff) + 0x100, 16).substring( 1 ) );
        }
        return sb.toString();
    }
}
package krc.io;

import java.util.Collection;
import java.util.List;
import java.util.ArrayList;
import java.io.File;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.FileInputStream;

public abstract class FileUtilz
{
    public static boolean verboseMode = false;

    public static void writeFile(String content, String filename)
        throws IOException
    {
        PrintWriter out = null;
        try {
            out = new PrintWriter(new FileWriter(filename));
            out.write(content);
        } finally {
            try {if(out!=null)out.close();}catch(Exception e){}
        }
    }

    public static String readFile(String filename)
        throws IOException, FileNotFoundException
    {
        FileReader in = null;
        StringBuffer out = new StringBuffer();
        try {
            in = new FileReader(filename);
            char[] cbuf = new char[4096];
            int n = in.read(cbuf, 0, 4096);
            while(n > 0) {
                // append only the chars actually read, not the whole buffer
                out.append(cbuf, 0, n);
                n = in.read(cbuf, 0, 4096);
            }
        } finally {
            try {if(in!=null)in.close();}catch(Exception e){}
        }
        return out.toString();
    }

    public static String[] readFileIntoArray(String filename)
        throws IOException, FileNotFoundException
    {
        return readFileIntoList(filename).toArray(new String[0]);
    }

    public static List<String> readFileIntoList(String filename)
        throws IOException, FileNotFoundException
    {
        BufferedReader in = null;
        List<String> out = new ArrayList<String>();
        try {
            in = new BufferedReader(new FileReader(filename));
            String line = null;
            while ( (line = in.readLine()) != null ) {
                out.add(line);
            }
        } finally {
            try {if(in!=null)in.close();}catch(Exception e){}
        }
        return out;
    }

    public static byte[] readBytes(String filename)
        throws IOException, FileNotFoundException
    {
        File file = new File(filename);
        byte[] out = new byte[(int)file.length()];
        InputStream in = new FileInputStream(file);
        try {
            // read() may return fewer bytes than requested, so loop until the array is full
            int offset = 0;
            while (offset < out.length) {
                int size = in.read(out, offset, out.length - offset);
                if (size == -1) break; // file ended earlier than its reported length
                offset += size;
            }
        } finally {
            in.close();
        }
        return out;
    }

    public static String basename(String path, boolean cutExtension)
    {
        String fname = (new File(path)).getName();
        if (cutExtension) {
            int i = fname.lastIndexOf(".");
            if (i > 0) {
                fname = fname.substring(0,i);
            }
        }
        return fname;
    }

    public static String dirname(String path)
    {
        return (new File(path)).getParent();
    }
}
It is possible for checksums and md5sum to generate false positives. The best way to do it if you have both files right in front of you is a byte by byte comparison. If you are in a Unix environment, simply typing md5sum filename at the console is simpler than writing a program to do it for you. You could download a version for Windows as well.
version 2

package forums;

import krc.io.FileUtilz;

class CompareFiles
{
    public static void main(String[] args) {
        if (args.length != 2) System.exit(2);
        long start = System.currentTimeMillis();
        try {
            String message = FileUtilz.isSameFile(args[0],args[1]) ? "same" : "diff";
            System.out.println(message+" "+args[0]+" "+args[1]);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("took "+(System.currentTimeMillis()-start));
    }
}
package krc.io;

import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.util.Arrays;

public abstract class FileUtilz
{
    public static final int bfrSize = 4096;

    public static boolean isSameFile(String filenameA, String filenameB)
        throws IOException, FileNotFoundException
    {
        File fileA = new File(filenameA);
        File fileB = new File(filenameB);
        //check for same physical file (File.equals alone only compares path strings)
        if( fileA.getCanonicalFile().equals(fileB.getCanonicalFile()) ) return(true);
        //compare sizes
        if( fileA.length() != fileB.length() ) return(false);
        //compare contents (buffer by buffer)
        boolean same=true;
        InputStream inA = null;
        InputStream inB = null;
        try {
            inA = new FileInputStream(fileA);
            inB = new FileInputStream(fileB);
            byte[] bfrA = new byte[bfrSize];
            byte[] bfrB = new byte[bfrSize];
            int sizeA=0, sizeB=0;
            do {
                sizeA = inA.read(bfrA);
                //read the matching chunk from the second file
                sizeB = inB.read(bfrB);
                //note: this assumes both reads return the same count for equal files
                if ( sizeA != sizeB ) {
                    same=false;
                } else if ( sizeA == 0 ) {
                    //do nothing
                } else if ( !Arrays.equals(bfrA,bfrB) ) {
                    same=false;
                }
            } while (same && sizeA != -1);
        } finally {
            if(inA!=null)inA.close();
            if(inB!=null)inB.close();
        }
        return(same);
    }
}
Message was edited by: corlettk
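One caveat with any buffer-by-buffer loop: `InputStream.read(byte[])` is allowed to return fewer bytes than the buffer holds, so two equal files could in principle come back in differently sized chunks. The sketch below is one way around that, assuming Java 9 or later for `readNBytes` and the ranged `Arrays.equals`. The class name `BufferCompare` and its method are hypothetical, not part of the code above.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class BufferCompare {
    static final int BUFFER_SIZE = 4096;

    // Hypothetical helper: compares two files chunk by chunk, tolerating short reads.
    static boolean sameContent(String fileA, String fileB) throws IOException {
        try (InputStream inA = new FileInputStream(fileA);
             InputStream inB = new FileInputStream(fileB)) {
            byte[] bufA = new byte[BUFFER_SIZE];
            byte[] bufB = new byte[BUFFER_SIZE];
            while (true) {
                // readNBytes keeps reading until the buffer is full or the stream ends
                int nA = inA.readNBytes(bufA, 0, BUFFER_SIZE);
                int nB = inB.readNBytes(bufB, 0, BUFFER_SIZE);
                if (nA != nB) {
                    return false; // one file ended before the other
                }
                if (nA == 0) {
                    return true;  // both streams ended without a difference
                }
                // compare only the bytes that were actually read
                if (!Arrays.equals(bufA, 0, nA, bufB, 0, nB)) {
                    return false;
                }
            }
        }
    }
}
```

A size check up front, as in the versions above, is still worth keeping because it avoids opening the files at all in the common "different size" case.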