Comparing two files in Java

Hi, I want to be able to compare two files and say whether they are the same or not. How can I do this in Java? The files could be of any format. Any pointers on this? Thanks.
[180 byte] By [viddhua] at [2007-11-27 5:11:55]
# 1
Use a checksum algorithm. Try googling it.
New_Kida at 2007-7-12 10:32:27 > top of Java-index,Java Essentials,Java Programming...
# 2
Define: when are two files the same? Thinking about and specifying that should be pointer enough.
CeciNEstPasUnProgrammeura at 2007-7-12 10:32:27
# 3

Two files are the same if they have the same content.

Finding out whether two files are the same would entail actually comparing the contents of the two files. If there is some library function I can use, or some code that I can borrow, it would save me time and effort. Also, what about non-text files? How does one go about testing them for equality?

viddhua at 2007-7-12 10:32:27
# 4
CHECKSUM buddy..
New_Kida at 2007-7-12 10:32:27
# 5

I'd take a look at the DataInputStream class.

If two files have different checksums, they are different. If two files have the same checksum, they may or may not have the same contents. To be sure, you will have to compare byte by byte.

You also may look for a way to compare the lengths first; if they have different sizes, they can't have the same contents.
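To see why equal checksums alone prove nothing, here is a small sketch. The additive checksum below is a deliberately weak toy invented for this illustration (it is not CRC or MD5), and the class and method names are made up. Real checksums collide far less often, but the principle is the same.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the additive checksum is a toy, not a real algorithm.
class ChecksumCollisionDemo {

    // Toy checksum: sum of all byte values, truncated to 8 bits.
    static int toyChecksum(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            sum = (sum + (b & 0xff)) & 0xff;
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] a = "ab".getBytes(StandardCharsets.US_ASCII);
        byte[] b = "ba".getBytes(StandardCharsets.US_ASCII);
        // Same bytes in a different order: the sums match, the contents do not.
        System.out.println("same checksum: " + (toyChecksum(a) == toyChecksum(b))); // prints true
        System.out.println("same contents: " + Arrays.equals(a, b));               // prints false
    }
}
```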

OleVVa at 2007-7-12 10:32:27
# 6

> I'd take a look at the DataInputStream class.
>
> If two files have different checksums, they are
> different. If two files have the same checksum, they
> may or may not have the same contents. To be sure,
> you will have to compare byte by byte.
>
> You also may look for a way to compare the lengths
> first; if they have different sizes, they can't have
> the same contents.

If it were really that easy, then quite a few people were fools to write all these algorithms: http://en.wikipedia.org/wiki/Check_sum

New_Kida at 2007-7-12 10:32:27
# 7
> You also may look for a way to compare the lengths
> first; if they have different sizes, they can't have
> the same contents.

File.length() can do that.
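A minimal sketch of that length pre-check, assuming plain java.io.File. The class and method names are made up for illustration. Note that File.length() returns 0 for a file that does not exist, so a real program would probably check exists() first.

```java
import java.io.File;

// Hypothetical helper; the class and method names are made up for this sketch.
class LengthCheck {

    // Returns true when the two files certainly differ because their sizes differ.
    // A false result proves nothing: same-sized files may still differ in content.
    static boolean differInLength(File a, File b) {
        return a.length() != b.length();
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java LengthCheck fileA fileB");
            return;
        }
        File a = new File(args[0]);
        File b = new File(args[1]);
        System.out.println(differInLength(a, b)
                ? "different sizes, so different contents"
                : "same size, contents still need comparing");
    }
}
```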
OleVVa at 2007-7-12 10:32:27
# 8

Here's how I'd do it:

A and B are files.

    open A and B
    if (A.length != B.length) return false
    read files A and B into a couple of byte arrays AA and BB
    if (AA.checksum != BB.checksum) return false
    if (mode == SLOW) {
        if (AA.contents != BB.contents) return false
    }
    return true

notes:

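* Here's a fast way to slurp a file into a byte array.

    long start = System.currentTimeMillis();
    byte[] bufferFS = new byte[(int)file.length()];
    InputStream fis = new FileInputStream(file);
    //ByteArrayOutputStream baos = new ByteArrayOutputStream();
    //a single read() may return fewer bytes than requested, so loop until the buffer is full
    int size = 0;
    while (size < bufferFS.length) {
        int n = fis.read(bufferFS, size, bufferFS.length - size);
        if (n == -1) break; //file ended early
        size += n;
    }
    //baos.write(bufferFS, 0, size);
    fis.close();
    //baos.close();
    System.out.println("size fis read: " + (System.currentTimeMillis()-start) );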

* I just googled "java checksum" and found an abundance. This was top of the list http://www.rgagnon.com/javadetails/java-0416.html and it looks pretty good to me
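For reference, a checksum can also be computed with nothing but the JDK. This is a sketch using java.util.zip.CRC32, not the code from the linked page; the class and method names are made up. It takes any InputStream, so for a file you would pass a FileInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper class for this sketch.
class Crc32Sketch {

    // Reads the stream to the end in 4 KB chunks and returns its CRC-32 value.
    static long crc32(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            crc.update(buffer, 0, n);
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "123456789".getBytes(StandardCharsets.US_ASCII));
        // "123456789" is the usual CRC-32 test input; prints cbf43926
        System.out.println(Long.toHexString(crc32(in)));
    }
}
```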

corlettka at 2007-7-12 10:32:27
# 9
Why reinvent the wheel? Checking two files of, say, 100 MB each byte by byte will be a major hog on the system.
New_Kida at 2007-7-12 10:32:27
# 10

> CHECKSUM buddy..

You keep interjecting with this with obviously no clue of what doing this involves.

It's a dumb idea.

Here's why.

- to create a checksum for the file you need to read ALL the bytes of the file. So this method gains nothing over comparing each byte anyway.

- as mentioned, there is a possibility, however remote, that different files could have the same checksum. While this is unlikely, it does mean that this method is less accurate than comparing each byte.

So really if you have a suggestion great. But don't keep harping on things especially when you don't know what you are talking about.

cotton.ma at 2007-7-12 10:32:27
# 11

> Why re invent the wheel...
> checking two files say a 100 MB in size byte by byte
> will be a major hog on the system.

Again with this.

You should learn about what you are talking about before you start making negative comments about other code.

Your method is also much worse for performance.

Given two files.

1) you compare lengths first, if different then they are different

2) compare byte by byte in both files. as soon as you hit a different byte you are done

With your method you MUST read all of both files.
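The two steps above can be sketched roughly like this. It is only an illustration with made-up class and method names: the length check comes first, then a buffered byte-by-byte loop that stops at the first difference.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class for this sketch.
class ByteCompare {

    static boolean sameContent(File a, File b) throws IOException {
        // Step 1: different lengths means different contents, no reading needed.
        if (a.length() != b.length()) {
            return false;
        }
        // Step 2: compare byte by byte, returning as soon as a byte differs.
        try (InputStream inA = new BufferedInputStream(new FileInputStream(a));
             InputStream inB = new BufferedInputStream(new FileInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false;
                }
            }
            // A ended; the files match only if B has ended too.
            return inB.read() == -1;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ByteCompare fileA fileB");
            return;
        }
        System.out.println(sameContent(new File(args[0]), new File(args[1])) ? "same" : "diff");
    }
}
```

In the best case (different sizes) nothing is read at all, and in the worst case (identical files) each file is read exactly once, which is no more than the checksum approach has to read.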

cotton.ma at 2007-7-12 10:32:27
# 12
If I'm wrong, then sorry. We had used checksums to verify the similarity of files, so I got stuck on that idea. But then what's the real use of a checksum? There seem to be quite a lot of algorithms here.
New_Kida at 2007-7-12 10:32:27
# 13

> But then wats the real use of check sum... there
> seems to be quite a lot of algorithms here ?

Yes checksums and other hashes have their uses but just not in this application.

If you already had a checksum for the files then comparing checksums would obviously be quicker.

For example I have a server with a file which I have already computed the checksum for. You download the file from my server and then get the checksum I got as well. Then you compute the checksum on your side. Now you compare my checksum and yours, if they are the same then you can be relatively certain the file was not corrupted during the download process.

In that model getting the checksum is better than having to compare byte by byte over the network.
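A rough sketch of the download case, assuming the server publishes an MD5 digest as a hex string. The class and method names are hypothetical, and the bytes are passed in directly to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for this sketch.
class DownloadCheck {

    // Returns the MD5 digest of the data as 32 lowercase hex characters.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Compares the locally computed digest with the one the server published.
    static boolean matchesPublished(byte[] downloaded, String publishedHex)
            throws NoSuchAlgorithmException {
        return md5Hex(downloaded).equalsIgnoreCase(publishedHex.trim());
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] downloaded = "abc".getBytes(StandardCharsets.US_ASCII);
        String published = "900150983cd24fb0d6963f7d28e17f72"; // MD5 of "abc"
        System.out.println(matchesPublished(downloaded, published)); // prints true
    }
}
```

Only the short digest crosses the network a second time, not the file itself.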

Another use for file checksums is in intrusion detection systems on servers. In this model, checksums for all (or some) files that are not expected to change on the system are calculated and stored. At scheduled intervals the checksums for these files are recalculated and compared with the stored values. If the checksums are different then the files have changed (which may in this case indicate that an intruder has broken into the system or that the system has been compromised by a virus or trojan).
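The store-then-recheck idea might look something like the sketch below. It is not taken from any real intrusion detection tool: the class and method names are invented, SHA-256 is swapped in for a plain checksum, and Files.readAllBytes loads each file fully into memory, which is only reasonable for small files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for this sketch.
class IntegrityBaseline {

    // SHA-256 digest of the whole file.
    static byte[] digest(Path file) throws IOException, NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
    }

    // Records the current digest of every file that is not expected to change.
    static Map<Path, byte[]> snapshot(List<Path> files)
            throws IOException, NoSuchAlgorithmException {
        Map<Path, byte[]> baseline = new LinkedHashMap<>();
        for (Path file : files) {
            baseline.put(file, digest(file));
        }
        return baseline;
    }

    // Recomputes each digest and returns the files whose digest no longer matches.
    static List<Path> changedSince(Map<Path, byte[]> baseline)
            throws IOException, NoSuchAlgorithmException {
        List<Path> changed = new ArrayList<>();
        for (Map.Entry<Path, byte[]> entry : baseline.entrySet()) {
            if (!MessageDigest.isEqual(entry.getValue(), digest(entry.getKey()))) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

A scheduled job would call snapshot once, keep the result somewhere safe, and then call changedSince at each interval.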

Anyway, there are other uses for file checksums and other hashing type algorithms for sure, just in this case it's redundant.

cotton.ma at 2007-7-12 10:32:27
# 14
yikes

Message was edited by: New_Kid
New_Kida at 2007-7-12 10:32:27
# 15
> yikes

?
cotton.ma at 2007-7-21 21:21:56
# 16

cotton (shock horror) was right. It's faster just to compare the two files byte by byte, even if you slurp the whole of both files. It would be quicker again to compare the two files buffer by buffer... but I'll leave version 2 up to you.

package forums;

import java.io.File;
import java.util.Arrays;
import krc.io.Md5Utilz;
import krc.io.FileUtilz;

class CompareFiles
{
    public static void main(String[] args) {
        if (args.length != 2) System.exit(2);
        long start = System.currentTimeMillis();
        int retval = 1;
        try {
            File a = new File(args[0]);
            File b = new File(args[1]);
            if ( a.equals(b) ) {
                System.out.println("a and b are the same physical file "+a.getCanonicalPath());
            } else if ( a.length() != b.length() ) {
                System.out.println("a and b are different sizes.");
            //} else if ( !Md5Utilz.getChecksum(args[0]).equals(Md5Utilz.getChecksum(args[1])) ) {
            //    System.out.println("a and b have different checksums.");
            } else if ( !Arrays.equals(FileUtilz.readBytes(args[0]), FileUtilz.readBytes(args[1])) ) {
                System.out.println("a and b have different contents.");
            } else {
                System.out.println("a and b are the same.");
                retval = 0;
            }
        } catch (Exception e) {
            e.printStackTrace();
            retval = 2;
        }
        System.out.println("took "+(System.currentTimeMillis()-start));
        System.exit(retval);
    }
}

package krc.io;

import java.io.InputStream;
import java.io.FileInputStream;
import java.security.MessageDigest;

//http://www.rgagnon.com/javadetails/java-0416.html
//http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/Checksum.html
public abstract class Md5Utilz {

    private static byte[] createMd5Checksum(String filename)
        throws Exception
    {
        byte[] bfr = new byte[4096];
        InputStream fis = new FileInputStream(filename);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            int size;
            do {
                size = fis.read(bfr);
                if(size > 0) md5.update(bfr, 0, size);
            } while (size != -1);
            return md5.digest();
        } finally {
            fis.close(); //close the stream even if reading fails
        }
    }

    public static String getChecksum(String filename)
        throws Exception
    {
        byte[] b = createMd5Checksum(filename);
        StringBuffer sb = new StringBuffer(b.length * 2);
        for (int i=0; i<b.length; i++) {
            // (b & 0xff) + 0x100 is "1xx" in hex; substring(1) drops the leading 1,
            // leaving exactly two hex digits per byte
            sb.append(Integer.toString( (b[i] & 0xff) + 0x100, 16).substring( 1 ) );
        }
        return sb.toString();
    }
}

package krc.io;

import java.util.Collection;
import java.util.List;
import java.util.ArrayList;
import java.io.File;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.FileInputStream;

public abstract class FileUtilz
{
    public static boolean verboseMode = false;

    public static void writeFile(String content, String filename)
        throws IOException
    {
        PrintWriter out = null;
        try {
            out = new PrintWriter(new FileWriter(filename));
            out.write(content);
        } finally {
            try {if(out!=null)out.close();}catch(Exception e){}
        }
    }

    public static String readFile(String filename)
        throws IOException, FileNotFoundException
    {
        FileReader in = null;
        StringBuffer out = new StringBuffer();
        try {
            in = new FileReader(filename);
            char[] cbuf = new char[4096];
            int n = in.read(cbuf, 0, 4096);
            while(n > 0) {
                //append only the n chars actually read, not the whole buffer
                out.append(cbuf, 0, n);
                n = in.read(cbuf, 0, 4096);
            }
        } finally {
            try {if(in!=null)in.close();}catch(Exception e){}
        }
        return out.toString();
    }

    public static String[] readFileIntoArray(String filename)
        throws IOException, FileNotFoundException
    {
        return readFileIntoList(filename).toArray(new String[0]);
    }

    public static List<String> readFileIntoList(String filename)
        throws IOException, FileNotFoundException
    {
        BufferedReader in = null;
        List<String> out = new ArrayList<String>();
        try {
            in = new BufferedReader(new FileReader(filename));
            String line = null;
            while ( (line = in.readLine()) != null ) {
                out.add(line);
            }
        } finally {
            try {if(in!=null)in.close();}catch(Exception e){}
        }
        return out;
    }

    public static byte[] readBytes(String filename)
        throws IOException, FileNotFoundException
    {
        //start = System.currentTimeMillis();
        File file = new File(filename);
        byte[] out = new byte[(int)file.length()];
        InputStream in = new FileInputStream(file);
        int size = 0;
        try {
            //a single read() may return fewer bytes than requested, so loop until the buffer is full
            while (size < out.length) {
                int n = in.read(out, size, out.length - size);
                if (n == -1) break; //file ended early
                size += n;
            }
        } finally {
            in.close();
        }
        //System.out.println("readBytes("+filename+"="+size+") took "+(System.currentTimeMillis()-start));
        return out;
    }

    public static String basename(String path, boolean cutExtension)
    {
        String fname = (new File(path)).getName();
        if (cutExtension) {
            int i = fname.lastIndexOf(".");
            if (i > 0) {
                fname = fname.substring(0,i);
            }
        }
        return fname;
    }

    public static String dirname(String path)
    {
        return (new File(path)).getParent();
    }
}

corlettka at 2007-7-21 21:21:56
# 17

It is possible for checksums and md5sum to generate false positives. The best way to do it if you have both files right in front of you is a byte by byte comparison. If you are in a unix environment, simply typing md5sum filename at the console is simpler than writing a program to do it for you. You could download a version for windows as well.

robtafta at 2007-7-21 21:21:56
# 18

version 2

package forums;

import krc.io.FileUtilz;

class CompareFiles
{
    public static void main(String[] args) {
        if (args.length != 2) System.exit(2);
        long start = System.currentTimeMillis();
        try {
            String message = FileUtilz.isSameFile(args[0],args[1]) ? "same" : "diff";
            System.out.println(message+" "+args[0]+" "+args[1]);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("took "+(System.currentTimeMillis()-start));
    }
}

package krc.io;

import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.util.Arrays;

public abstract class FileUtilz
{
    public static final int bfrSize = 4096;

    public static boolean isSameFile(String filenameA, String filenameB)
        throws IOException, FileNotFoundException
    {
        //start = System.currentTimeMillis();
        File fileA = new File(filenameA);
        File fileB = new File(filenameB);
        //check for same pathname (File.equals compares pathnames, not canonical paths)
        if( fileA.equals(fileB) ) return(true);
        //compare sizes
        if( fileA.length() != fileB.length() ) return(false);
        //compare contents (buffer by buffer)
        boolean same=true;
        InputStream inA = null;
        InputStream inB = null;
        try {
            inA = new FileInputStream(fileA);
            inB = new FileInputStream(fileB);
            byte[] bfrA = new byte[bfrSize];
            byte[] bfrB = new byte[bfrSize];
            int sizeA=0, sizeB=0;
            do {
                //assumes both reads return the same count for same-sized files;
                //read() is allowed to return fewer bytes than the buffer holds
                sizeA = inA.read(bfrA);
                sizeB = inB.read(bfrB);
                if ( sizeA != sizeB ) {
                    same=false;
                } else if ( sizeA == 0 ) {
                    //do nothing
                } else if ( !Arrays.equals(bfrA,bfrB) ) {
                    same=false;
                }
            } while (same && sizeA != -1);
        } finally {
            if(inA!=null)inA.close();
            if(inB!=null)inB.close();
        }
        return(same);
    }
}

Message was edited by: corlettk

corlettka at 2007-7-21 21:21:56