Depending on SHA-1 uniqueness

I'm going to encrypt possibly millions of files and store them all in a single virtual directory using the SHA-1 hash as their name. So the question is, may I depend on the following: If the is the same, then the original was the same too - except for a small risk (negligible if compared to a risk of losing the data due to other reasons).

The hash lenght is 128 bits, so theoretically it must be enough for about 2**60 files, but I've heard about problems with a similar system (using .net).

[508 byte] By [Maaartina] at [2007-10-2 21:20:03]
# 1

> I'm going to encrypt possibly millions of files and

> store them all in a single virtual directory using

> the SHA-1 hash as their name. So the question is, may

> I depend on the following: If the is the same, then

> the original was the same too - except for a small

> risk (negligible if compared to a risk of losing the

> data due to other reasons).

Depends on how many files you have!

>

> The hash lenght is 128 bits,

No 160 bits (20 bytes) for SHA-1. Are you thinking of MD5?

> so theoretically it must

> be enough for about 2**60 files, but I've heard about

> problems with a similar system (using .net).

There is an approx. even chance of two or more random file names procucing the same hash when you have 2**80 files. This of course depends on SHA-1 producing a near random hash but the theory of hash function is not well understood.

I worked on a system where a hash of an encryption key was used at the key identifier and I know of no reported hash clashes. This may not be a representative as though there were a lot of key hashes the number did not run to millions and the keys were random bytes in the first place.

sabre150a at 2007-7-14 0:29:30 > top of Java-index,Security,Cryptography...
# 2

Depends on how many files you have!

Currently maybe 1e5, but it must work for all file of a large company as well.

No 160 bits (20 bytes) for SHA-1. Are you thinking of MD5?

No, I just was not thinking at all.

Probably MD5 won't be strong enough, but 20 bytes is impractical, as I need to encode each of them, so with a block length of 16 I'll get 32 bytes, with 12 bytes wastet.

I could use SHA-256 instead, I think with 32 bytes, collision is quite impossible.

This may not be a representative as though there were a lot of key hashes the number did not run to millions and the keys were random bytes in the first place.

With random bytes ran through a pseudorandom function it should be no problem. But my case is a more demanding one.

Maaartina at 2007-7-14 0:29:30 > top of Java-index,Security,Cryptography...
# 3

> Probably MD5 won't be strong enough, but 20 bytes is

> impractical, as I need to encode each of them, so

> with a block length of 16 I'll get 32 bytes, with 12

> bytes wastet.

Rather than hex encoding them, you could Base64 encode them then 20 bytes only goes to 28 bytes and not 40 bytes..

> I could use SHA-256 instead, I think with 32 bytes,

> collision is quite impossible.

But if a 20 byte hash is too long then the 32 bytes of a 256 byte hash must be too long!

>

> This may not be a representative as though there

> were a lot of key hashes the number did not run to

> millions and the keys were random bytes in the first

> place.

>

> With random bytes ran through a pseudorandom function

> it should be no problem. But my case is a more

> demanding one.

I suspect that the company I did the work for would not see your case as more demanding!

sabre150a at 2007-7-14 0:29:30 > top of Java-index,Security,Cryptography...
# 4

Rather than hex encoding them, you could Base64 encode them then 20 bytes only goes to 28 bytes and not 40 bytes..

Sorry, I meant ENCRYPTING.

But if a 20 byte hash is too long then the 32 bytes of a 256 byte hash must be too long!

I didn't express it well as I see.

So, MD5 ist often not considered to be secure enough.

Computing SHA-1 and encrypting it by AES would give 32 bytes, with 12 bytes wastet.

So it would be better to use SHA-256, which would be also 32 bytes wasting nothing.

I suspect that the company I did the work for would not see your case as more demanding!

I meant "demanding on the HASH UNIQUENESS", I know nothing else about the complexity of your work.

Maaartina at 2007-7-14 0:29:31 > top of Java-index,Security,Cryptography...
# 5

> Rather than hex encoding them, you could Base64

> encode them then 20 bytes only goes to 28 bytes and

> not 40 bytes..

>

> Sorry, I meant ENCRYPTING.

Out of interest, why do you need to hash and then encrypt? What makes just hashing inadequate?

<snip>

> I suspect that the company I did the work for

> would not see your case as more demanding!

>

> I meant "demanding on the HASH UNIQUENESS", I know

> nothing else about the complexity of your work.

:-) So did I!

sabre150a at 2007-7-14 0:29:31 > top of Java-index,Security,Cryptography...
# 6

Out of interest, why do you need to hash and then encrypt? What makes just hashing inadequate?

1. If you have a and I see the hash and have the same file, then I know you have it. That's not nice.

2. The (plain) hash will be used as a encryption key. So it must be stored encrypted.

Maaartina at 2007-7-14 0:29:31 > top of Java-index,Security,Cryptography...
# 7
Rather than use a block cipher (AES) why don't you use a steam cipher (rc4?) then your encrypted length will be the same as your hash length.Message was edited by: sabre150
sabre150a at 2007-7-14 0:29:31 > top of Java-index,Security,Cryptography...
# 8
I already use AES for encrypting the content, so I didn't think about other ciphers at all. And I don't know anything about the weaknesses of RC4, while AES is currently the standard (as the name says; but still true, not like DES).
Maaartina at 2007-7-14 0:29:31 > top of Java-index,Security,Cryptography...
# 9

Since I don't realy know what you are trying to do (though I can make a good guess) I don't think I can advise in any detail. If you insist on using AES in block mode then you will encrypt to a multiple of the block length (16 bytes) and if you use a random IV (you will need to or you will not get different encryptions for the same filename) your 20 byte hash (assuming sha-1) will require 48 bytes if you use the most simple approach.

If you prefix your hash with 12 random bytes and use CBC with no padding then you can get away with just having an encrypted result of 32 bytes.

As to whether or not the probability of a collision is too high, you have to make a judgement call. Using the birthday paradox then with sha-1 there will be about an even chance of two or rmore hashes being the same for 2^80 (approx 1,000,000,000,000,000,000,000,000) file names.

If you decide to use sha-256 then you will be better off from the point of view of the birthday paradox but your encrypted hash will require 48 bytes (32 bytes of hash + 16 bytes of random iv).

sabre150a at 2007-7-14 0:29:31 > top of Java-index,Security,Cryptography...
# 10

if you use a random IV (you will need to or you will not get different encryptions for the same filename)

The encryption key K is based not on the filename but on the file CONTENT.

And the encrypted file is stored using a new name N based on another hash of the file content.

And I really wish to get the same K and N if the original file content is the same.

So I use no random IV.

The key K gets encrypted and stored too, and that's how using SHA-1 having 20 bytes plaintext would lead to 32 bytes ciphertext,

and that was the reason I said, I'd prefer SHA-256 (with 32 bytes of plaintext and the same amount of ciphertext).

Using the birthday paradox...

Sure, with a random data or a perfect hashing SHA-1 would be more than good enough.

But in reality I'm not sure....

Maaartina at 2007-7-14 0:29:31 > top of Java-index,Security,Cryptography...