Tuesday, June 14, 2011

A TESTING CHALLENGE FOR DEDUPLICATION SOFTWARE

Deduplication is one of the relatively recent technologies
used for data backup in order to save space and data communication
bandwidth (see, for example,
www.usenix.org/event/fast08/tech/full_papers/zhu/zhu_html). The
idea is to split the file to be backed up into "chunks" at points
defined by the contents (not by the offsets from the start of the
file), so that when changes insert or delete some portion of the
file the splitting points move with the contents and the unaffected
chunks are preserved. Then for each chunk we compute a hash value,
e.g., the 160-bit SHA-1 checksum, which we store along with the
chunk and use as a key for chunk lookup. If the same key value is
found next time, we assume that the chunk is the same and do not
transmit or store it again.

The risk is that the same hash code may come from a completely
different chunk, but with a well-randomized 160-bit code this
probability is much smaller than that of any other possible
adversity. However, if it does happen we will fail to save the new
chunk, and an attempt at recovery will return the old chunk instead
of the right data.

So, what will happen if a backup provider chooses a smaller key
size, e.g., 32 bits? The probability of hash collisions will be
much higher, and they will probably show up as the number of stored
chunks increases. Let us assume that we already have 2**16 (i.e.,
65536) chunks and we want all the keys to be different. For the
first key there are 2**32 acceptable values out of 2**32; for the
second key to avoid a collision there are just 2**32-1; for the
next key 2**32-2 out of 2**32, and so on. The probability of
avoiding collisions with n keys is therefore the product of
(1-i/2**32) with i ranging from 0 to n-1.

Let us use logarithms. We know that ln(1-i/2**32) is less than
-i/2**32, so the sum of the logarithms is less than -n*(n-1)/2**33.
With n equal to 2**16 this is about -1/2, giving a probability of
no collision of at most exp(-1/2), i.e., about 0.606. This brings
the probability of at least one collision to about 40% with as few
as 64K chunks.
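
The estimate is easy to check numerically; the Python sketch below
(my own, written just for this check) computes both the exact
product and the exp(-n*(n-1)/2**33) bound:

    import math

    n, space = 2**16, 2**32

    # Exact probability that n random 32-bit keys are all different:
    # the product of (1 - i/2**32) for i = 0 .. n-1.
    log_p = sum(math.log1p(-i / space) for i in range(n))
    p_no_collision = math.exp(log_p)

    # The bound derived above: ln(1-x) < -x, so the product stays
    # below exp(-n*(n-1)/2**33).
    bound = math.exp(-n * (n - 1) / 2**33)

    print(p_no_collision)      # about 0.6065
    print(bound)               # about 0.6066
    print(1 - p_no_collision)  # about 0.39: at least one collision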

Can the error be detected? Not necessarily: besides the collision
itself (the right chunk never being stored), the customer also has
to request recovery of a file containing that chunk, and that may
not happen very often (how often depends on whether the customer
uses the backup for emergency recovery or just to put away files
not immediately needed).

So let us assume that someone deploys the deduplication system with
just 32-bit keys. The system will be tested but, most likely, the
number of chunks backed up and recovered in the testing will be
well below 2**16. So the tests will succeed and the system may well
be accepted. But then, in real use, the chunk count will quickly
reach and exceed this value, eventually leading to a mass loss of
stored data.
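
One crude way to see the effect ahead of real use is to simulate
it: hash a stream of random "chunks", keep only 32 bits of each
digest, and count how many runs of 2**16 chunks hit at least one
collision. The sketch below (my own; the chunk sizes and contents
are arbitrary) should report collisions in roughly four runs out
of ten:

    import hashlib
    import os

    def run_once(n_chunks: int) -> bool:
        """True if two distinct chunks got the same 32-bit key."""
        seen = {}
        for _ in range(n_chunks):
            chunk = os.urandom(64)                  # stand-in for a real chunk
            key = hashlib.sha1(chunk).digest()[:4]  # keep only 32 bits
            if key in seen and seen[key] != chunk:
                return True                         # collision: data would be lost
            seen[key] = chunk
        return False

    trials = 50
    hits = sum(run_once(2**16) for _ in range(trials))
    print(f"{hits}/{trials} runs had at least one collision")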

One more question is whether anyone would actually deploy a system
with such short keys. But, believe it or not, I saw a company
contemplating just that.

So the challenge is: what kind of testing can detect this type of fault?

Saturday, June 11, 2011

Hello

Hello! My name is Gregory Tseytin, and I am going to share my thoughts both about my profession (computer science/software engineering/computational linguistics) and about other matters of public interest. See you soon.