
Passive deduplication mature

Since objects are identified by their SHA-256 hash, object stores inherently perform data deduplication. If the same object is submitted twice, it will only get stored once.
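The behaviour described above can be sketched with a toy in-memory store. This is only an illustration under simple assumptions: the `ObjectStore` class and its `put` and `get` methods are hypothetical names, not part of any actual object store API.

```python
import hashlib


class ObjectStore:
    """Toy in-memory object store keyed by the SHA-256 hash of each object."""

    def __init__(self):
        self.objects = {}  # hex hash -> object bytes

    def put(self, data: bytes) -> str:
        # The hash of the bytes is the object's identifier.
        key = hashlib.sha256(data).hexdigest()
        # Submitting the same bytes again maps to the same key,
        # so nothing new is stored.
        self.objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


store = ObjectStore()
first = store.put(b"hello world")
second = store.put(b"hello world")   # same bytes, same hash
third = store.put(b"hello world!")   # different bytes, new object
```

Here `first` and `second` are the same identifier, and the store holds two objects rather than three.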

When using encryption, however, objects are hardly ever the same, even if their unencrypted content is identical. In practice, deduplication therefore happens passively when a user reuses existing objects, either to derive a new version of their data or to share part of their data with friends.

Versioning

Keeping multiple versions of the same data is cheap if these versions share many objects. Similarly, data synchronization is cheaper if both parties already know older versions of the data, since only the objects they do not yet have need to be transferred.

The document implementation takes advantage of that.
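As a rough illustration of why shared objects make versioning and synchronization cheap, the sketch below stores two versions of some data as lists of object hashes. It is not the document implementation mentioned above. The `put` and `put_version` helpers are hypothetical, and the byte strings stand in for objects that are reused unchanged between versions.

```python
import hashlib

store = {}  # hex hash -> object bytes


def put(data: bytes) -> str:
    """Store an object under its SHA-256 hash and return the hash."""
    h = hashlib.sha256(data).hexdigest()
    store.setdefault(h, data)
    return h


def put_version(objects):
    """Store every object of a version and return the list of their hashes."""
    return [put(obj) for obj in objects]


v1 = put_version([b"chapter 1", b"chapter 2", b"chapter 3"])
v2 = put_version([b"chapter 1", b"chapter 2 (revised)", b"chapter 3"])

# Objects present in both versions are stored only once.
shared = set(v1) & set(v2)

# A party that already knows v1 only needs the objects that are new in v2.
known = set(v1)
to_transfer = [h for h in v2 if h not in known]
```

Both versions reference three objects each, yet the store holds only four objects, and synchronizing from `v1` to `v2` requires transferring a single object.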

Deduplication vs. security

When sharing an object (or tree) with friends, a user has two possibilities:

Hence, there is a trade-off between deduplication and security.

Avoid deterministic encryption keys

Full deduplication of encrypted objects would work if the encryption key were derived from the (unencrypted) content. For example, the SHA-256 hash of the unencrypted object could be used as the encryption key. Such a system, however, has two important security issues:

Both issues could be solved by adding a random 32-byte sequence to each object, which would in turn render deduplication impossible.

Hence, implementors are strongly advised to use random encryption keys.
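The difference between content-derived and random keys can be made concrete with a small sketch. It assumes a toy cipher, `toy_encrypt`, that XORs the data with a keystream built from SHA-256. This stand-in is hypothetical, is not a secure cipher, and is used only because it needs nothing beyond the standard library. A real system would use a proper cipher.

```python
import hashlib
import os


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Illustration only, NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    # zip stops at the end of data, so surplus keystream bytes are ignored.
    return bytes(a ^ b for a, b in zip(data, stream))


def object_id(ciphertext: bytes) -> str:
    """Identifier of a stored object: the SHA-256 hash of its encrypted bytes."""
    return hashlib.sha256(ciphertext).hexdigest()


content = b"the same document, submitted by two users"

# Deterministic key: derived from the unencrypted content.
det_key = hashlib.sha256(content).digest()
det_a = object_id(toy_encrypt(det_key, content))
det_b = object_id(toy_encrypt(det_key, content))

# Random key: a fresh 32-byte key for every encryption.
rnd_a = object_id(toy_encrypt(os.urandom(32), content))
rnd_b = object_id(toy_encrypt(os.urandom(32), content))

# With a deterministic key, anyone who can guess the content can recompute
# the identifier of the encrypted object without being given any key.
guess_id = object_id(toy_encrypt(hashlib.sha256(content).digest(), content))
```

With the deterministic key, both submissions yield the same identifier (`det_a == det_b`), so the store deduplicates them, but `guess_id` matches as well. With random keys, `rnd_a` and `rnd_b` differ, so nothing is deduplicated and the identifier reveals nothing about the content.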