Passive deduplication mature
Since objects are identified by their SHA-256 hash, object stores inherently perform data deduplication. If the same object is submitted twice, it will only get stored once.
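As a minimal sketch of this behaviour, assume a toy in-memory store keyed by the SHA-256 hash of each object. The `ObjectStore` class and its `put`/`get` methods are hypothetical names chosen for illustration, not part of any real store API.

```python
import hashlib


class ObjectStore:
    """Toy in-memory store that addresses objects by their SHA-256 hash."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical bytes always map to the same key, so a second
        # submission of the same object does not add a new entry.
        self.objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


store = ObjectStore()
first = store.put(b"hello world")
second = store.put(b"hello world")   # same bytes submitted again
third = store.put(b"hello worlds")   # different bytes
```

Here `first` and `second` are the same hash and the store holds two objects in total, one for each distinct byte sequence.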
When using encryption, however, two objects are hardly ever the same. In practice, deduplication therefore happens passively when a user reuses objects, either to derive a new version of their data or to share part of their data with friends.
Versioning
Keeping multiple versions of the same data is cheap if these versions share many objects. Similarly, data synchronization is cheaper if both parties already know older versions of the data, since only the objects they are missing need to be transferred.
The document implementation takes advantage of that.
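The saving can be illustrated with a small sketch. It assumes, purely for illustration, that data is split into fixed-size chunks and that each chunk is stored as one object; this is not a description of the document implementation, and `chunk_hashes` is a hypothetical helper.

```python
import hashlib


def chunk_hashes(data: bytes, size: int = 4) -> list:
    """Split data into fixed-size chunks and return the hash of each chunk."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


version1 = chunk_hashes(b"aaaabbbbccccdddd")
version2 = chunk_hashes(b"aaaabbbbXXXXdddd")  # only the third chunk changed

# Objects the store (or the other party) already holds from version 1.
known = set(version1)

# Only objects that are not yet known need to be stored or transferred.
new_objects = [h for h in version2 if h not in known]
```

With these inputs, `version2` consists of four objects, but only one of them is new. The other three are shared with `version1`.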
Deduplication vs. security
When sharing an object (or tree) with friends, a user has two possibilities:
- Send the object's hash to the friends. All friends will access the same object (deduplication), but an attacker can observe with whom the object has been shared.
- Create an individually encrypted copy for each friend. More disk space and bandwidth are required, but the hashes do not reveal with whom the object has been shared.
Hence, there is a trade-off between deduplication and security.
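The two options can be compared in a short sketch. The `toy_encrypt` function below is an assumption made for illustration: a simple XOR stream built from SHAKE-256 that stands in for a real cipher and must not be used to protect actual data. The friend names and helper names are hypothetical as well.

```python
import hashlib
import secrets


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Toy XOR stream cipher (SHAKE-256 keystream). Illustration only,
    # a stand-in for a real cipher and NOT secure for real use.
    keystream = hashlib.shake_256(key).digest(len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


def object_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


plaintext = b"holiday photos"
friends = ["alice", "bob", "carol"]

# Option 1: encrypt once and send the same hash to every friend.
shared_key = secrets.token_bytes(32)
shared_hash = object_hash(toy_encrypt(shared_key, plaintext))
option1 = {friend: shared_hash for friend in friends}

# Option 2: one individually encrypted copy per friend, each with a fresh key.
option2 = {
    friend: object_hash(toy_encrypt(secrets.token_bytes(32), plaintext))
    for friend in friends
}

objects_option1 = len(set(option1.values()))
objects_option2 = len(set(option2.values()))
```

Option 1 stores a single object, but all three friends refer to the same hash, which an observer can correlate. Option 2 stores three objects with unrelated hashes, at three times the storage and bandwidth cost.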
Avoid deterministic encryption keys
Full deduplication of encrypted objects would work if the encryption key were derived from the (unencrypted) content. For example, the SHA-256 hash of the unencrypted object could be used as the encryption key. Such a system, however, has two important security issues:
- An observer would know with whom objects are shared.
- The content of short objects (up to 8–10 bytes of entropy) could be guessed by enumerating all possible byte sequences and checking whether the SHA-256 hash of each candidate decrypts the object.
Both issues could be solved by adding a random 32-byte sequence to each object, but this would in turn render deduplication impossible.
Hence, implementors are strongly advised to use random encryption keys.
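The guessing attack on content-derived keys can be sketched as follows. As an assumption for illustration, `toy_encrypt` is a simple XOR stream built from SHAKE-256 that stands in for a real cipher and is not secure. The helpers `convergent_encrypt` and `guess_content` are hypothetical names, and the "short object" is a 4-digit PIN. Re-encrypting each candidate and comparing ciphertexts is equivalent here to checking whether the candidate's hash decrypts the object.

```python
import hashlib
import itertools
import secrets


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Toy XOR stream cipher (SHAKE-256 keystream). Illustration only,
    # a stand-in for a real cipher and NOT secure for real use.
    keystream = hashlib.shake_256(key).digest(len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


def convergent_encrypt(plaintext: bytes) -> bytes:
    # Deterministic key: the SHA-256 hash of the unencrypted content.
    key = hashlib.sha256(plaintext).digest()
    return toy_encrypt(key, plaintext)


def guess_content(ciphertext: bytes, alphabet: bytes, length: int):
    # Enumerate every candidate and check whether it yields the same ciphertext.
    for candidate in itertools.product(alphabet, repeat=length):
        guess = bytes(candidate)
        if convergent_encrypt(guess) == ciphertext:
            return guess
    return None


secret = b"1234"  # a low-entropy object, e.g. a 4-digit PIN
deterministic = convergent_encrypt(secret)
recovered = guess_content(deterministic, b"0123456789", 4)

# With a random key, the attacker has no candidate key to test.
randomized = toy_encrypt(secrets.token_bytes(32), secret)
```

With the content-derived key, at most 10,000 candidates have to be tried before `recovered` equals the original PIN. The same enumeration gains nothing against `randomized`, because the 32-byte key is independent of the content.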