Forward-moving data structures

Technically, any type of data can be stored in a Condensation tree. However, conflict-free synchronization is possible only with forward-moving data structures, where data moves forward according to some defined order.

As an example, consider a document that is being modified. Each modification yields a new version of the document, but we only keep the most recent (newest) version:

That is a very simple forward-moving data structure. The modification date assigns an order to the documents, and within that order, we can only move forward.

Whenever we see different versions of the document, we simply pick the most recent version. Modifying the document yields a new version with a newer modification date, which automatically replaces the previous version. To revert to an old version of the document, we simply have to re-save the old version with a newer modification date.

Keeping just the most recent version of something yields a maximum-merge structure, which is a special case of a union-forget structure.

Formal definition

A forward-moving data structure with a merge function ⊗ and a comparison function ≤ has the following properties:

Commutativity: A ⊗ B = B ⊗ A
Associativity: (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)
Idempotency: A ⊗ A = A
Order: A ⊗ B = B ⇔ A ≤ B

Commutativity ensures that two versions of the data can be merged in either way, while associativity says that different versions can be merged in any order. That's important in a distributed setting where no global total order is imposed.

Itempotency ensures that merging a version with itself does not change anything. This means that we can merge the same version as often as we want, and do not need to keep track of what has been merged, and what not:

A ⊗ B ⊗ B ⊗ ... ⊗ B = A ⊗ B

This also ensures that two versions with a common past are merged correctly:

D = B ⊗ C
  = (A ⊗ ΔB) ⊗ (A ⊗ ΔC)
  = A ⊗ A ⊗ ΔB ⊗ ΔC
  = A ⊗ ΔB ⊗ ΔC

The order ensures that merging is forward-moving with respect to the defined order, and can also be written as follows:

A ⊗ B = C   ⇔   A ≤ C and B ≤ C

Mathematical background

Forward-mergeability is closely related to the mathematical concept of semilattices. A forward-moving data structure forms a join-semilattice, and any version of the data is an element of the semilattice (set). Merging two versions corresponds to the join operation of the semilattice.

Maximum

Among the simplest forward-moving structures are numbers merged by taking their maximum:

A = 3
B = 15
A ⊗ B = max(3, 15) = 15

It is obvious that the result can never be smaller than any input value. Hence, merging never moves the state — here expressed as a number — backwards.

Boolean OR

Similarly, boolean values are forward-moving under the OR function and the order [○ false → ● true]:

A ⊗ B = A or B	○	●	●	●
A	○	●	○	●
B	○	○	●	●

In fact, OR-merging can be mapped to maximum-merging by mapping false to the number 0, and true to 1.

Similarly, AND is forward-moving under the order [true → false].

XOR, however, is not forward-moving. It trivially violates idempotency:

A ⊗ A = false ≠ A

For the same reason, most mathematical operations (addition, subtraction, ...) are not forward-moving.

Set union

Sets are forward-moving under union and set size.

A = {apple, banana}
B = {banana, lemon}
A ⊗ B = {apple, banana} ∪ {banana, lemon} = {apple, banana, lemon}

When unioning sets, the result necessarily has at least as many elements as any input set. Thus, we can never move backwards to a set with fewer elements.

Versioning systems such as git use union-merge for the revisions: they keep them all.

Maximum revisited

Maximum-merge can be mapped to union-merge by putting all numbers into a set, and considering the highest number only:

A = {3}
B = {15}
A ⊗ B = {3, 15}
C = {4, 17}
A ⊗ B ⊗ C = {3, 4, 15, 17}

This is conceptually equivalent, but not very efficient. The number 4, for instance, never makes any difference but is stored anyway. Nobody would notice if we "forgot" it. The end result — 17 — would still be the same.

Union-forget structures

This leads to union-forget structures. Such structures are based on union-merge and a set of carefully crafted rules defining which values can be forgotten.

A more general version of maximum-merge is top-N-merge, which keeps the N highest values only. Top-3-merge would work as follows:

A = {3, 5, 16}
B = {8, 9, 15}
A ⊗ B = {3, 5, 8, 9, 15, 16}

Similarly, one can forget information more than a week older than the most recent date:

A = {2018-02-19, 2018-02-24}
B = {2018-02-16, 2018-02-22}
A ⊗ B = {2018-02-16, 2018-02-19, 2018-02-22, 2018-02-24}

In a practical application, each timestamp would carry some information along with it.

Broom merge

Broom merge is a combination of a union-merged data set with timestamps, and a maximum-merged timestamp (the broom). No data item can be older than (behind) the timestamp – a concept closely related to that of the broom wagon used in cycling road races or other competitions.

Broom merge can be used to merge accounting data using a merge window, for example:

2018-02-21 → +5
2018-02-18 → -2
2018-02-12 → +1
2018-02-11 → 229 (broom)

The entry at the broom summarizes the history up to that point, while all transactions newer than the broom are stored individually.

Merging two versions keeps the newest broom, and all transactions equal or newer than that broom:

2018-02-21 → +5
2018-02-17 → -4
2018-02-12 → +1
2018-02-10 → +10
2018-02-08 → 219 (broom)

Transactions older than the newest broom are assumed to be included in the entry at the broom. If that is not the case, these transactions are lost.

The broom window, i.e. the time until which individual transactions are kept, is a trade-off:

Using a large window decreases the probability of losing unmerged transactions.
Using a small window saves storage space, and time to load the data.

The ideal trade-off is application-dependent, but a window of about 2-4 months is often enough in practice. Devices not synchronized for more than 2 months have often been lost, or those changes have been forgotten about by the user.