Records as generic data format (draft)

Condensation uses records as its generic data encoding format.

Records are trees of byte sequences with optional hashes. Trees are extremely versatile: dictionaries, tables, n-dimensional arrays, and pretty much any other data structure can be mapped to a tree fairly straightforwardly. In addition, different data structures can easily be nested within each other.
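As an illustration of such a mapping, the following minimal Python sketch turns a nested dictionary into a tree of byte sequences. The Record class, its field names, and the from_dict helper are hypothetical and exist only for this example; they are not part of any Condensation library, and the convention used here (key as a node, value as its child) is just one of several possible mappings.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical tree node: a byte sequence, an optional hash, and children."""
    data: bytes = b""
    hash: Optional[bytes] = None
    children: list[Record] = field(default_factory=list)

    def add(self, data: bytes, hash: Optional[bytes] = None) -> Record:
        """Append a child node and return it."""
        child = Record(data, hash)
        self.children.append(child)
        return child


def from_dict(mapping: dict) -> Record:
    """Map a dictionary to a tree: each key becomes a node, its value the child(ren)."""
    root = Record()
    for key, value in mapping.items():
        node = root.add(key.encode("utf-8"))
        if isinstance(value, dict):
            # Nested dictionaries simply become nested subtrees.
            node.children = from_dict(value).children
        elif isinstance(value, list):
            # A list becomes one child per item.
            for item in value:
                node.add(str(item).encode("utf-8"))
        else:
            node.add(str(value).encode("utf-8"))
    return root


tree = from_dict({"name": "Ada", "tags": ["a", "b"], "pos": {"x": 1}})
```

In this sketch all values are stored as UTF-8 text for simplicity. Since a node can hold any byte sequence, an application could just as well store numbers or binary data in their native form.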

A tree node can hold an arbitrary byte sequence. Binary data therefore does not need to be recoded or transformed, saving both disk space and complexity.

Serialization overhead

Records are serialized as a list of nodes. The node overhead depends on the length of the byte sequence:

Length of byte sequence	Overhead
0‒29 bytes	1 byte
30 bytes	2 bytes, 6.7 %
⁝	⁝
285 bytes	2 bytes, 0.7 %
286 bytes or more	9 bytes, < 3.15 %

A hash adds 36 bytes: 32 header bytes and 4 data bytes to store the hash index.

Since the root node's content is not stored, a record with no child nodes is serialized to a zero-length byte sequence.
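The figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for this example, and a record is modeled simply as a list of (length, has_hash) pairs for its non-root nodes.

```python
def node_overhead(length: int, has_hash: bool = False) -> int:
    """Overhead in bytes for one node, following the table above."""
    if length < 0:
        raise ValueError("length must not be negative")
    if length <= 29:
        overhead = 1
    elif length <= 285:
        overhead = 2
    else:
        overhead = 9
    if has_hash:
        overhead += 36  # 32 header bytes + 4 bytes for the hash index
    return overhead


def overhead_percent(length: int) -> float:
    """Overhead relative to the length of the byte sequence (length > 0)."""
    return 100 * node_overhead(length) / length


def record_size(nodes: list[tuple[int, bool]]) -> int:
    """Total bytes for all non-root nodes; the root's content is not stored."""
    return sum(length + node_overhead(length, has_hash) for length, has_hash in nodes)


print(round(overhead_percent(30), 1))   # 6.7
print(round(overhead_percent(285), 1))  # 0.7
print(record_size([]))                  # 0, a record with no child nodes
```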

Simplicity

The serialization scheme is very simple and has virtually no error states. In common programming languages, the source code for serializing a record is about 30 lines long, and deserialization is equally short.
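To give a feel for how short such code can be, here is a Python sketch of a serializer and deserializer. The byte layout is an assumption made for this example: it was chosen only to be consistent with the overhead table above (one flags byte, one extra length byte for 30 to 285 bytes, eight extra length bytes beyond that, and a 4-byte hash index), and it is not claimed to be the actual Condensation wire format. The Node class and the flag bit assignments are likewise hypothetical.

```python
from __future__ import annotations

import struct
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    data: bytes = b""
    hash: Optional[bytes] = None
    children: list[Node] = field(default_factory=list)


def serialize(root: Node) -> tuple[bytes, list[bytes]]:
    """Return the node bytes and the list of hashes (assumed to live in a header)."""
    out = bytearray()
    hashes: list[bytes] = []

    def write_children(parent: Node) -> None:
        for i, node in enumerate(parent.children):
            n = len(node.data)
            # Assumed flags byte: bits 0-4 length code, 0x20 hash, 0x40 children, 0x80 sibling.
            flags = n if n < 30 else (30 if n < 286 else 31)
            if node.hash is not None:
                flags |= 0x20
            if node.children:
                flags |= 0x40
            if i + 1 < len(parent.children):
                flags |= 0x80
            out.append(flags)
            if 30 <= n < 286:
                out.append(n - 30)
            elif n >= 286:
                out.extend(struct.pack(">Q", n))
            out.extend(node.data)
            if node.hash is not None:
                out.extend(struct.pack(">I", len(hashes)))
                hashes.append(node.hash)
            if node.children:
                write_children(node)

    write_children(root)  # the root's own content is not stored
    return bytes(out), hashes


def deserialize(blob: bytes, hashes: list[bytes]) -> Node:
    root = Node()
    pos = 0

    def read_children(parent: Node) -> None:
        nonlocal pos
        while True:
            flags = blob[pos]
            pos += 1
            code = flags & 0x1F
            if code == 30:
                n = 30 + blob[pos]
                pos += 1
            elif code == 31:
                (n,) = struct.unpack_from(">Q", blob, pos)
                pos += 8
            else:
                n = code
            data = blob[pos:pos + n]
            pos += n
            node_hash = None
            if flags & 0x20:
                (index,) = struct.unpack_from(">I", blob, pos)
                pos += 4
                node_hash = hashes[index]
            node = Node(data, node_hash)
            parent.children.append(node)
            if flags & 0x40:
                read_children(node)
            if not flags & 0x80:
                return

    if blob:
        read_children(root)
    return root
```

The sketch omits bounds checks on truncated input, which a real implementation would need when reading untrusted data.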

Ease of use

In general, working with records is significantly easier than working with text files. TBD: no parsing, just traverse a tree, similarly simple, generic editor possible

Comparison

In general, records are more efficient than common text file formats, even if the data to encode is primarily text. The node overhead is competitive with line breaks or field delimiters, which usually take 1‒2 characters (bytes). In addition, records do not need quote or escape characters, and can store numbers and binary data more compactly.
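A rough back-of-the-envelope comparison in Python illustrates the point about numbers and binary data. The assumptions are labeled in the code: the text format is taken to spend one delimiter byte per field and to Base64-encode binary data, and the integer is stored as a minimal big-endian byte sequence, which is one plausible choice rather than a statement about how Condensation encodes integers. The per-node overhead follows the table in the serialization section.

```python
import base64


def node_overhead(length: int) -> int:
    """Per-node overhead in bytes (1, 2, or 9), as in the overhead table."""
    return 1 if length <= 29 else 2 if length <= 285 else 9


def minimal_unsigned(value: int) -> bytes:
    """Assumed encoding: big-endian without leading zero bytes (0 becomes b"")."""
    return value.to_bytes((value.bit_length() + 7) // 8, "big")


# A number: decimal digits plus one delimiter vs. raw bytes plus node overhead.
number = 1234567
encoded = minimal_unsigned(number)
text_cost = len(str(number)) + 1
node_cost = len(encoded) + node_overhead(len(encoded))

# 32 bytes of binary data: Base64 text plus one delimiter vs. raw bytes in a node.
blob = bytes(range(32))
text_blob_cost = len(base64.b64encode(blob)) + 1
node_blob_cost = len(blob) + node_overhead(len(blob))

print(text_cost, node_cost)            # 8 4
print(text_blob_cost, node_blob_cost)  # 45 34
```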

Structurally, records are somewhat similar to XML. Record nodes are significantly simpler and purer than XML elements, however. With XML, simple values can be stored as attributes of a tag, as a text node, or even as the tag name of a child element. Records avoid this redundancy: simple values are always stored as the byte sequence of a node.

Record encoding shares some similarity with X.690 (ASN.1, BER, CER, DER), but is considerably thinner and simpler. No data types (class tags) are stored, as the interpretation of the byte sequences is left to the application or protocol. Length encoding is simpler, too. Constructed tags are not necessary, since any node can contain any number of child nodes.