On-disk file format and data structure¶
Larch B-tree forests are stored on the disk as follows:
- Each forest is stored in a separate directory tree structure.
- The file
metadata
is in the INI format, and stores metadata about the forest, including the identifiers of the root nodes of the trees. - The
nodes
directory contains the actual nodes. - The
refcounts
directory contains reference counts for nodes.
The metadata file¶
The following values are stored in the metadata file:
format
is1/1
: node store format version 1 and node codec version 1.node_size
is the maximum size of an encoded node, in bytes.key_size
is the size of the keys, in bytes.last_id
is the latest allocated node identifier in the forest.root_ids
lists the identifiers of the root nodes of the trees.
Node files¶
Leaf nodes are encoded as follows:
- the first four bytes are ‘O’, ‘R’, ‘B’, ‘L’
- 64-bit unsigned integer giving the node identifier
- 32-bit unsigned integer giving the number of key/value pairs
- all keys catenated
- lengths of values as 32-bit unsigned integers
- all values catenated
Index nodes are encoded as follows:
- the first four bytes are ‘O’, ‘R’, ‘B’, ‘I’
- 64-bit unsigned integer giving the node identifier
- 32-bit integer giving the number of keys
- all keys catenated
- all child ids catenated, where each id is a 64-bit unsigned integer
All integers are in big-endian order.
All root nodes are index nodes, so decoding can start knowing that the first node is an index node.
Each node is stored in a file under the nodes
subdirectory,
using multiple levels of directories to avoid having too many files
in the same directory. The basename of the file is the node identifier
in hexadecimal.
The directory levels are user adjustable, see the IdPath
class
for details.
Reference counts¶
Each node has a reference count. When the count drops to zero, the
node file gets removed. Reference counts are stored in files under
the refcounts
subdirectory. Each file there contains up to
32768 reference counts, as a 16-bit unsigned big-endian integer.
Thus, the reference count for node i is file named refcount-N
where N is i modulo 32768.
A refcount file that is full of zeroes is removed. When looking up refcounts, if the file does not exist, the count is assumed to be zero. This avoids having to store refcounts for deleted nodes indefinitely.