Tree data structure for small data sets
| updated
Goal
To store a smallish set of data with an arbitrary level of nesting, such as a config or metadata block written in YAML
Requirements
We don’t need to modify or prune any parts of the structure.
But we need to be able to construct it iteratively by parsing data from a byte stream (string or file).
For safety, it would be good to be able to limit the size of structure, in order to handle malicious/buggy input.
Once it is constructed, we want to be able to iterate through the whole thing, or extract arbitrary parts from it.
Model
An append-only tree data structure, i.e. nodes that can contain other nodes
Nodes can be:
- leaf node: terminal, has a value but no children
- branch node: internal, has children but no value
- root node: a special internal node with no parent
- This could be used to record the size of the whole tree. Branch nodes can compute their size on the fly if needed; storing the size of each subtree on the node is a waste of memory.
We want to be able to model both lists (arrays) and dicts (maps).
A simple dict is a series of key-value pairs, where the values are all primitive (int/string, not list/dict). The keys are always strings. It is represented as a branch node that has only leaf nodes as children. The leaf nodes each have a name, representing the key in the dict.
A simple list contains only primitive values as elements. It is also represented as a branch node that has only leaf nodes as children, but the leaf nodes need not have names, i.e. their names are ignored.
A complex list is a series of values, some or all of which are lists/dicts. (It need not be homogeneous: some elements may be strings while others are int/list/dict.) It is represented as a branch node with children; the children may be branch or leaf nodes.
A complex dict is a series of key-value pairs, where some or all of the values are lists/dicts. (Also need not be homogeneous: the values may be of different types. The keys are still just strings, though.) It is represented as a branch node with children, each of which has a name; the children may be branch or leaf nodes.
Types
In a language without sum types (tagged unions), such as Go, we cannot use different types for the root, branch, and leaf nodes and use a sum type for the function signatures etc. It might be possible with generics though?
Instead, we can just have a single type, with fields which may be empty (zero values):
- leaf nodes have an empty list of children.
- branch nodes don’t have values.
- only a root node has a postive size; others have size=0.
- a node name may be ““, if the node represents an element of a list.
We could also add a “type” field or method (values are members of an enum), if it makes it easier to switch behaviour based on node type.
Implementation
- Proof of concept in Go
- Go library: javiljoen/miniyaml