
Bancos, a persistent KV-store in full OCaml

2026-01-15

For several years now, I have been interested in a rather particular data structure that associates strings with values. Some time ago, I proposed Art, whose performance was more than satisfying compared to, for example, Hashtbl.

In short, from this starting point, I read a small collection of papers that experiment with such a data structure. In particular, I came across ROWEX and P-ART.

Based on these papers, I thought to myself: could we finally have an indexing mechanism between strings and values that is domain-safe and persistent (and this was before the appearance of OCaml 5)? And that's where the rowex project was born!

In this article, we will take a closer look at the development of bancos, which offers a persistent data structure that associates strings with numbers. Although the project was conceived before OCaml 5 (it was still possible to achieve true parallelism with OCaml 4, notably thanks to parmap), the data structure is domain-safe, meaning that readers and writers can manipulate it in parallel.

Peel the onion

The project is structured so that we start with ART and end up implementing P-ART. However, the research papers' contributions are often "split" (as best as possible) so that the reference implementation does not fundamentally require certain specific processor instructions, a specific scheduler, or even a specific memory representation.

The idea is that our fundamental library (rowex) can be used in several contexts (and thus integrated into tools such as those related to emails or into a unikernel, as we will see). We will therefore break all this down.

Adaptive Radix Tree

So let's start at the beginning and introduce ART: Adaptive Radix Tree. In the world of data structures, ART is an improved version of Radix Tree that attempts to solve a problem: maintaining the same characteristics of a Radix Tree (particularly with regard to insertion) without being too space-intensive.

A Radix Tree has the disadvantage of requiring, for each node, an array of 256 possibilities (one per byte value). In practice, not all of these possibilities are used. ART therefore offers nodes that adapt according to the keys they contain, which gives four kinds of nodes (plus NULL, marking the absence of a node):

type 'a kind =
  | N4   : n4 kind
  | N16  : n16 kind
  | N48  : n48 kind
  | N256 : n256 kind
  | NULL : unit kind

and n4   = bytes
and n16  = bytes
and n48  = bytes
and n256 = N256_Key

type 'a record =
  { prefix : bytes
  ; mutable prefix_length : int
  ; mutable count : int
  ; kind : 'a kind
  ; keys : 'a }

type 'a node = { header : header; children : 'a elt array }
and  'a leaf = { value : 'a; key : key }
and  'a elt = Leaf of 'a leaf | Node of 'a node
and header = Header : 'a record -> header [@@unboxed]

The big advantage of ART, compared to other data structures (such as B*Tree), is that it is not necessary to rebalance the tree. A B*Tree always attempts to have a height of log(n) (where n is the number of keys) in order to have a search complexity of O(log(n)). To achieve this, it is often necessary to rebalance the tree, which involves reworking an entire part of the tree (the right or left side).

This is where ART comes in: there may be a kind of rebalancing when a node needs to be enlarged (from a node of 4 to a node of 16, for example), but this transformation always remains very local and, at worst, only involves the parent of the node to be enlarged (and not a large part of the tree).
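
To make this concrete, here is a sketch of growing a full N4 into an N16 with the types above (a hypothetical helper, not the actual Art code; copying the children array is elided):

(* a sketch: growing a full N4 into an N16. Only the capacity changes; the
   prefix and the existing key bytes are preserved. [Header] packs the new
   record, as in the types above. *)
let grow_n4_to_n16 (r : n4 record) : header =
  let r' =
    { prefix = Bytes.copy r.prefix
    ; prefix_length = r.prefix_length
    ; count = r.count
    ; kind = N16
    ; keys = Bytes.make 16 '\000' } in
  (* the 4 existing key bytes are copied; 12 more slots become available. *)
  Bytes.blit r.keys 0 r'.keys 0 4;
  Header r'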

If there is no rebalancing, however, this means that ART is fundamentally unbalanced. This can be problematic, especially in terms of space. For so-called sparse keys (long keys that have nothing in common), ART has a compression mechanism: if we have a series of nodes that only contain one possibility each time, we prefer to compress this node path into a single node with a "prefix" that corresponds to this path.

In the example below, we have a node containing the prefix "foo". We can reach "foo" and "foot" without a height of 3 (or 4) thanks to the prefix.

[ node, prefix:"foo" ]
  ['\000'] -> ("foo", 21)
  ['t'] -> ("foot", 42)

Here is a draft of the insert implementation with regard to prefixes:

(* here we implement [insert] where:
   - [node] is the current node we traverse
   - [parent] is a reference to the location where our [node] is placed
   - [depth] is the current depth into our tree (how many characters have we
     already gone through)
   - [key] is the key we would like to insert with its [value]
 *)
let insert ... parent node depth key value =
  let diff = until_mismatch node ~off:depth key in
  if diff >= String.length key
  then (* in this case, there is no difference between the [key] and what the
          current [node] allows us to reach (taking into account its [prefix]),
          so we will go further into our tree. *) ...
  else
    (* in this case, we will replace our current [node] with a new node because
       we find a difference between our [key] and what is reachable by our
       [node] (taking into account the [prefix]). *)
    let new_node = new_node4 () in
    (* the prefix is shortened to the difference (this operation may also mean
       that there is no longer a prefix). *)
    new_node.prefix_length <- diff;
    Bytes.blit node.prefix 0 new_node.prefix 0 (min 10 diff);
    (* our old [node] becomes a "child" of our new [new_node] (so as not to
       forget its children). *)
    add_node new_node (Bytes.get node.prefix diff) node;
    (* and we update the prefix of our old [node], which should only contain
       what comes after [diff]. *)
    let len = node.prefix_length - (diff + 1) in
    node.prefix_length <- len;
    Bytes.blit node.prefix (diff + 1) node.prefix 0 (min 10 len);
    (* we can add our key and update the [parent] to say that we have replaced
       [node] with [new_node]. *)
    add_leaf new_node key.[depth + diff] value;
    parent := new_node

Finally, there is one last optimisation. If a single node path leads to a single value, we compress that path: we indicate that, beyond the prefix, there are still bytes to check, but these are found not as consecutive nodes but directly in the leaf of our tree, which contains the full key (and, of course, its value).

Subtly, this is what happens in our example between "foo" and "foot": there is no node after 't' that confirms the remaining bytes (whereas "foo" is explicitly terminated by '\000'). When searching, we must therefore be careful to compare what is requested with the final key stored in the leaf, because we could very well reach the leaf for "foot" while looking for "footu", for example!
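
To illustrate, here is a sketch of a lookup over the types above (check_prefix and find_child are hypothetical helpers, not the actual Art API, and keys are treated as plain strings here):

(* a sketch of the lookup; [check_prefix] skips the compressed prefix (or
   fails) and [find_child] resolves one byte of the key. Both are
   hypothetical helpers. *)
let rec find (elt : 'a elt) depth (key : string) : 'a option =
  match elt with
  | Leaf leaf ->
      (* because of path compression, the leaf may carry more bytes than we
         traversed: the full key must be compared, otherwise "footu" could
         wrongly reach the entry stored for "foot". *)
      if String.equal leaf.key key then Some leaf.value else None
  | Node node -> (
      match check_prefix node.header ~off:depth key with
      | None -> None (* the compressed prefix disagrees with [key] *)
      | Some depth -> (
          match find_child node key.[depth] with
          | None -> None
          | Some elt -> find elt (depth + 1) key))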

Benchmarks

The results of ART are ultimately very interesting. Insertion is markedly cheaper with ART than with a Hashtbl. The reason for this difference is quite simple: ART does not systematically need to scan all the characters of a string to insert it, unlike a Hashtbl (which must compute a hash over the whole string). In reality, ART scans the string and traverses the tree at the same time, and as soon as there is an opportunity to insert a leaf, it does so. In other words, an insertion can complete even before all the characters of the given string have been scanned.

In terms of search performance, we have more or less the same performance as a Hashtbl. Both structures attempt to scan all the bytes of the given string and traverse the structure accordingly. It's difficult to do better than that anyway.

Advantage of a radix tree

However, there is one feature that ART has that a Hashtbl does not: the ability to iterate over multiple keys (and their values) according to a prefix at minimal cost. After all, ART is still a Radix Tree!

val prefix_iter :
  prefix:key -> f:(key -> 'a -> 'acc -> 'acc) -> 'acc -> 'a t -> 'acc
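
A small usage sketch (assuming this signature lives in the Art module, with Art.key as the key constructor used later in this article):

(* collect every binding whose key starts with a given prefix. *)
let bindings_under ~prefix (t : int Art.t) =
  Art.prefix_iter ~prefix:(Art.key prefix)
    ~f:(fun key v acc -> (key, v) :: acc)
    [] t

(* for the tree shown earlier, [bindings_under ~prefix:"foo" t] would
   collect both ("foo", 21) and ("foot", 42). *)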

Read-Optimized Write EXclusion

ROWEX is another research paper, one that tackles a question very much of the OCaml 5 era. When discussing parallelism, the ideal scenario is to manipulate lock-free data structures as often as possible.

Although such structures can be manipulated simultaneously by multiple cores, their implementation is often quite difficult and/or requires indirections in order to avoid problems that can be quite difficult not only to find but also to fix.

The idea behind ROWEX is to offer a not-too-complicated implementation that can be handled by multiple cores at the same time. More specifically, the idea is to use locks at specific points where they do not cause too much degradation (particularly during searches and insertions).

We won't go into the implementation details, but the paper introduces certain changes relative to ART (particularly regarding compression).

More generally, during insertion in particular, some modifications are protected by our spin-lock and may render certain nodes obsolete. For readers, ending up on an obsolete node is not so serious, as a new read will bring us to the latest version of the information. For writers, however, encountering an obsolete node is problematic, because the writer would compute where to place its data based on a node that is no longer in use. In this specific case, we restart the writer from the beginning.

(* our spin lock *)

let rec until_is_locked mem addr version =
  (* the second bit of our [version] value tells us whether the node is locked
     or not. *)
  if version land 0b10 = 0b10 then
    let* () = pause_intrinsic () in
    let* version = atomic_get mem (addr + _header_kind) Value.leintnat in
    until_is_locked mem addr version
  else return version
[@@inline]

let rec write_lock_or_restart mem addr =
  let* version = atomic_get mem (addr + _header_kind) Value.leintnat in
  let* version = until_is_locked mem addr version in
  (* at this point, we should have a node that is not locked, we will therefore
     try to obtain exclusivity via [compare_exchange]. *)
  let* set =
    compare_exchange mem
      (addr + _header_kind)
      Value.leintnat version (version + 0b10)
  in
  (* if we haven't managed to lock the node [set = false], it means that someone
     else has taken exclusive control of it, so we try again. *)
  if not set then write_lock_or_restart mem addr
  else return ()
[@@inline]
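
For completeness, here is what the reader side could look like under the same conventions (a sketch: we assume here that bit 0b01 marks an obsolete node, and restart is a hypothetical helper that restarts the whole operation from the root):

(* sketch of the reader side: read the version, perform the reads, then
   check that the version has not changed in between. *)
let read_lock_or_restart mem addr =
  let* version = atomic_get mem (addr + _header_kind) Value.leintnat in
  let* version = until_is_locked mem addr version in
  if version land 0b01 = 0b01
  then restart () (* the node is obsolete: restart from the root. *)
  else return version

let read_unlock_or_restart mem addr version =
  let* version' = atomic_get mem (addr + _header_kind) Value.leintnat in
  (* any lock/unlock cycle bumps the version: a change means a writer
     modified the node under us, so our reads must be retried. *)
  if version <> version' then restart () else return ()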

The implementation we propose is quite a bit more complex than ART, but we have implemented a single-core, in-memory backend (bank) and compared it with a Hashtbl. After several hours of afl-fuzz, we still haven't found any bugs. Being single-core, however, this does not test parallel read and write access; we plan to use at least OCaml 5 and TSan to validate our implementation (remember that we started the project before OCaml 5!).

Persistent ART

This is perhaps the most difficult part: persisting the writes. The RECIPE paper proposes a rather interesting approach that involves manipulating persistent memory through two processor instructions: clwb and a non-temporal store (movnt).

The real problem with persistence is that there is a world of difference between the user who wants to do an atomic_store and what actually happens on the hard drive. Among other things, the write may sit in the CPU's caches and write buffers for an arbitrarily long time before it ever reaches the storage medium.

RECIPE therefore relies on a processor instruction (clwb) that can force the CPU to actually write a cache line back to memory. Assuming that memory is persistent memory, the write becomes truly durable.

RECIPE also takes advantage of a specific SSE2 instruction that writes and persists a 64-bit value without polluting the cache (in the sense that other CPUs may not synchronise on this write). This is used in particular when a new entry is added, where it is not that serious if CPUs, readers in particular, are not synchronised: these readers will simply encounter the old version of the nodes (for which the traversal remains consistent; they just won't see the latest version containing the new entry).

But the real advantage of RECIPE is that it offers a persistent implementation of ART/ROWEX that requires only a few well-defined "instructions". Accordingly, our implementation is functorised over this interface.

type ('c, 'a) value =
  | Int8                : (atomic, int) value
  | LEInt               : (atomic, int) value
  | LEInt16             : (atomic, int) value
  | LEInt31             : (atomic, int) value
  | LEInt64             : (atomic, int64) value
  | LEInt128            : (atomic, string) value
  | Addr_rd             : (atomic, ro Addr.t) value
  | Addr_rdwr           : (atomic, rdwr Addr.t) value
  | OCaml_string        : (non_atomic, string) value
  | OCaml_string_length : (non_atomic, int) value

and atomic = Atomic
and non_atomic = Non_atomic

module type S = sig
  type 'a t
  type memory

  val bind : 'a t -> ('a -> 'b t) -> 'b t
  val return : 'a -> 'a t

  val atomic_get : memory -> Addr.t -> (atomic, 'v) value -> 'v t
  val atomic_set : memory -> Addr.t -> (atomic, 'v) value -> 'v -> unit t
  val set_n48_key : memory -> Addr.t -> int -> int -> unit t
  val fetch_add : memory -> Addr.t -> (atomic, int) value -> int -> int t
  val fetch_or : memory -> Addr.t -> (atomic, int) value -> int -> int t
  val fetch_sub : memory -> Addr.t -> (atomic, int) value -> int -> int t

  val compare_exchange :
       memory
    -> ?weak:bool
    -> Addr.t
    -> (atomic, 'a) value
    -> 'a Atomic.t (* expected value *)
    -> 'a          (* desired value *)
    -> bool t

  val get : memory -> Addr.t -> ('t, 'v) value -> 'v t

  val persist : memory -> Addr.t -> len:int -> unit t
  val movnt64 : memory -> dst:Addr.t -> int -> unit t
  val pause_intrinsic : unit -> unit t

  val allocate : memory -> kind:[ `Leaf | `Node ] -> ?len:int -> string list -> Addr.t t
  val delete : memory -> Addr.t -> int -> unit t
  val collect : memory -> Addr.t -> len:int -> uid:int -> unit t
end

Let's put persistence into perspective: as stated in the RECIPE research paper, the advantage of P-ART is that it can be used with persistent memory, which in most cases is not available (1). This means that bancos does not currently provide durability (the D in ACID), since we cannot truly guarantee the persistence of a write without persistent memory.

However, we can still get reasonably close to it in practice (see the footnote on msync(3P) below).

As we can see, implementing P-ART does not require many instructions, and these instructions can be defined quite formally. However, there is one last part that needs to be explained: the allocate/delete/collect functions. These allow nodes to be allocated and possibly reused, so we will briefly introduce our garbage collector.


1: In reality, we could use msync(3P) to actually persist the values, and combine what follows (for unikernels) with this system call. The fact is that P-ART remains fairly flexible about what sits underneath, which allows us to offer several backends.

Garbage collector

bancos offers a mini GC. As mentioned with regard to ROWEX, nodes may become obsolete. If we only used OCaml, we could rely on its GC, but in the case of persistence, we manipulate the memory directly.

These obsolete nodes must therefore be collected (marked). Then comes a phase (the sweep) during which nodes that are still obsolete but no longer used by anyone (knowing that they are unreachable anyway) are considered free; when a task wants to allocate a new node, we can reuse them.

The problem is that we cannot directly consider these obsolete nodes as free. Indeed, other tasks running in parallel may still be traversing them (remember, a reader can safely traverse an obsolete node). To avoid disrupting anyone, these nodes are only considered free once we are sure that no active task predates the moment at which they became obsolete. These nodes are normally no longer reachable, so new tasks cannot reach them; only old tasks may still need to traverse them.

It is therefore a mark-and-sweep GC: we mark our nodes with a date and sweep them as soon as they are old enough (relative to the active tasks working on our tree).
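
As a rough sketch (with hypothetical names, not the actual bancos allocator; synchronisation is elided): each task registers the epoch at which it started, each obsolete node is marked with the epoch at which it became obsolete, and a node only becomes free once every active task started after that mark.

(* an epoch-based sketch of the mark-and-sweep described above. *)
type node = { addr : int; len : int }

type gc =
  { epoch : int Atomic.t               (* the global clock *)
  ; active : (int, int) Hashtbl.t      (* task id -> epoch it started at *)
  ; mutable marked : (int * node) list (* obsolete nodes with their mark *)
  ; mutable free : node list }         (* nodes ready for reuse *)

let mark gc node =
  let now = Atomic.fetch_and_add gc.epoch 1 in
  gc.marked <- (now, node) :: gc.marked

let sweep gc =
  (* the oldest epoch still in use by an active task. *)
  let oldest = Hashtbl.fold (fun _ e acc -> min e acc) gc.active max_int in
  (* a node is free once no active task predates its mark. *)
  let still, freed = List.partition (fun (e, _) -> e >= oldest) gc.marked in
  gc.marked <- still;
  gc.free <- List.map snd freed @ gc.free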

A KV-store as a unikernel

This is perhaps the most enjoyable part of the whole project: making our implementation work in a unikernel. As we have said, we only need to implement Rowex.S to have a domain-safe and persistent data structure. The goal now is to propose an implementation of Rowex.S on top of block devices.

However, a unikernel has two distinctive features: it has no access to persistent memory (nor to the processor instructions mentioned above), and its storage is a block device, which is read and written page by page.

We will set aside the issue of persistence (since, according to the P-ART design, we would need access to persistent memory) and propose an implementation that translates atomic writes into page writes.

When we talked about P-ART, we mentioned a CPU-level cache for writes: this is often referred to as a write pipeline. In fact, the CPU does not immediately write values to RAM and prefers to manipulate its cache first. It therefore has a whole series of writes waiting (and this is when clwb explicitly requests that these writes be carried out). We can re-implement this cache ourselves and propose a persist function that requests to actually write (and therefore emit pages) what is represented in our write pipeline to a certain part of our block device.

This is slightly different from what the CPU does (at the hardware level) because the CPU has a cache size limit. We are more interested in transforming a series of atomic_set into a single continuous page write (a batch write).

So we implemented cachet.wr!

type 'fd writev = 'fd -> pos:int -> Bstr.t list -> unit

type 'fd t =
  { cache: 'fd Cachet.t
  ; pagesize: int
  ; pipeline: (int * value) Sequence.t
  ; mutable areas: Diet.t
  ; fd: 'fd
  ; number_of_pages: int
  ; map: 'fd Cachet.map
  ; writev: 'fd writev }

type sign = Unsigned | Signed
type endian = Le | Be | Ne

type 'a v =
  | Vi8 : sign -> int v
  | Vi16 : sign * endian -> int v
  | Vi32 : endian -> int32 v
  | Vi64 : endian -> int64 v
  | Vi128 : string v

(* a pending write in the pipeline packs a value with its witness, as
   destructured by [unroll] below. *)
type value = Value : 'a v * 'a -> value

val get : 'fd t -> int -> 'a v -> 'a
val set : 'fd t -> int -> 'a v -> 'a -> unit
val persist : 'fd t -> off:int -> len:int -> unit

The idea behind this module is to maintain a queue of the writes we want to perform. The get operation first checks whether the requested address has an associated write, and if so, returns what we should see if our writes were actually effective.

In reality, a set does not interact with the block device (it only adds a write to our queue). However, we can use persist and, in this case, send the latest version of our pages to the block device and actually write (with writev). In other words, we have recreated a write cache in OCaml!
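
In miniature, the contract looks like this (offsets here are arbitrary):

(* a [set] is only queued; [get] sees it immediately; [persist] is the
   moment the page actually reaches the block device. *)
let demo (t : 'fd t) =
  set t 0x100 (Vi64 Le) 42L;            (* queued, nothing is written yet *)
  assert (get t 0x100 (Vi64 Le) = 42L); (* reads see the pending write *)
  persist t ~off:0x100 ~len:8           (* now the page is actually written *)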

Reading is perhaps the most interesting part, because it involves reading what is in our block device and unrolling the pending write operations over the specific area we want to read. Our function is quite simple: we create a buffer of 16 * 3 bytes (a value is at most 16 bytes, and overlapping writes may spill on either side) and apply the writes that have an impact on the area we are trying to read:

let unsafe_value_into_bytes : type a. ?off:int -> bytes -> a v -> a -> unit =
 fun ?(off = 0) buf k v ->
  match k with
  | Vi8 Unsigned -> Bytes.set_uint8 buf off v
  | Vi8 Signed -> Bytes.set_int8 buf off v
  | Vi16 (Unsigned, Le) -> Bytes.set_uint16_le buf off v
  ...
  | Vi128 -> Bytes.blit_string v 0 buf off 16

let unroll : type a. 'fd t -> at:int -> a v -> a =
 fun t ~at k ->
  let buf = Bytes.make (16 * 3) '\000' in
  let len = length_of_value k in
  (* get what we have from our block-device. *)
  Cachet.blit_to_bytes t.cache ~src_off:at buf ~dst_off:16 ~len;
  let a = at - 16 and b = at + 16 in
  let fn node =
    let at', w = Sequence.data node in
    if at' >= a && at' < b then begin
      let roff = if at' >= at then 16 + (at' - at) else 16 - (at - at') in
      (* apply our write (unpacking the GADT requires a [match]) *)
      match w with Value (k, v) -> unsafe_value_into_bytes ~off:roff buf k v
    end
  in
  (* apply all entries that modify our range [a..b]. *)
  Sequence.iter fn t.pipeline;
  (* get the final value *)
  match k with
  | Vi8 Unsigned -> Bytes.get_uint8 buf 16
  | Vi8 Signed -> Bytes.get_int8 buf 16
  | Vi16 (Unsigned, Le) -> Bytes.get_uint16_le buf 16
  ...
  | Vi128 -> Bytes.sub_string buf 16 16

We then need to quickly determine whether we must unroll or can read directly from our block device. To do this, we use a data structure already introduced during the implementation of our TCP/IP stack with Miou: the discrete interval encoding tree (Diet).

Each time we write, we update this set of intervals (recording that an area has been modified) so that, when reading, we know whether what the user is requesting has been modified or not:

let get : type a. 'fd t -> int -> a v -> a =
 fun t offset k ->
  let len = length_of_value k in
  let z = Diet.inter (Diet.singleton offset (offset + len)) t.areas in
  if Diet.is_empty z then begin
    (* from our block device *)
    match k with
    | Vi8 Unsigned -> Cachet.get_uint8 t.cache offset
    | Vi8 Signed -> Cachet.get_int8 t.cache offset
    | Vi16 (Unsigned, Le) -> Cachet.get_uint16_le t.cache offset
    ...
    | Vi128 -> Cachet.get_string t.cache ~len:16 offset
  end
  else
    (* combination from our block-device and our write pipeline *)
    unroll t ~at:offset k

I'll spare you the persist function, but it's not that complicated. It just involves identifying the pages that include the area we want to persist (off and len) and applying all the writes to them (as we did with unroll), then calling a single writev.
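
Still, here is a sketch of it under the same assumptions as before (apply_pipeline and Bstr.of_bytes are hypothetical here; the real function would also prune the pipeline and the Diet accordingly):

(* a sketch of [persist]: load every page covering [off..off+len), replay
   the pending writes on them (as [unroll] does), then batch them into a
   single [writev]. *)
let persist t ~off ~len =
  let first = off / t.pagesize and last = (off + len - 1) / t.pagesize in
  let page n =
    let buf = Bytes.make t.pagesize '\000' in
    (* the current content of the page, from our read cache... *)
    Cachet.blit_to_bytes t.cache ~src_off:(n * t.pagesize) buf ~dst_off:0
      ~len:t.pagesize;
    (* ... onto which we replay the pending writes (hypothetical helper). *)
    apply_pipeline t ~page:n buf;
    Bstr.of_bytes buf
  in
  let pages = List.init (last - first + 1) (fun i -> page (first + i)) in
  (* a single batched write for all the pages involved. *)
  t.writev t.fd ~pos:(first * t.pagesize) pages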

Results

We finally created a KV-store in the form of a unikernel called kevin. It offers an HTTP server (using Vif) with three routes to manipulate the store.

Attempting to add 160k entries to this unikernel takes some time, but memory usage remains very stable (in other words, no memory leaks are observed). The availability of our unikernel is also tangible (thanks to Miou); attempting to add these 160k entries does not prevent another client from searching for a key "at the same time" (keep in mind, we are running on a single CPU). Here, I am referring to the unavailability issues we previously observed with Lwt.

And our emails?

Regarding emails, in our previous articles, we discussed our search engine running on our archive system and the possibility of manipulating such an archive and offering such a search engine in the form of a unikernel.

For example, we took the caml-list and searched for some interesting items from our archive.

However, there is another way to browse your emails, and that is through threads. An email is sometimes (often, even) a reply to another email. Two pieces of information can be noted: the Message-ID, which uniquely identifies an email, and the In-Reply-To field, which contains the Message-ID of the email being replied to.

The idea is therefore to reconstruct these trees, which correspond to series of emails linked via their Message-ID and In-Reply-To. The archive format, which comes from the Git PACKv2 format, can contain this type of tree. In other words, we would have a new type of object that associates an email with its In-Reply-To.

type t =
  { msgid : MessageID.t
  ; uid_into_archive : Carton.Uid.t
  ; children : MessageID.t list }

So, when we want to archive a series of emails, we take care to build a graph and deduce the successors of each email. We then save these objects last in our PACK file.

let compute_graph emails =
  let graph = Graph.create () in
  let index = Art.make () in
  let fn (uid_into_archive, metadata) =
    match (metadata.Email.in_reply_to, metadata.Email.message_id) with
    | Some x, Some y ->
        let key = Art.key y in
        Graph.add_edge graph x y ; (* x -> y *)
        Art.insert index key (uid_into_archive, metadata)
    | None, Some x ->
        let key = Art.key x in
        Graph.add_vertex graph x ;
        Art.insert index key (uid_into_archive, metadata)
    | Some _, _ | None, None -> () in
  Seq.iter fn emails ;
  let fn msgid entries =
    let fn (_, _to) children = _to :: children in
    let key = Art.key msgid in
    match (Art.find_opt index key, Graph.fold_succ_e fn graph msgid []) with
    | None, _ -> entries
    | _, [] -> entries
    | Some (uid_into_archive, _), children ->
        { msgid; uid_into_archive; children } :: entries in
  Graph.fold_vertex fn graph []

Then, we can populate our KV store to associate the Message-IDs with these objects. Finally, thanks to our new Tree objects and our KV store, we can reconstruct the threads from our archive!

$ blaze rowex populate -i rowex.idx -p 0:pack-xxx.pack
$ blaze rowex tree -i rowex -p 0:pack-xxx.idx
┌── [b0163594edd0635503ab5822172b5944414213b2]─┐
│   ├── <CAHk-=wi5OoS9v3h7YOPf2rMFWGfHQUgNrFRJ3NVxJLZDn3qnBQ@mail.gmail.com>
│   └── <87ttd4tk26.fsf@redhat.com>

┌── [be8a8e0567e44bfe6cccd007444797aefd56ead2]─┐
│   ├── [9d4d62d0a6eee82c12d660b2338c468c6b3c47ac]─┐
│   │   ├── <20250326194423.3717668-5-yosry.ahmed@linux.dev>
│   │   ├── [8d5c3c09d7aec482ad22bcb2378bed5c1ec44b93]─┐
│   │   │   └── <5f714d7fb68aef92f1bea58a10deb4de1a10a5b8.camel@redhat.com>
│   │   ├── [ba7aaebfddc46908ea59d7b027a40ef866f94c33]─┐
│   │   │   ├── <aFn6pqLr6pShBfaU@google.com>
│   │   │   └── <11a2ece0061308f7e340529c4c42a9d81bab0045.camel@redhat.com>
│   │   └── <20250326194423.3717668-2-yosry.ahmed@linux.dev>
...

The example above comes from kvm.1. It should be noted that the archive is incomplete: in the threads shown above, a node is represented either by a hexadecimal value (a hash, whose content can be retrieved via blaze pack get) or by a bare Message-ID. The latter means that, when we computed our graph, we did not find the email corresponding to that specific Message-ID.

Our KV store is capable, in this instance, of knowing the email offsets within several PACKv2 files (each PACKv2 file being associated with a unique ID — in our example, the pack-xxx.pack file is associated with ID 0). We should therefore be able to handle multiple archives and populate our KV store from them.
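
Since bancos associates strings with integers (63-bit OCaml ints), one plausible encoding for such values (our assumption, not necessarily what blaze does) is to pack the pack ID into the high bits and the offset into the low bits:

(* a plausible encoding (an assumption, not necessarily what blaze does):
   15 bits of pack ID and 48 bits of offset inside one 63-bit OCaml int. *)
let value_of ~pack ~offset = (pack lsl 48) lor offset
let pack_of v = v lsr 48
let offset_of v = v land ((1 lsl 48) - 1)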

The core idea is ultimately to provide two search methods: searching emails by their content (via our search engine) and navigating them by thread.

And we can achieve this quite easily now with our KV store! The next step is likely to enhance our Tree object with additional metadata, such as the email subject (to avoid systematically loading the email content just to retrieve this information), and finally, we should provide an efficient way to navigate through these emails.