μTCP, Miou and unikernels
2025-03-24

You may know that we recently enabled the compilation of unikernels with OCaml 5! This rather impressive work, which brings together the contributions of several people, has unlocked a new possibility for unikernels: the use of effects.
Before going any further: the design of unikernels with Solo5 does not allow the use of multiple cores. Supporting them would require a significant amount of work on Solo5, which we are not necessarily very interested in: unikernels remain fairly simple and should, in our opinion, only require a single core for the majority of their objectives — namely, implementing small services that can fit into an existing infrastructure1.
Also, unikernels with OCaml 5 already work, such as our good old qubes-mirage-firewall. We took the time to test, because the OCaml 5 runtime has a lot of changes that could have broken behaviors we relied on with OCaml 4.14.
But what has changed now, and this is what I will try to show in this article, is the use of effects instead of lwt. For this, we at Robur have developed our own scheduler, Miou, which was designed in particular to be usable one day in a unikernel.
One last point about all this work: it takes advantage of a number of opportunities that we have been thinking about for quite some time:
- the first is to use fewer functors. Our abstractions available in ocaml-tar, utcp, ocaml-tls and decompress have stood the test of time in our opinion — much more so than functors.
- the second is to use bytes/string rather than bigstrings (even if the latter are still useful). Once again, it is difficult to talk about an improvement in performance, but it is worth a try.
- the third is to use μTCP instead of mirage-tcpip. We have encountered quite a few problems with the latter, its code is becoming quite unmaintainable, and it is in any case more or less tied to lwt.
A note about Solo5
Before we start talking code, we need to understand a little of what Solo5 has to offer and what we need to do to enable our unikernels to chat with the rest of the world.
Solo5 is very basic. It delivers (it's in the name) 5 "hypercalls", that is to say 5 "functions" that enable interaction with external events:
- reading and writing on a block-device. A block device can be seen as a hard disk to which we have access "per page" (you can refer to Cachet, for example, to read/write in such a block-device). On the host system (the one that virtualizes the unikernel), it can be a simple file or an actual volume (such as a ZFS volume).
- reading and writing on a net-device. A net-device can be seen as an Ethernet socket. On the host system, it is called a TAP interface, which is a virtual interface that can be attached to a virtual bridge to create a "network".
- the unikernel shutdown, which is the last available hypercall!
You can see the documentation for these hypercalls here. That's all that Solo5 offers for the application domain.
Using these hypercalls, a unikernel can interact with the outside world via files or virtual network interfaces. For the latter, the only thing available to us is Ethernet packets with a destination (possibly our unikernel) and a source (another Ethernet interface). And that's it. There is no FTP, SSH or even HTTP. There isn't even TCP/IP — there isn't even IP...
The choice we have made is therefore to re-implement all these protocols in OCaml and to manage all these "layers" without using libraries which are often made in C. We therefore need:
- an implementation of the Ethernet protocol
- an implementation of the ARP protocol
- an implementation of the IPv4 protocol
- an implementation of the ICMPv4 protocol
- an implementation of the TCP protocol
Fortunately, we already have one in OCaml: mirage-tcpip. But, as mentioned above, we would like to take advantage of certain opportunities that would be rather difficult to seize in this project.
A note about effects
From our point of view, effects unlock a new capacity for abstraction. They allow us not only to express a whole host of things in a different way, but above all to express the interactions with a scheduler without involving a monad ('a t).

It is above all this aspect that is interesting for us, because we can then compose other monads easily without considering the question of the scheduler (lwt or async). This allows us, in particular, to play more easily with the Result monad, for example.
The idea is therefore not to revolutionize the way we make our unikernels but above all to unlock other methods of abstraction that until now have required "tricks" such as higher-kinded polymorphism.
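To make this concrete, here is a toy sketch (not Miou's actual implementation, and Read is a purely hypothetical effect): with effects, read returns a plain string rather than a string Lwt.t, and the suspension lives entirely in the handler.

(* a toy sketch of direct style via effects, not Miou's implementation.
   [Read] is a hypothetical effect that a scheduler would interpret; note
   that [read] returns a plain [string], not a ['a t]. *)
open Effect
open Effect.Deep

type _ Effect.t += Read : string Effect.t

let read () = perform Read

let run fn =
  match_with fn ()
    { retc= Fun.id
    ; exnc= raise
    ; effc= fun (type a) (eff : a Effect.t) ->
        match eff with
        | Read ->
          Some (fun (k : (a, _) continuation) ->
            (* a real scheduler would suspend [k] until a frame arrives;
               here, we resume immediately with a dummy value. *)
            continue k "hello")
        | _ -> None }

let () = run (fun () -> print_endline (read ()))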
At the beginning, Ethernet
Ethernet is the lowest layer. It consists of a header containing the source MAC address and the destination MAC address. Then there is a payload2, which can be an IPv4 or IPv6 packet or ARPv4, for example.
type packet =
{ src : Macaddr.t
; dst : Macaddr.t
; payload : Slice_bstr.t }
This is where we would like to use our hypercall (Miou_solo5.Net.read_bigstring) to read the incoming Ethernet frames. This is also where we will have our only blocking hypercall. Considering Miou, the design is very simple: launch a background task, a daemon, which will attempt to read the incoming frames. It will then attempt to decode these frames and determine whether our unikernel is the destination (according to its MAC address specified by Solo5).
let rec daemon t =
  (* we are blocked here until a new "frame" comes *)
  let len = Miou_solo5.Net.read_bigstring t.net t.bstr_ic in
  (* we received a new Ethernet frame of [len] bytes. *)
  let ok ({ Packet.protocol; src; dst }, payload) =
    match protocol with
    | None -> ()
    | Some protocol ->
      (* we recognized a protocol (IPv4 or ARPv4) *)
      let packet = { src= Some src; dst; protocol; payload } in
      (* is it for us? *)
      if Macaddr.compare dst t.mac = 0
      || not (Macaddr.is_unicast dst)
      then t.handler packet (* call the handler to handle the upper layer *) in
  let error = Fun.const () in
  let () = Result.fold ~ok ~error (Packet.decode t.bstr_ic ~len) in
  daemon t
However, there is one point that needs clarifying. In the Miou model, when it comes to executing the handler given by the user, it is preferable to do it using Miou.async: in other words, to execute the handler cooperatively with the daemon.
let rec clean orphans = match Miou.care orphans with
  | Some None | None -> ()
  | Some (Some prm) -> Miou.await_exn prm; clean orphans

let rec daemon t =
  clean t.orphans; (* clean-up terminated children *)
  let client = accept () in (* accept a new incoming connection *)
  (* launch a new task from the user's handler function and add it as a
     child into the [orphans] value. *)
  let _ = Miou.async ~orphans:t.orphans (t.handler client) in
  daemon t
It may even be more interesting to run the handler in parallel with Miou.call. This is particularly the case with httpcats, where the HTTP request handler is run in parallel with the "main loop". But Solo5 has only one core. So why this choice?
Implementation at this stage is really important, particularly with regard to performance. Allocating a new task (Miou.async) to execute the user's handler can have a significant cost. In fact, creating a new task consists of interacting with Miou, giving it the possibility of initializing everything that is necessary internally for this new task and finally executing it. Here, the choice is precisely not to cooperate at this level but to effectively "continue" towards the user's handler without interruption.
The fact that there is no interruption assures us of one thing: between what our daemon handles and the user's handler, Miou did not have the opportunity to do anything else. More specifically, Miou did not have the opportunity to try to read a new packet. This allows us to introduce a new concept related to bigarrays.
If you look at the implementation of our daemon, we try to read a frame and save it in a bigarray, t.bstr_ic. Note that we have pre-allocated t.bstr_ic and use it (and reuse it) to read our frames. After reading a frame, it is therefore in t.bstr_ic and can be analyzed.

If we then gave Miou the opportunity to do something else (by creating, for example, a cooperative task that would like to continue analyzing the frame — on the IP layer, for example), we would be in a situation where we could cooperatively write a new frame into t.bstr_ic while also analyzing our first frame in our cooperative task.
The problem described above is an ownership problem: two tasks can read and write to the same bigarray. We would like to be able to express that, as long as our bigarray is being analyzed, Miou should not try to write a new frame into it. This is a form of exclusion in the execution of our two tasks, being sure that the analysis takes ownership first. This is, for example, what can be expressed with Rust, but we are in OCaml...
One solution would be to use a mutex and be sure of this exclusion once and for all. Another way, and this is the one we are adopting here, is not to give Miou the opportunity to do anything else and simply to continue analyzing our bigarray, knowing that we only have one core (in the case of multiple cores, this assertion is no longer true).
The idea here is, from the Ethernet frame, to try to go as far as possible in the OSI model without giving Miou the opportunity to switch tasks, but also to keep ownership of our bigarray throughout and thus avoid copying it.
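Concretely, this composition is just nested function calls: a sketch with hypothetical Arpv4.input and Ipv4.input functions.

(* a sketch of the synchronous composition: each layer is an ordinary
   function call, so between [Miou_solo5.Net.read_bigstring] and the TCP
   layer, Miou never gets the opportunity to reschedule, and [t.bstr_ic]
   remains ours during the whole analysis. [Arpv4.input] and [Ipv4.input]
   are hypothetical. *)
let handler t packet = match packet.protocol with
  | `ARP -> Arpv4.input t.arpv4 packet.payload
  | `IPv4 -> Ipv4.input t.ipv4 packet.payload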
The payload here is a Slice_bstr.t, a type which more or less corresponds to Cstruct.t. The question of strings versus Cstruct.t/Slice_bstr.t arises later (on the IPv4 layer).
Then, ARP
The ARP layer is fairly simple to understand: it is a cache that associates IP addresses with MAC addresses. The purpose of this layer is to announce to the network that we have a specific IP address, but also to be able to ask where an IP address is located (and to obtain a MAC address for it).
However, we are starting to manage asynchronous tasks here. In fact, when it comes to sending an IPv4 packet to a specific IP address, we need to:
- write an ARP packet to ask the network for the MAC address of an IP address
- wait for an ARP packet from the network informing us of the MAC address
- then be able to write an IPv4 packet to the MAC address received from the network
It is also here that the cache is very important, as we wouldn't want to make this request every time we wanted to send an IPv4 packet. In general, we only do it once and we keep the entry for a certain time.
However, our process of writing a new IPv4 packet is indeed awaiting a response from the ARP layer: a response from the cache or the network.
let write_ipv4 t ~(dst : Ipaddr.V4.t) ipv4_pkt =
  let ivar = Routing.destination_macaddr t.arpv4 dst in
  (* who-has [dst]? *)
  let macaddr = Miou.Computation.await_exn ivar in
  (* [dst] is-at [macaddr]. *)
  Ethernet.write t.eth t.src ~dst:macaddr ipv4_pkt
Miou offers a Miou.Computation module that allows a value to be transferred from one task to another (in an atomic way — it is therefore domain-safe). Thus, in the above example, we can wait for the value from our Miou.Computation.t, and the ARP layer will "fill in" our Miou.Computation.t with the response from the cache (therefore without latency) or from the network (after reading and decoding an ARP packet).

A Miou.Computation.t can be seen as an IVar: only one value can be transferred. Miou then ensures that this transfer can be done cooperatively but also in parallel.
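On the ARP side, "filling in" the computation might look like this (a sketch: t.pending is a hypothetical table of pending queries, and we assume that Miou.Computation exposes a try_return in the style of picos):

(* sketch of the ARP side: when an "is-at" reply arrives, complete the
   pending computation. [t.pending] is hypothetical and we assume
   [Miou.Computation.try_return], which returns [false] if the computation
   was already completed. *)
let on_reply t ~ip ~macaddr =
  match Hashtbl.find_opt t.pending ip with
  | None -> () (* nobody asked; just refresh the cache *)
  | Some ivar ->
    ignore (Miou.Computation.try_return ivar macaddr);
    Hashtbl.remove t.pending ip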
Followed by IPv4
IPv4 is perhaps the most difficult layer, because this is where packet fragmentation appears. We have not yet mentioned it, but we cannot write packets of unlimited size on the network: IPv4 packets are limited by what is called the MTU (Maximum Transmission Unit).
If the user wants to write an IPv4 packet of 2000 bytes when the MTU is 1500 (which is the most common value), our unikernel must "split" the packet into several fragments so that it respects the MTU. The reverse is also true: a packet can be fragmented and it is up to us to reassemble it. This can happen, for example, if a router on the network has a smaller MTU — even if our unikernel has an MTU of 1500 bytes, fragmentation may occur at a smaller MTU depending on the network. One last "detail": the fragments do not necessarily arrive in order!
Here we are going to have a little fun with GADTs in order to specify the IPv4 packets that we receive. We can receive 3 types of IPv4 packets:
- a non-fragmented packet, which is the most common. It is not necessary to "reassemble" it. It can be given as is to the upper layer (TCP) and we can continue to do so without copying if we still do not give Miou the opportunity to perform another task
- a fragmented packet that contains the "end" of our "payload". This packet is particularly interesting because we then know the final size of our payload. We can therefore preallocate a buffer that can contain all the fragments we are going to receive. We just have to check that we are not writing over a fragment that has already been received, and detect when this preallocated buffer is full (that is, when all the fragments that compose it have been received)
- a fragmented packet without the "end": this is the worst-case scenario, because we do not know the total size of the packet and we have to "concatenate" the received fragment with other fragments into something incomplete that will gradually become our complete packet
type fragmented = Fragmented
type unfragmented = Unfragmented
type 'a payload =
| Unsized : Ropes.unknown Ropes.t -> fragmented payload
| Sized : Diet.t * bytes -> fragmented payload
| Unfragmented : Slice_bstr.t -> unfragmented payload
type t = Payload : 'a payload -> t [@@unboxed]
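Before detailing each case, a small illustration of what this GADT gives us: a total function whose cases know, at the type level, whether a final size exists (assuming a Slice_bstr.length function).

(* illustration: only the [Unsized] case has no known size; the GADT makes
   this explicit. [Slice_bstr.length] is assumed here. *)
let known_length : type a. a payload -> int option = function
  | Unfragmented slice -> Some (Slice_bstr.length slice)
  | Sized (_, buf) -> Some (Bytes.length buf)
  | Unsized _ -> None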
Fragmented packets with an unknown size
As we have said, we may receive a fragment that does not inform us of the final size of the payload. It just informs us that the given fragment has a position in the final payload, its offset.
First of all, we must save the given fragment. Remember that we are currently manipulating the bigarray also used by the Ethernet module to read the next frame. If we don't make a copy and only use this bigarray, its content will change and we won't be able to construct the final payload correctly.
Then, the reception of several of these fragments implies a form of concatenation of the latter. At this stage, we would like to describe this concatenation without actually performing it (such as fragment0 ^ fragment1). We will therefore use a fairly well-known structure: the rope.
A rope is a simple binary tree in which our fragments will be found. In this tree, a node corresponds to what we imagine to be the concatenation between the left part and the right part. This allows us to keep the idea that there is indeed a concatenation without actually doing it. Concatenation of bytes is costly: it involves the creation of a third bytes and the copying of our fragments.
Using a rope here consists, with regard to concatenation, of just creating a new node. I recommend the implementation of Filliâtre but also its proven version.
Now, our rope is a bit special. In fact, the only parameter that we don't know is the final size of our payload. So our rope has holes: some nodes have a known size (for example, if we have a fragment at offset 1480, this means that the left part has a known size of 1480 bytes) while others (which we will find mainly in the right part of our nodes) have an unknown size.
type fixed = Fixed
type unknown = Unknown
type 'a t =
| Fragment : string -> fixed t
| Hole : 'a size -> 'a t
| Concat : fixed t * 'a t * int * 'a size -> 'a t
(* left part, right part, size of the left part, size of the right part *)
and 'a size =
| Length : int -> fixed size
| Limitless : unknown size
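For example, after receiving a first 1480-byte fragment at offset 0 (and nothing else yet), the rope looks like this:

(* a hand-built value under the definitions above: a 1480-byte fragment at
   offset 0 followed by a hole of unknown size: we are still waiting for
   the remaining fragments. *)
let example : unknown t =
  Concat (Fragment (String.make 1480 'x'), Hole Limitless, 1480, Limitless)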
I will spare you the implementation of the insert function, which allows you to add a string at a particular offset. The use of GADTs here allows us, above all, to move from an unknown state to a fixed state as soon as we receive the last fragment of our payload: as soon as we receive this fragment, we will know the final size of our payload.
let ( <+> ) : type a. a size -> int -> a size = function
  | Length a -> fun b -> Length (a + b)
  | Limitless -> fun _ -> Limitless

let length : type a. a t -> a size = function
  | Hole v -> v
  | Fragment str -> Length (String.length str)
  | Concat (_, _, _, Limitless) -> Limitless
  | Concat (_, _, ls, rs) -> rs <+> ls

let rec fixed : max:int -> unknown t -> fixed t
  = fun ~max -> function
  | Hole Limitless -> Hole (Length max)
  | Concat (l, r, ls, Limitless) ->
    let r = fixed ~max:(max - ls) r in
    let rs = length r in
    Concat (l, r, ls, rs)
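For example, if the last fragment tells us that the payload is 1992 bytes in total, we can fix the rope built above: the hole of unknown size on the right becomes a hole of 512 bytes (1992 - 1480).

(* usage of [fixed] on the [example] rope above: the [Hole Limitless]
   becomes [Hole (Length 512)] and the whole rope is now a [fixed t]. *)
let example : fixed t = fixed ~max:1992 example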
Since we can "fix" our rope, we can now "flatten" it. That is to say: effectively apply the concatenation into a bytes that we are going to allocate. However, receiving the last fragment does not yet mean that we have received our entire payload. In reality, there may still be holes. But we can now allocate our final payload, and we will associate with it a new structure that will inform us of the current holes and fragments in our final payload: a discrete interval encoding tree.
Fragmented packets with a known size
A diet is also a tree, but this time with intervals that allow us to know which area has been filled (with a fragment) and which area is free for future "filling" with the arrival of a future fragment.
Here, we are already more or less preparing the ground for what we can pass on to the layer above (TCP): a complete payload, but in the form of bytes (remember that from now on we would prefer to use bytes rather than systematically having Cstruct.t). So here's how to go from a "fixed" rope to our final payload plus our diet:
let to_bytes : fixed t -> Diet.t * bytes = fun t ->
  let Length len = length t in
  let buf = Bytes.create len in
  let rec go diet off = function
    | Fragment str ->
      let len = String.length str in
      Bytes.blit_string str 0 buf off len;
      Diet.add ~off ~len diet
    | Hole (Length 0) -> diet
    | Hole (Length len) ->
      Bytes.fill buf off len '\000';
      Diet.add ~off ~len diet
    | Concat (l, r, ls, _) ->
      let diet = go diet off l in
      go diet (off + ls) r in
  let diet = go Diet.empty 0 t in
  (diet, buf)
Inserting a new fragment therefore consists of checking that the area we would like to "fill" is free and making the copy.
let insert (t : fragmented payload) ~off fragment =
  match t with
  | Sized (diet, buf) ->
    let len = String.length fragment in
    if off < 0
    || off > Bytes.length buf - len
    then raise Out_of_bounds;
    begin match Diet.add ~off ~len diet with
    | Ok diet ->
      Bytes.unsafe_blit_string fragment 0 buf off len;
      Sized (diet, buf)
    | Error `Overlap -> raise Overlap end
  | Unsized _ -> ...
The last question is to find out when our payload is "complete". Our diet evolves and contains several intervals corresponding to the places we have filled. The objective here is to know whether all of these intervals are equivalent to the interval [0; end_of_payload[ — if this is the case, it means that we no longer have any holes. The diet structure has a very inexpensive operation that allows us to compute the difference between our intervals and the interval corresponding to the totality of our payload: Diet.diff. If the result of this diff is "empty", it means there are no more holes!
let is_complete : type a. a payload -> bool = function
| Unsized _ -> false
| Sized (diet, buf) ->
let complete = Diet.add ~off:0 ~len:(Bytes.length buf) Diet.empty in
Diet.(is_empty (diff diet complete))
| Unfragmented _ -> true
Reassembly
All this allows us to implement what is perhaps the simplest algorithm for reassembling the fragments:
let reassemble_exn : fragmented payload -> string = function
| Unsized _ -> invalid_arg "Fragment.reassemble_exn: unsized payload"
| Sized (_, buf) as payload ->
if is_complete payload then Bytes.unsafe_to_string buf
else invalid_arg "Fragment.reassemble_exn: incomplete payload"
Multiple IPv4 packets and multiple fragments
Finally, when receiving an IPv4 packet, our objective is essentially to insert it (fragmented or not) into a cache that maintains the different payloads according to a unique ID given in the packet header, and then to be able to obtain the payloads, ready and reassembled, in order to transfer them to the layer above: the TCP layer!
Here, we are just going to reuse what already exists with the lru library. We tried to see if another implementation could satisfy us, but it seems that Lru.M is perfectly suited to our needs and is quite efficient.
Even if we delegate this question of the cache to another library, the fact remains that there is an attack vector on the IPv4 layer that can easily make any service based on our stack unavailable. The cache allows us to limit the number of fragments that we try to keep, but there are also two conditions under which we invalidate fragments:
- if a fragment of a uid that we already have appears too late (10 seconds after the first one), we "drop" all the previous fragments and consider the incoming fragment as a new packet to be constructed
- if we receive more than 16 fragments, we drop the entire packet
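In code, these two rules could look like this (hypothetical names; only the constants, 10 seconds and 16 fragments, come from the description above):

(* hypothetical helpers for the two invalidation rules described above. *)
let expired ~first_ns ~now_ns =
  Int64.sub now_ns first_ns > 10_000_000_000L (* 10s, in nanoseconds *)

let too_many_fragments count = count > 16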
Finally, μTCP!
Finally, we can build the TCP stack from our IPv4 stack. At this stage, we can already see that we have achieved two of our objectives:
- the construction of our TCP/IP stack is not done with functors but by assembling values, daemons and handlers.
- the IPv4 stack delivers incoming packets in the form of a variant between a bigarray (for non-fragmented packets) and a string (for fragmented packets, where it has been necessary to reassemble them).
type payload =
| Slice of Slice_bstr.t
| String of string
The idea of always being able to transfer the bigarray from the Ethernet layer is to unlock the possibility of having a zero-copy stack. Fragmentation of IPv4 packets is not common; we generally receive complete packets. The "happy-path" therefore consists of proposing a bigarray to μTCP in such a way that the latter can also avoid copying.
Then, the fact that it is a variant also makes it possible to distinguish the bigarray case, where we should be careful not to give Miou the opportunity to reschedule (and thus retain ownership of the bigarray).
The string case signals that a copy has been made (because of fragmentation) and that μTCP can do what it wants with it.
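For instance, a consumer of this variant might dispatch like this (a sketch, assuming a Slice_bstr.length function):

(* sketch: the [Slice] case must be consumed before yielding back to Miou,
   while the [String] case is an owned copy that can be kept around freely.
   [Slice_bstr.length] is assumed. *)
let length = function
  | Slice slice -> Slice_bstr.length slice
  | String str -> String.length str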
However, in reality, μTCP was developed with Cstruct.t (its development predates our transition from Cstruct.t to bytes, as done in mirage-crypto). We currently convert this variant into a Cstruct.t that μTCP can use. At least the lower layers are designed to unlock the possibility of finally eliminating our use of Cstruct.t.
Await and signals
μTCP is very well abstracted in the sense that, like most of our projects, the library is independent of any scheduler. But what does μTCP really need?
μTCP has a global state. This state must be manipulated as soon as IPv4 packets are received, but also periodically according to a "tick".
When packets are received or when a "tick" (our unit of time) has elapsed, μTCP provides a new version of this global state as well as:
- possibly IPv4 packets to be written
- abstract "values" to notify the user (of data reception or of the finalization of data writing)
The type of these values is abstract: μTCP knows neither how to construct these values nor how to use them. What is certain is that these values are associated with events desired by the user.
First of all, we are going to construct this global state and the latter will know how to create these abstract values — because we will pass it a function creating this type of values.
Then, as soon as it comes to reading from a "socket", μTCP will create one of these values, give it to us, and it will be up to us to use it. Using it consists of two things:
- waiting for an event through this value
- notifying the user that something has just happened, possibly while waiting for an event (such as the reception of bytes)
More generally, this waiting and notification mechanism is basically a Condition. It turns out that Miou offers one: Miou.Condition.t. Unlike lwt (with Lwt_condition.t), Miou.Condition.t is closer to Stdlib.Condition.t. In this case, μTCP not only wants to wait for an event and notify the user of an event, but also to transfer this event. We will therefore associate a queue with our condition so that we can transfer these events:
module Notify = struct
type 'a t =
{ queue : 'a Queue.t
; mutex : Miou.Mutex.t
; condition : Miou.Condition.t }
let create () =
{ queue= Queue.create ()
; mutex= Miou.Mutex.create ()
; condition= Miou.Condition.create () }
let signal value t =
Miou.Mutex.protect t.mutex @@ fun () ->
Queue.push value t.queue;
Miou.Condition.signal t.condition
let await t =
Miou.Mutex.protect t.mutex @@ fun () ->
while Queue.is_empty t.queue do
Miou.Condition.wait t.condition t.mutex
done;
Queue.pop t.queue
end
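As a usage sketch: a reader blocks on Notify.await while another task (such as the μTCP daemon below) signals events.

(* usage sketch under the [Notify] module above: the value type here
   mirrors the [(_, [ `Eof | `Msg of string ]) result] events that the
   daemon below signals. *)
let rcv : (string, [ `Eof | `Msg of string ]) result Notify.t = Notify.create ()

let rec reader () = match Notify.await rcv with
  | Ok str -> print_string str; reader ()
  | Error `Eof -> ()
  | Error (`Msg err) -> failwith err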
Finally, we will create a daemon (just like Ethernet) that will execute a μTCP function at each tick (every 100ms) in order to update the global state.
let rec daemon state n =
let tcp, drops, outs = Utcp.timer state.tcp (now ()) in
state.tcp <- tcp;
let fn = write_ipv4_packet state.ipv4 in
List.iter fn outs;
let fn (_id, err, rcv, snd) =
let err = match err with
| `Retransmission_exceeded -> `Msg "retransmission exceeded"
| `Timer_2msl -> `Eof
| `Timer_connection_established -> `Eof
| `Timer_fin_wait_2 -> `Eof in
let err = Error err in
Notify.signal err rcv;
Notify.signal err snd in
List.iter fn drops;
Miou_solo5.sleep 100_000_000;
daemon state (n+1)
And that's it: μTCP only needs this Condition mechanism and the notion of time. We can now look at the API that we want to offer (sketched just after this list) to:
- create a socket
- read and write on a socket
- listen on a specific port
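The shape of this API, inferred from the echo server below (a sketch, not the library's actual signature):

(* inferred from the code below — a sketch, not the actual signature. *)
module type TCPV4 = sig
  type t        (* the TCP/IP stack *)
  type listener (* a port being listened on *)
  type flow     (* an established connection *)

  val listen : t -> int -> listener
  val accept : t -> listener -> flow
  val read : flow -> bytes -> off:int -> len:int -> int
  val write : flow -> string -> unit
  val close : flow -> unit
end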
Results
After a few bugs, we managed to make a unikernel with OCaml 5 that can not only respond to our pings (via the ICMPv4 protocol) but also repeat what we send it via TCP! The big advantage is that we can now write this kind of code in our unikernel:
let echo flow () =
let finally () = TCPv4.close flow in
Fun.protect ~finally @@ fun () ->
let buf = Bytes.create 0x7ff in
let rec go () =
(* read *)
let len = TCPv4.read flow buf ~off:0 ~len:(Bytes.length buf) in
if len > 0
then go (TCPv4.write flow (Bytes.sub_string buf 0 len)) in (* and write *)
go ()
let rec clean orphans = match Miou.care orphans with
| None | Some None -> ()
| Some (Some prm) -> Miou.await_exn prm; clean orphans
let run _quiet cidr gateway =
Miou_solo5.(run [ tcpv4 ~name:"service" ?gateway cidr ]) @@ fun (daemon, tcpv4) () ->
let rng = Mirage_crypto_rng_miou_solo5.initialize (module Mirage_crypto_rng.Fortuna) in
let finally () =
Mirage_crypto_rng_miou_solo5.kill rng;
kill daemon in
Fun.protect ~finally @@ fun () ->
let m = TCPv4.listen tcpv4 8080 in (* listen on *:8080 *)
let rec go orphans =
clean orphans;
let flow = TCPv4.accept tcpv4 m in
ignore (Miou.async ~orphans (echo flow));
go orphans in
go (Miou.orphans ())
For those who would like to understand this code a little better, I invite you to read our little tutorial on Miou about how to make an echo server, available here.
We can now start to imagine the possible implementation of an HTTP server with TLS, as we already offer with httpcats & vif — but this time, in the form of a unikernel!
However, before considering the best, let's look at the worst:
- Currently, there are still some bugs on μTCP. Even though we are currently using this stack for our cooperative's TLS proxy, we still find some bugs here and there that can be quite hairy.
- Performance is not as good as what we currently have with mirage, mirage-tcpip and lwt. There are several reasons for this, both macro (considering the design of Miou and the design of lwt) and micro (checksum, allocation of Cstruct.t, etc.).
In short, we will take the time to describe all this in a possible future article. But at this stage, at least we have something that works! The next article will surely cover three things that are essential at this stage: the performance of μTCP and miou-solo5, how to "profile" a unikernel and obtain metrics in order to direct our optimizations, and the existential question of the availability of our unikernel versus performance.
Our work is only partially funded; we cross-fund it through commercial contracts and public (EU) funding. We are part of a non-profit company: you can make a donation (tax-deductible in the EU) or sponsor us via the GitHub sponsor button.