μTCP, Miou and unikernels
2025-03-24

You may know that we recently enabled the compilation of unikernels with OCaml 5! This rather impressive work, which brings together the contributions of several people, has unlocked a new possibility for unikernels: the use of effects.
Before going any further: the design of unikernels with Solo5 does not allow the use of multiple cores. Supporting them would require a significant amount of work on Solo5, which we are not necessarily very interested in: unikernels remain fairly simple and should, in our opinion, only require a single core for the majority of their objectives — namely, implementing small services that can fit into an existing infrastructure1.
Also, unikernels with OCaml 5 already work, such as our good old qubes-mirage-firewall. We took the time to test, because the OCaml 5 runtime has a lot of changes that could have broken behaviors we relied on with OCaml 4.14.
But what has changed now, and this is what I will try to show in this article, is the use of effects instead of lwt. For this, we at Robur have developed our own scheduler, Miou, which was designed in particular to be usable one day in a unikernel.
One last point about all this work: it takes advantage of a number of opportunities that we have been thinking about for quite some time:
- the first is to use fewer functors. Our abstractions available in ocaml-tar, utcp, ocaml-tls and decompress have stood the test of time in our opinion — much more so than functors.
- the second is to use bytes/string rather than bigstrings (even if the latter are still useful). Once again, it is difficult to talk about an improvement in performance, but it is worth a try.
- the third is to use μTCP instead of mirage-tcpip. We have encountered quite a few problems with the latter, its code is becoming quite unmaintainable, and it is in any case more or less tied to lwt.
A note about Solo5
Before we start talking code, we need to understand a little of what Solo5 has to offer and what we need to do to enable our unikernels to chat with the rest of the world.
Solo5 is very basic. It delivers (it's in the name) 5 "hypercalls", that is to say 5 "functions" that enable interaction with external events:
- reading and writing on a block-device. A block device can be seen as a hard disk to which we have access "per page" (you can refer to Cachet, for example, to read/write in such a block-device). On the host system (the one that virtualizes the unikernel), it can be a simple file or an actual volume (such as a ZFS volume).
- reading and writing on a net-device. A net-device can be seen as an Ethernet socket. On the host system, it is called a TAP interface, which is a virtual interface that can be attached to a virtual bridge to create a "network".
- the unikernel shutdown, which is the last available hypercall!
You can see the documentation for these hypercalls here. That's all that Solo5 offers for the application domain.
Using these hypercalls, a unikernel can interact with the outside world via files or virtual network interfaces. For the latter, the only thing available to us is Ethernet packets with a destination (possibly our unikernel) and a source (another Ethernet interface). And that's it. There is no FTP, SSH or even HTTP. There isn't even TCP/IP — there isn't even IP...
The choice we have made is therefore to re-implement all these protocols in OCaml and to manage all these "layers" without using libraries which are often made in C. We therefore need:
- an implementation of the Ethernet protocol
- an implementation of the ARP protocol
- an implementation of the IPv4 protocol
- an implementation of the ICMPv4 protocol
- an implementation of the TCP protocol
Fortunately, we already have one in OCaml: mirage-tcpip. But, as mentioned above, we would like to take advantage of certain opportunities that would be rather difficult to seize in this project.
A note about effects
From our point of view, effects unlock a new capacity for abstraction. They allow us not only to express a whole host of things in a different way, but above all to express the interactions with a scheduler without involving a monad ('a t).

It is above all this aspect that is interesting for us, because we can then compose other monads easily without considering the question of the scheduler (lwt or async). This allows us, in particular, to play more easily with the Result monad, for example.
The idea is therefore not to revolutionize the way we make our unikernels but above all to unlock other methods of abstraction that until now have required "tricks" such as higher-kinded polymorphism.
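To make this concrete, here is a toy sketch (not Miou's actual implementation, and Read is a purely hypothetical effect): with effects, read returns a plain string rather than a string Lwt.t, and the suspension lives entirely in the handler.

(* a toy sketch of direct style via effects, not Miou's implementation.
   [Read] is a hypothetical effect that a scheduler would interpret; note
   that [read] returns a plain [string], not a ['a t]. *)
open Effect
open Effect.Deep

type _ Effect.t += Read : string Effect.t

let read () = perform Read

let run fn =
  match_with fn ()
    { retc= Fun.id
    ; exnc= raise
    ; effc= fun (type a) (eff : a Effect.t) ->
        match eff with
        | Read ->
          Some (fun (k : (a, _) continuation) ->
            (* a real scheduler would suspend [k] until a frame arrives;
               here, we resume immediately with a dummy value. *)
            continue k "hello")
        | _ -> None }

let () = run (fun () -> print_endline (read ()))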
At the beginning, Ethernet
Ethernet is the lowest layer. It consists of a header containing the source MAC address and the destination MAC address. Then there is a payload2, which can be an IPv4 or IPv6 packet or ARPv4, for example.
type packet =
{ src : Macaddr.t
; dst : Macaddr.t
; payload : Slice_bstr.t }
This is where we would like to use our hypercall (Miou_solo5.Net.read_bigstring) to read the incoming Ethernet frames. This is also where we will have our only blocking hypercall. Considering Miou, the design is very simple: launch a background task, a daemon, which will attempt to read the incoming frames. It will then attempt to decode these frames and determine whether our unikernel is the destination (according to its MAC address specified by Solo5).
let rec daemon t =
  (* we are blocked here until a new "frame" comes *)
  let len = Miou_solo5.Net.read_bigstring t.net t.bstr_ic in
  (* we received a new Ethernet frame of [len] bytes. *)
  let ok ({ Packet.protocol; src; dst }, payload) =
    match protocol with
    | None -> ()
    | Some protocol ->
      (* we recognized a protocol (IPv4 or ARPv4) *)
      let packet = { src= Some src; dst; protocol; payload } in
      (* is it for us? *)
      if Macaddr.compare dst t.mac = 0
      || not (Macaddr.is_unicast dst)
      then t.handler packet (* call the handler to handle the upper layer *) in
  let error = Fun.const () in
  let () = Result.fold ~ok ~error (Packet.decode t.bstr_ic ~len) in
  daemon t
However, there is one point that needs clarifying. In the Miou model, when it comes to executing the handler given by the user, it is preferable to do it using Miou.async: in other words, to execute the handler cooperatively with the daemon.
let rec clean orphans = match Miou.care orphans with
  | Some None | None -> ()
  | Some (Some prm) -> Miou.await_exn prm; clean orphans

let rec daemon t =
  clean t.orphans; (* clean-up terminated children *)
  let client = accept () in (* accept a new incoming connection *)
  (* launch a new task from the user's handler function and add it as a
     child into the [orphans] value. *)
  let _ = Miou.async ~orphans:t.orphans (t.handler client) in
  daemon t
It may even be more interesting to run the handler in parallel with Miou.call. This is particularly the case with httpcats, where the HTTP request handler is run in parallel with the "main loop". But Solo5 has only one core. So why this choice?
Implementation at this stage is really important, particularly with regard to performance. Allocating a new task (Miou.async) to execute the user's handler can have a significant cost. In fact, creating a new task consists of interacting with Miou, giving it the possibility of initializing everything that is necessary internally for this new task and finally executing it. Here, the choice is precisely not to cooperate at this level but to effectively "continue" towards the user's handler without interruption.
The fact that there is no interruption assures us of one thing: between what our daemon handles and the user's handler, Miou did not have the opportunity to do anything else. More specifically, Miou did not have the opportunity to try to read a new packet. This allows us to introduce a new concept related to bigarrays.
If you look at the implementation of our daemon, we try to read a frame and save it in a bigarray, t.bstr_ic. Note that we have pre-allocated t.bstr_ic and use it (and reuse it) to read our frames. After reading a frame, it is therefore in t.bstr_ic and can be analyzed.

If we then gave Miou the opportunity to do something else (by creating, for example, a cooperative task that would like to continue analyzing the frame — on the IP layer, for example), we would be in a situation where we could cooperatively write a new frame into t.bstr_ic while also analyzing our first frame in our cooperative task.
The problem described above is an ownership problem: two tasks can read and write to the same bigarray. We would like to be able to express that, as long as our bigarray is being analyzed, Miou should not try to write a new frame into it. This is a form of exclusion in the execution of our two tasks, being sure that the analysis takes ownership first. This is, for example, what can be expressed with Rust, but we are in OCaml...
One solution would be to use a mutex and be sure of this exclusion once and for all. Another way, and this is the one we are adopting here, is not to give Miou the opportunity to do anything else and simply to continue analyzing our bigarray, knowing that we only have one core (in the case of multiple cores, this assertion is no longer true).
The idea here is, from the Ethernet frame, to try to go as far as possible in the OSI model without giving Miou the opportunity to switch tasks, but also to keep ownership of our bigarray throughout and thus avoid copying it.
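Concretely, this composition is just nested function calls: a sketch with hypothetical Arpv4.input and Ipv4.input functions.

(* a sketch of the synchronous composition: each layer is an ordinary
   function call, so between [Miou_solo5.Net.read_bigstring] and the TCP
   layer, Miou never gets the opportunity to reschedule, and [t.bstr_ic]
   remains ours during the whole analysis. [Arpv4.input] and [Ipv4.input]
   are hypothetical. *)
let handler t packet = match packet.protocol with
  | `ARP -> Arpv4.input t.arpv4 packet.payload
  | `IPv4 -> Ipv4.input t.ipv4 packet.payload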
The payload here is a Slice_bstr.t, a type which more or less corresponds to Cstruct.t. The question of strings versus Cstruct.t/Slice_bstr.t arises later (on the IPv4 layer).
Then, ARP
The ARP layer is fairly simple to understand: it is a cache that associates IP addresses with MAC addresses. The purpose of this layer is to announce to the network that we have a specific IP address, but also to be able to ask where an IP address is located (and to obtain a MAC address for it).
However, we are starting to manage asynchronous tasks here. In fact, when it comes to sending an IPv4 packet to a specific IP address, we need to:
- write an ARP packet to ask the network for the MAC address of an IP address
- wait for an ARP packet from the network informing us of the MAC address
- then be able to write an IPv4 packet to the MAC address received from the network
It is also here that the cache is very important, as we wouldn't want to make this request every time we wanted to send an IPv4 packet. In general, we only do it once and we keep the entry for a certain time.
However, our process of writing a new IPv4 packet is indeed awaiting a response from the ARP layer: a response from the cache or the network.
let write_ipv4 t ~(dst : Ipaddr.V4.t) ipv4_pkt =
  let ivar = Routing.destination_macaddr t.arpv4 dst in
  (* who-has [dst]? *)
  let macaddr = Miou.Computation.await_exn ivar in
  (* [dst] is-at [macaddr]. *)
  Ethernet.write t.eth t.src ~dst:macaddr ipv4_pkt
Miou offers a Miou.Computation module that allows a value to be transferred from one task to another (in an atomic way — it is therefore domain-safe). Thus, in the above example, we can wait for the value from our Miou.Computation.t, and the ARP layer will "fill in" our Miou.Computation.t with the response from the cache (therefore without latency) or from the network (after reading and decoding an ARP packet).

A Miou.Computation.t can be seen as an IVar: only one value can be transferred. Miou then ensures that this transfer can be done cooperatively but also in parallel.
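On the ARP side, "filling in" the computation might look like this (a sketch: t.pending is a hypothetical table of pending queries, and we assume that Miou.Computation exposes a try_return in the style of picos):

(* sketch of the ARP side: when an "is-at" reply arrives, complete the
   pending computation. [t.pending] is hypothetical and we assume
   [Miou.Computation.try_return], which returns [false] if the computation
   was already completed. *)
let on_reply t ~ip ~macaddr =
  match Hashtbl.find_opt t.pending ip with
  | None -> () (* nobody asked; just refresh the cache *)
  | Some ivar ->
    ignore (Miou.Computation.try_return ivar macaddr);
    Hashtbl.remove t.pending ip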
Followed by IPv4
IPv4 is perhaps the most difficult layer, because this is where packet fragmentation appears. We have not yet mentioned it, but we cannot write packets of unlimited size on the network: IPv4 packets are limited by what is called the MTU (Maximum Transmission Unit).
If the user wants to write an IPv4 packet of 2000 bytes when the MTU is 1500 (which is the most common value), our unikernel must "split" the packet into several fragments so that it respects the MTU. The reverse is also true: a packet can be fragmented and it is up to us to reassemble it. This can happen, for example, if a router on the network has a smaller MTU — even if our unikernel has an MTU of 1500 bytes, fragmentation may occur at a smaller MTU depending on the network. One last "detail": the fragments do not necessarily arrive in order!
Here we are going to have a little fun with GADTs in order to specify the IPv4 packets that we receive. We can receive 3 types of IPv4 packets:
- a non-fragmented packet, which is the most common. It is not necessary to "reassemble" it. It can be given as is to the upper layer (TCP) and we can continue to do so without copying if we still do not give Miou the opportunity to perform another task
- a fragmented packet that contains the "end" of our "payload". This packet is particularly interesting because we then know the final size of our payload. We can therefore preallocate a buffer that can contain all the fragments we are going to receive. We just have to check that we are not writing over a fragment that has already been received, and detect when this preallocated buffer is full (that is, when all the fragments that compose it have been received)
- a fragmented packet without the "end": this is the worst-case scenario, because we do not know the total size of the packet and we have to "concatenate" the received fragment with other fragments into something incomplete that will gradually become our complete packet
type fragmented = Fragmented
type unfragmented = Unfragmented
type 'a payload =
| Unsized : Ropes.unknown Ropes.t -> fragmented payload
| Sized : Diet.t * bytes -> fragmented payload
| Unfragmented : Slice_bstr.t -> unfragmented payload
type t = Payload : 'a payload -> t [@@unboxed]
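Before detailing each case, a small illustration of what this GADT gives us: a total function whose cases know, at the type level, whether a final size exists (assuming a Slice_bstr.length function).

(* illustration: only the [Unsized] case has no known size; the GADT makes
   this explicit. [Slice_bstr.length] is assumed here. *)
let known_length : type a. a payload -> int option = function
  | Unfragmented slice -> Some (Slice_bstr.length slice)
  | Sized (_, buf) -> Some (Bytes.length buf)
  | Unsized _ -> None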
Fragmented packets with an unknown size
As we have said, we may receive a fragment that does not inform us of the final size of the payload. It just informs us that the given fragment has a position in the final payload, its offset.
First of all, we must save the given fragment. Remember that we are currently manipulating the bigarray also used by the Ethernet module to read the next frame. If we don't make a copy and only use this bigarray, its content will change and we won't be able to construct the final payload correctly.
Then, the reception of several of these fragments implies a form of concatenation of the latter. At this stage, we would like to describe this concatenation without actually performing it (such as fragment0 ^ fragment1). We will therefore use a fairly well-known structure: the rope.
A rope is a simple binary tree in which our fragments will be found. In this tree, a node corresponds to what we imagine to be the concatenation between the left part and the right part. This allows us to keep the idea that there is indeed a concatenation without actually doing it. Concatenation of bytes is costly: it involves the creation of a third bytes and the copying of our fragments.
Using a rope here consists, with regard to concatenation, of just creating a new node. I recommend the implementation of Filliâtre but also its proven version.
Now, our rope is a bit special. In fact, the only parameter that we don't know is the final size of our payload. So our rope has holes: some nodes have a known size (for example, if we have a fragment at offset 1480, this means that the left part has a known size of 1480 bytes) while others (which we will find mainly in the right part of our nodes) have an unknown size.
type fixed = Fixed
type unknown = Unknown
type 'a t =
| Fragment : string -> fixed t
| Hole : 'a size -> 'a t
| Concat : fixed t * 'a t * int * 'a size -> 'a t
(* left part, right part, size of the left part, size of the right part *)
and 'a size =
| Length : int -> fixed size
| Limitless : unknown size
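For example, after receiving a first 1480-byte fragment at offset 0 (and nothing else yet), the rope looks like this:

(* a hand-built value under the definitions above: a 1480-byte fragment at
   offset 0 followed by a hole of unknown size: we are still waiting for
   the remaining fragments. *)
let example : unknown t =
  Concat (Fragment (String.make 1480 'x'), Hole Limitless, 1480, Limitless)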
I will spare you the implementation of the insert function, which allows you to add a string at a particular offset. The use of GADTs here allows us, above all, to move from an unknown state to a fixed state as soon as we receive the last fragment of our payload: as soon as we receive this fragment, we will know the final size of our payload.
let ( <+> ) : type a. a size -> int -> a size = function
  | Length a -> fun b -> Length (a + b)
  | Limitless -> fun _ -> Limitless

let length : type a. a t -> a size = function
  | Hole v -> v
  | Fragment str -> Length (String.length str)
  | Concat (_, _, _, Limitless) -> Limitless
  | Concat (_, _, ls, rs) -> rs <+> ls

let rec fixed : max:int -> unknown t -> fixed t
  = fun ~max -> function
  | Hole Limitless -> Hole (Length max)
  | Concat (l, r, ls, Limitless) ->
    let r = fixed ~max:(max - ls) r in
    let rs = length r in
    Concat (l, r, ls, rs)
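For example, if the last fragment tells us that the payload is 1992 bytes in total, we can fix the rope built above: the hole of unknown size on the right becomes a hole of 512 bytes (1992 - 1480).

(* usage of [fixed] on the [example] rope above: the [Hole Limitless]
   becomes [Hole (Length 512)] and the whole rope is now a [fixed t]. *)
let example : fixed t = fixed ~max:1992 example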
Since we can "fix" our rope, we can now "flatten" it. That is to say: effectively apply the concatenation into a bytes that we are going to allocate. However, receiving the last fragment does not yet mean that we have received our entire payload. In reality, there may still be holes. But we can now allocate our final payload, and we will associate with it a new structure that will inform us of the current holes and fragments in our final payload: a discrete interval encoding tree.
Fragmented packets with a known size
A diet is also a tree, but this time with intervals that allow us to know which area has been filled (with a fragment) and which area is free for future "filling" with the arrival of a future fragment.
Here, we are already more or less preparing the ground for what we can pass on to the layer above (TCP): a complete payload, but in the form of bytes (remember that from now on we would prefer to use bytes rather than systematically having Cstruct.t). So here's how to go from a "fixed" rope to our final payload plus our diet:
let to_bytes : fixed t -> Diet.t * bytes = fun t ->
  let Length len = length t in
  let buf = Bytes.create len in
  let rec go diet off = function
    | Fragment str ->
      let len = String.length str in
      Bytes.blit_string str 0 buf off len;
      Diet.add ~off ~len diet
    | Hole (Length 0) -> diet
    | Hole (Length len) ->
      Bytes.fill buf off len '\000';
      Diet.add ~off ~len diet
    | Concat (l, r, ls, _) ->
      let diet = go diet off l in
      go diet (off + ls) r in
  let diet = go Diet.empty 0 t in
  (diet, buf)
Inserting a new fragment therefore consists of checking that the area we would like to "fill" is free and making the copy.
let insert (t : fragmented payload) ~off fragment =
  match t with
  | Sized (diet, buf) ->
    let len = String.length fragment in
    if off < 0
    || off > Bytes.length buf - len
    then raise Out_of_bounds;
    begin match Diet.add ~off ~len diet with
    | Ok diet ->
      Bytes.unsafe_blit_string fragment 0 buf off len;
      Sized (diet, buf)
    | Error `Overlap -> raise Overlap end
  | Unsized _ -> ...
The last question is to find out when our payload is "complete". Our diet evolves and contains several intervals corresponding to the places we have filled. The objective here is to know whether all of these intervals are equivalent to the interval [0; end_of_payload[ — if this is the case, it means that we no longer have any holes. The diet structure has a very inexpensive operation that allows us to compute the difference between our intervals and the interval corresponding to the totality of our payload: Diet.diff. If the result of this diff is "empty", it means there are no more holes!
let is_complete : type a. a payload -> bool = function
| Unsized _ -> false
| Sized (diet, buf) ->
let complete = Diet.add ~off:0 ~len:(Bytes.length buf) Diet.empty in
Diet.(is_empty (diff diet complete))
| Unfragmented _ -> true
Reassembly
All this allows us to implement what is perhaps the simplest algorithm for reassembling the fragments:
let reassemble_exn : fragmented payload -> string = function
| Unsized _ -> invalid_arg "Fragment.reassemble_exn: unsized payload"
| Sized (_, buf) as payload ->
if is_complete payload then Bytes.unsafe_to_string buf
else invalid_arg "Fragment.reassemble_exn: incomplete payload"
Multiple IPv4 packets and multiple fragments
Finally, when receiving an IPv4 packet, our objective is essentially to insert it (fragmented or not) into a cache that maintains the different payloads according to a unique ID given in the packet header, and then to be able to obtain the payloads, ready and reassembled, in order to transfer them to the layer above: the TCP layer!
Here, we are just going to reuse what already exists with the lru library. We tried to see if another implementation could satisfy us, but it seems that Lru.M is perfectly suited to our needs and is quite efficient.
Even if we delegate this question of the cache to another library, the fact remains that there is an attack vector on the IPv4 layer that can easily make any service based on our stack unavailable. The cache allows us to limit the number of fragments that we try to keep, but there are also two conditions under which we invalidate fragments:
- if a fragment of a uid that we already have appears too late (10 seconds after the first one), we "drop" all the previous fragments and consider the incoming fragment as a new packet to be constructed
- if we receive more than 16 fragments, we drop the entire packet
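In code, these two rules could look like this (hypothetical names; only the constants, 10 seconds and 16 fragments, come from the description above):

(* hypothetical helpers for the two invalidation rules described above. *)
let expired ~first_ns ~now_ns =
  Int64.sub now_ns first_ns > 10_000_000_000L (* 10s, in nanoseconds *)

let too_many_fragments count = count > 16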
Finally, μTCP!
Finally, we can build the TCP stack from our IPv4 stack. At this stage, we can already see that we have achieved two of our objectives:
- the construction of our TCP/IP stack is not done with functors but by assembling values, daemons and handlers.
- the IPv4 stack delivers incoming packets in the form of a variant between a bigarray (for non-fragmented packets) and a string (for fragmented packets, where it has been necessary to reassemble them).
type payload =
| Slice of Slice_bstr.t
| String of string
The idea of always being able to transfer the bigarray from the Ethernet layer is to unlock the possibility of having a zero-copy stack. Fragmentation of IPv4 packets is not common; we generally receive complete packets. The "happy-path" therefore consists of proposing a bigarray to μTCP in such a way that the latter can also avoid copying.
Then, the fact that it is a variant also makes it possible to distinguish the bigarray case, where we should be careful not to give Miou the opportunity to reschedule (and thus retain ownership of the bigarray).
The string case signals that a copy has been made (because of fragmentation) and that μTCP can do what it wants with it.
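For instance, a consumer of this variant might dispatch like this (a sketch, assuming a Slice_bstr.length function):

(* sketch: the [Slice] case must be consumed before yielding back to Miou,
   while the [String] case is an owned copy that can be kept around freely.
   [Slice_bstr.length] is assumed. *)
let length = function
  | Slice slice -> Slice_bstr.length slice
  | String str -> String.length str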
However, in reality, μTCP was developed with Cstruct.t (its development predates our transition from Cstruct.t to bytes, as done in mirage-crypto). We currently convert this variant into a Cstruct.t that μTCP can use. At least the lower layers are designed to unlock the possibility of finally eliminating our use of Cstruct.t.
Await and signals
μTCP is very well abstracted in the sense that, like most of our projects, the library is independent of any scheduler. But what does μTCP really need?
μTCP has a global state. This state must be manipulated as soon as IPv4 packets are received, but also periodically according to a "tick".
When packets are received or when a "tick" (our unit of time) has elapsed, μTCP provides a new version of this global state as well as:
- possibly IPv4 packets to be written
- abstract "values" to notify the user (of data reception or of the finalization of data writing)
The type of these values is abstract: μTCP knows neither how to construct these values nor how to use them. What is certain is that these values are associated with events desired by the user.
First of all, we are going to construct this global state and the latter will know how to create these abstract values — because we will pass it a function creating this type of values.
Then, as soon as it comes to reading from a "socket", μTCP will create one of these values, give it to us, and it will be up to us to use it. Using it consists of two things:
- waiting for an event through this value
- notifying the user that something has just happened, possibly while waiting for an event (such as the reception of bytes)
More generally, this waiting and notification mechanism is basically a Condition. It turns out that Miou offers one: Miou.Condition.t. Unlike lwt (with Lwt_condition.t), Miou.Condition.t is closer to Stdlib.Condition.t. In this case, μTCP not only wants to wait for an event and notify the user of an event, but also to transfer this event. We will therefore associate a queue with our condition so that we can transfer these events:
module Notify = struct
type 'a t =
{ queue : 'a Queue.t
; mutex : Miou.Mutex.t
; condition : Miou.Condition.t }
let create () =
{ queue= Queue.create ()
; mutex= Miou.Mutex.create ()
; condition= Miou.Condition.create () }
let signal value t =
Miou.Mutex.protect t.mutex @@ fun () ->
Queue.push value t.queue;
Miou.Condition.signal t.condition
let await t =
Miou.Mutex.protect t.mutex @@ fun () ->
while Queue.is_empty t.queue do
Miou.Condition.wait t.condition t.mutex
done;
Queue.pop t.queue
end
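As a usage sketch: a reader blocks on Notify.await while another task (such as the μTCP daemon below) signals events.

(* usage sketch under the [Notify] module above: the value type here
   mirrors the [(_, [ `Eof | `Msg of string ]) result] events that the
   daemon below signals. *)
let rcv : (string, [ `Eof | `Msg of string ]) result Notify.t = Notify.create ()

let rec reader () = match Notify.await rcv with
  | Ok str -> print_string str; reader ()
  | Error `Eof -> ()
  | Error (`Msg err) -> failwith err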
Finally, we will create a daemon (just like Ethernet) that will execute a μTCP function at each tick (every 100ms) in order to update the global state.
let rec daemon state n =
let tcp, drops, outs = Utcp.timer state.tcp (now ()) in
state.tcp <- tcp;
let fn = write_ipv4_packet state.ipv4 in
List.iter fn outs;
let fn (_id, err, rcv, snd) =
let err = match err with
| `Retransmission_exceeded -> `Msg "retransmission exceeded"
| `Timer_2msl -> `Eof
| `Timer_connection_established -> `Eof
| `Timer_fin_wait_2 -> `Eof in
let err = Error err in
Notify.signal err rcv;
Notify.signal err snd in
List.iter fn drops;
Miou_solo5.sleep 100_000_000;
daemon state (n+1)
And that's it: μTCP only needs this Condition mechanism and the notion of time. We can now look at the API that we want to offer (sketched just after this list) to:
- create a socket
- read and write on a socket
- listen on a specific port
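The shape of this API, inferred from the echo server below (a sketch, not the library's actual signature):

(* inferred from the code below — a sketch, not the actual signature. *)
module type TCPV4 = sig
  type t        (* the TCP/IP stack *)
  type listener (* a port being listened on *)
  type flow     (* an established connection *)

  val listen : t -> int -> listener
  val accept : t -> listener -> flow
  val read : flow -> bytes -> off:int -> len:int -> int
  val write : flow -> string -> unit
  val close : flow -> unit
end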
Results
After a few bugs, we managed to make a unikernel with OCaml 5 that can not only respond to our pings (via the ICMPv4 protocol) but also repeat what we send it via TCP! The big advantage is that we can now write this kind of code in our unikernel:
let echo flow () =
let finally () = TCPv4.close flow in
Fun.protect ~finally @@ fun () ->
let buf = Bytes.create 0x7ff in
let rec go () =
(* read *)
let len = TCPv4.read flow buf ~off:0 ~len:(Bytes.length buf) in
if len > 0
then go (TCPv4.write flow (Bytes.sub_string buf 0 len)) in (* and write *)
go ()
let rec clean orphans = match Miou.care orphans with
| None | Some None -> ()
| Some (Some prm) -> Miou.await_exn prm; clean orphans
let run _quiet cidr gateway =
Miou_solo5.(run [ tcpv4 ~name:"service" ?gateway cidr ]) @@ fun (daemon, tcpv4) () ->
let rng = Mirage_crypto_rng_miou_solo5.initialize (module Mirage_crypto_rng.Fortuna) in
let finally () =
Mirage_crypto_rng_miou_solo5.kill rng;
kill daemon in
Fun.protect ~finally @@ fun () ->
let m = TCPv4.listen tcpv4 8080 in (* listen on *:8080 *)
let rec go orphans =
clean orphans;
let flow = TCPv4.accept tcpv4 m in
ignore (Miou.async ~orphans (echo flow));
go orphans in
go (Miou.orphans ())
For those who would like to understand this code a little better, I invite you to read our little tutorial on Miou about how to make an echo server, available here.
We can now start to imagine the possible implementation of an HTTP server with TLS, as we already offer with httpcats & vif — but this time, in the form of a unikernel!
However, before considering the best, let's look at the worst:
- Currently, there are still some bugs on μTCP. Even though we are currently using this stack for our cooperative's TLS proxy, we still find some bugs here and there that can be quite hairy.
- Performance is not as good as what we currently have with mirage, mirage-tcpip and lwt. There are several reasons for this, both macro (considering the design of Miou and the design of lwt) and micro (checksum, allocation of Cstruct.t, etc.).
In short, we will take the time to describe all this in a possible future article. But at this stage, at least we have something that works! The next article will surely cover three things that are essential at this stage: the performance of μTCP and miou-solo5, how to "profile" a unikernel and obtain metrics in order to direct our optimizations, and the existential question of the availability of our unikernel versus performance.
Our work is only partially funded; we cross-fund it through commercial contracts and public (EU) funding. We are part of a non-profit company: you can make a donation (tax-deductible in the EU) or sponsor us via the GitHub sponsor button.