blog.robur.coop

The Robur cooperative blog.
Back to index

A webapp to search emails as an unikernel

2025-12-04

We are pleased to announce the development of our new unikernel: blame. It is a unikernel that allows you to launch a webapp for searching emails from an archive.

If you want to follow this article, we recommend reading two other articles:

blame is therefore the synthesis of these two works, but in the form of and unikernel. We will therefore elaborate on this part a little more and show what the new era of unikernels in OCaml could be.

How to compile & use blame?

blame is part of a series of projects that are trying out a new workflow in unikernel development. This project no longer requires the mirage tool and requires OCaml 5, using effects and Miou as a scheduler. In this sense, it is quite different from what is currently available in terms of unikernels in OCaml. We will therefore take a detailed look at how to build such a unikernel and how it was developed.

There are three main constraints for building a unikernel:

  1. All C sources must be compiled with certain options (notably the -ffreestanding option and include Solo5 headers).
  2. The POSIX interface is not available (and therefore neither is the unix.cmxa module).
  3. the linking for object files is different from the one used to produce a simple executable (we notably do a static link)

Given these constraints, producing a unikernel from OCaml code requires a compilation context (or a toolchain). Fortunately, dune allows you to describe such a context, and we have been offering such a toolchain for quite some time: ocaml-solo5. This means that building a unikernel essentially involves compiling a project (always bearing in mind that the Unix module does not exist) with our modified OCaml compiler (so that the compilation of C files has the expected options and the linking corresponds to the construction of a unikernel).

If you are interested, slides on this topic are available here.

The problem is that there are projects (such as mirage-crypto) that contain C files. If we try to link mirage_crypto.cmxa available in our OPAM switch with our unikernel, there would be an error (since the C files would not have been compiled with our modified OCaml compiler but instead with the host compiler).

This is where dune offers us another solution: we can "vendor" the dependencies. So, the first step in building blame is to download the sources of the dependencies.

The fundamental difference with the mirage tool

At this point, experts might say that this is what the mirage tool does. However, there is a subtle (yet very problematic) step that the mirage tool performs that we no longer do.

blame is a unikernel. It is not an executable and is a unikernel specific to Solo5. In this case, the dependencies needed to build blame are known "in advance": we know that we will use mirage-crypto-rng-mkernel or vifu.

The mirage tool had a step that consists of choosing what we wanted to produce and, based on that choice, resolving the dependencies needed for the unikernel. For blame, that is no longer the case: there is no longer a resolution phase.

In reality, we leave this phase to opam, which means that attempting to install blame means letting opam resolve the dependencies needed for blame (as it can do for any software available in opam-repository). The sources of these dependencies (installed) can be re-obtained using opam source (because, let's not forget, we cannot use the mirage_crypto.cmxa from our switch but recompile it with our compiler).

Ultimately, the idea is essentially to remove the opam monorepo tool used by mirage, which never really gained consensus and caused problems in the unikernel development workflow.

Fetch and dune build

Ultimately, there is a small script in blame that simply calls opam source. Finally, it just consists of calling dune build (as we do for any dune project):

$ git clone https://git.robur.coop/robur/blame.git
$ cd blame
$ opam pin add --yes --no-action .
$ opam install --deps-only blame
$ ./source.sh
Successfully extracted to /home/dinosaure/dev/blame/vendors/bstr
Successfully extracted to /home/dinosaure/dev/blame/vendors/digestif
Successfully extracted to /home/dinosaure/dev/blame/vendors/vifu
...
$ dune build
$ file _build/solo5/main.exe
_build/solo5/main.exe: ELF 64-bit LSB executable, x86-64, version 1 (SYSV),
  statically linked, interpreter /nonexistent/solo5/, for OpenBSD,
  with debug_info, not stripped

Run an unikernel

This is where most people (myself included) have always struggled: running a unikernel. A unikernel is not an executable and needs to be allocated a CPU and memory to run (just like a real operating system). However, our unikernel should be able to:

  1. communicate via protocols (such as http)
  2. manipulate an email archive

We are talking here about devices, and Solo5 offers two types: net devices (which can be thought of as Ethernet ports) and block devices (which can be thought of as hard drives).

The second is easy to instantiate; it can be a simple file, the only constraint being that it must be block aligned. In this case, since our article on our archiving system, blaze allows you to create an archive from a list of emails:

$ opam install blaze
$ blaze pack make -o pack.pack --align 512 <<EOF
001.eml
002.eml
003.eml
...
EOF

As for the net device, this is a bit more complicated; we need to allocate a tap interface tap0 and a bridge service to which we would connect our tap0 interface:

sudo ip link add name service type bridge
sudo ip addr add 10.0.0.1/24 dev service
sudo ip tuntap add name tap0 mode tap
sudo ip link set tap0 master service
sudo ip link set service up
sudo ip link set tap0 up

In the second line we also setup a local network 10.0.0.x/24, and configure the host system to have 10.0.0.1. Our unikernel will connect to the tap0 interface (to be connected to our local network).

Then launch our unikernel using our Solo5 tender:

$ solo5-hvt --mem=512 --net:service=tap0 --block:archive=pack.pack -- \
  _build/solo5/main.exe --ipv4=10.0.0.2/24 --color=always

As such, we can recommend installing and using Albatross to deploy unikernels, as well as builds.robur.coop, which offers a series of downloadable unikernels.

Vif: a web framework for unikernels in OCaml

This is where we can introduce vifu: our OCaml web framework for unikernels. It is simply a specialisation of our vif web framework, currently used by our website builds.robur.coop! vif{,u} focuses on three points:

  1. Providing a web framework that works with our Miou scheduler—there are benchmarks on this subject, and our latest release of Miou improves them, notably by now using poll(2)/ppoll(2). You can see more details here.
  2. Offering typed routes that can also be paths based on regular expressions. Thanks to the work of Gabriel Radanne and our feedback, we have managed to create a very expressive DSL for building routes.
  3. Type the content of requests and handle the serialisation/deserialisation of application/json or multipart/form-data content (which allows us to offer typed forms). The idea here is essentially to manipulate OCaml values rather than letting the user manage this not-so-interesting part.

A tutorial showing all these aspects is available here and allows you to code a chat room in less than 300 lines of code!

We have gone even further by offering mnet, which implements the IPv4 layer required for unikernels. You can find more details on this in this article, as well as utcp, which is a formalized implementation of the TCP layer, and mhttp, which is simply a specialisation of httpcats but for unikernels!

Thus, launching a web server in the form of a unikernel essentially consists of:

let ( let@ ) finally fn = Fun.protect ~finally fn

let run _ cidr gateway port =
  let devices =
    let open Mkernel in
    [ Mnet.stackv4 ~name:"service" ?gateway cidr
    (* We instantiate a TCP/IPv4 stack *) ] in
  Mkernel.run (* like [Miou.run] *) devices @@ fun (daemon, tcpv4, _udpv4) () ->
  let rng = Mirage_crypto_rng_mkernel.initialize (module RNG) in
  let@ () = fun () ->
    Mirage_crypto_rng_mkernel.kill rng;
    Mnet.kill daemon in
  let cfg = Vifu.Config.v port in
  (* We create a configuration to listen on a specific [port]. *)
  let routes = [ ... ] (* Routes like [vif]. *) in
  Vifu.run ~cfg tcpv4 routes ()

By launching the unikernel on 10.0.0.2 (as above) and with port = 80, you should be able to access your website via http://10.0.0.2/!

backend & frontend

When it comes to developing a webapp, we often talk about the backend (the http server) and frontend (JavaScript scripts). As you can imagine, our profiles are more "backend" (even though we can boast of being true full-stack engineers!), but that doesn't mean we can't develop a "frontend", particularly thanks to js_of_ocaml. In this case, blame is a project for which we had to develop both sides.

The question remains: how can the frontend be served by the backend (when the latter is a unikernel)? This is where the compilation pipeline becomes interesting.

This is where the magic of dune comes into play, which consists of:

In our ecosystem, there is an old piece of software called ocaml-crunch that allows us to serialise the contents of a folder into an OCaml module. This module can then be linked with an executable (which can access the content without even performing I/O).

ocaml-crunch was developed in the context of mirage to still obtain the contents of files even though a unikernel has no concept of what a file system is (remember that a unikernel can only manipulate net devices or block devices!). But let's put our separation from the mirage tool behind us and just offer a tool that serialises the files given as arguments to an OCaml module:

let contents_of_filename filename =
  let ic = open_in_bin filename in
  let finally () = close_in ic in
  Fun.protect ~finally @@ fun () ->
  let tmp = Bytes.create 0x7ff in
  let buf = Buffer.create 0x7ff in
  let rec go () =
    match input ic tmp 0 (Bytes.length tmp) with
    | 0 -> Buffer.contents buf
    | len ->
        Buffer.add_subbytes buf tmp 0 len;
        go ()
    | exception End_of_file -> Buffer.contents buf
  in
  go ()

let run _quiet cfg filenames output =
  let ppf, finally =
    match output with
    | None -> (Fmt.stdout, ignore)
    | Some filename ->
        let oc = open_out_bin filename in
        let ppf = Format.formatter_of_out_channel oc in
        let finally () = close_out oc in
        (ppf, finally)
  in
  Fun.protect ~finally @@ fun () ->
  let fn (filename, name) =
    let str = contents_of_filename filename in
    Fmt.pf ppf "let %s = @[<hov>%a@]\n%!" name (Hxd_string.pp cfg) str
  in
  List.iter fn filenames

Here, we use hxd, which produces a very nice hexadecimal output (aligned with our formatter box). This program essentially consists of producing an OCaml module that we could reuse in our unikernel:

$ dune exec ./crunch.exe -- --with-comments -f script_js:_build/default/script.bc.js
let script_js = [| "\x2f\x2f\x20\x47\x65\x6e\x65\x72\x61\x74\x65\x64\x20\x62\x79\x20"    (* // Generated by  *)
                 ; "\x6a\x73\x5f\x6f\x66\x5f\x6f\x63\x61\x6d\x6c\x0a\x2f\x2f\x23\x20"    (* js_of_ocaml.//#  *)
                 ; "\x62\x75\x69\x6c\x64\x49\x6e\x66\x6f\x3a\x65\x66\x66\x65\x63\x74"    (* buildInfo:effect *)
                 ; "\x73\x3d\x64\x69\x73\x61\x62\x6c\x65\x64\x2c\x20\x6b\x69\x6e\x64"    (* s=disabled, kind *)
                 ; "\x3d\x65\x78\x65\x2c\x20\x75\x73\x65\x2d\x6a\x73\x2d\x73\x74\x72"    (* =exe, use-js-str *)
                 ; "\x69\x6e\x67\x3d\x74\x72\x75\x65\x2c\x20\x76\x65\x72\x73\x69\x6f"    (* ing=true, versio *)
                 ; "\x6e\x3d\x36\x2e\x32\x2e\x30\x0a\x28\x66\x75\x6e\x63\x74\x69\x6f"    (* n=6.2.0.(functio *)
                 ...
                 ; "\x29\x29\x3b\x0a" |]                                                 (* ));.             *)

We just need to add two dune rules and add our new module to the modules field of our unikernel.

(executable
 (name script)
 (modules script)
 (modes js))

(rule
 (targets documents.ml)
 (deps script.bc.js)
 (action
  (with-stdout-to
   documents.ml
   (run
    %{exe:crunch.exe}
    --with-comments
    --file script_js:script.bc.js))))

(executable
 (name main)
 (modules main documents)
 (link_flags :standard -cclib "-z solo5-abi=hvt")
 ...)

On the unikernel side, we can finally directly access our Documents.script_js and serve it to our clients via http:

let from_documents ~mime contents =
  (* we compute an Etag *)
  let hash =
    let rec go ctx idx =
      if idx >= Array.length contents then Digestif.SHA1.(to_hex (get ctx))
      else go (Digestif.SHA1.feed_string ctx contents.(idx)) (succ idx)
    in
    go Digestif.SHA1.empty 0
  in
  fun req _server () ->
    let open Vifu.Response.Syntax in
    let* () = Vifu.Response.add ~field:"content-type" mime in
    let hdrs = Vifu.Request.headers req in
    let if_none_match =
      match Vifu.Headers.get hdrs "if-none-match" with
      | Some hash' -> String.equal hash' hash
      | None -> false
    in
    if if_none_match then
      (* we use the cache of the webbrowser *)
      let* () = Vifu.Response.empty in
      Vifu.Response.respond `Not_modified
    else
      (* we just transmit the contents of our document *)
      let from = Flux.Source.array contents in
      let* () = Vifu.Response.add ~field:"Etag" hash in
      let* () = Vifu.Response.with_source req ~compression:`DEFLATE from in
      Vifu.Response.respond `OK

(* we do that for our script.js generated by js_of_ocaml *)
let script = from_documents ~mime:"application/javascript" Documents.script_js

let run _ cidr gateway port =
  ...
  let routes = 
    let open Vifu.Route in
    let open Vifu.Uri in
    (* and we just need add our new route *)
    [ get (rel / "script.js" /?? any) --> script ] in
  Vifu.run ~cfg tcpv4 routes ()

Archive and memory consumption

This is perhaps where it matters most. After playing around with dune and js_of_ocaml and setting up a compilation pipeline to produce a unikernel that acts as a backend but can also serve the scripts needed for the frontend, we can talk about another major constraint for unikernels: memory!

OCaml has a garbage collector, but it can be easily overwhelmed, ultimately exceeding its authorised usage. In general, we prefer unikernels that use as little memory as possible. This can range from 32MB to 512MB (which is starting to be enormous for a unikernel).

In our article on archiving, we handled a large number of emails (around 30k) and archives of around 150MB. It is fine for these archives to be in the form of a block device, as we do not necessarily need to load our entire archive into memory in order to read our emails.

If we thought this way, our unikernel would struggle to handle archives of this size, and we need to be smart about what we really want to get out of this archive without putting pressure on the GC. So loading all the emails into memory in order to analyse them is clearly not an option.

However, in our previous article on our search engine (and this is what this unikernel is about), it is still necessary to have at least the frequencies of the root words for each document in order to calculate their scores for each search query.

The first test of such memory usage for an archive of 20k documents was impossible with a unikernel having 512MB of memory... So let's rethink things a little. Since it seems so simple to be consistent in the construction of our unikernel between the frontend and the backend, could we not move the problem to the client side? A unikernel should use at most 512MB of memory, while a client has a web browser such as Firefox or Chromium that already uses GB of memory 😈!

Transmitting informations between backend and frontend

As we said about Vif, the web framework attempts to manage the serialization and deserialization of information for you—in addition, via the JSON format. With regard to blame, the server will essentially try to transmit documents to the JavaScript script, and since we have a consistent compilation pipeline, blame is structured so that:

Our common format:

type 'uid document =
  { length : int
  ; mail : 'uid
  ; blob : 'uid
  ; tokens : (string * int) list }

let token =
  let open Jsont in
  let stem = Object.mem "stem" ~enc:(fun (a, _) -> a) string in
  let count = Object.mem "count" ~enc:(fun (_, b) -> b) int in
  let fn stem count = (stem, count) in
  Object.map fn |> stem |> count |> Object.finish

let stem ~uid =
  let open Jsont in
  let length = Object.mem "length" ~enc:(fun t -> t.length) int in
  let mail = Object.mem "mail" ~enc:(fun t -> t.mail) uid in
  let blob = Object.mem "blob" ~enc:(fun t -> t.blob) uid in
  let tokens = Object.mem "tokens" ~enc:(fun t -> t.tokens) (list token) in
  let fn length mail blob tokens = { length; mail; blob; tokens } in
  Object.map fn |> length |> mail |> blob |> tokens |> Object.finish

The request handlers to download a certain number of stem documents:

let stem_of_uid pack uid =
  let size = Carton.size_of_uid pack ~uid Carton.Size.zero in
  let blob = Carton.Blob.make ~size in
  let value = Carton.of_uid pack blob ~uid in
  match Carton.Value.kind value with
  | `A | `B | `D -> Fmt.invalid_arg "Invalid stem object"
  | `C ->
      let str = Carton.Value.string value in
      let stem = Stem.of_string str in
      let mail, blob, length, tbl = Result.get_ok stem in
      let tokens = List.of_seq (Hashtbl.to_seq tbl) in
      let mail = Carton.Uid.unsafe_of_string mail
      and blob = Carton.Uid.unsafe_of_string blob in
      { Format.mail; blob; length; tokens }

let pstem pack req _server () =
  let open Vifu.Response.Syntax in
  try
    match Vifu.Request.of_json req with
    | Ok uids ->
        let pack = Carton.copy pack in
        let ts = List.map (stem_of_uid pack) uids in
        let fmt = Format.stem ~uid:juid in
        (* serialization *)
        let* () = Vifu.Response.with_json req (Jsont.list fmt) ts in
        Vifu.Response.respond `OK
    | Error _ ->
        let* () = Vifu.Response.with_text req "Invalid JSON object!\n" in
        Vifu.Response.respond `Bad_request
  with _exn ->
    let* () = Vifu.Response.with_text req str in
    Vifu.Response.respond `Not_found

The frontend which download stem documents according a list of unique ID:

let download uids =
  let fn uids =
    let method' = jstrf "POST"
    and json = Jsont_brr.encode Jsont.(list string) uids
    and headers =
      Brr_io.Fetch.Headers.of_assoc
        [ (jstrf "Content-Type", jstrf "application/json") ]
    in
    let json = Result.get_ok json in
    let body = Brr_io.Fetch.Body.of_jstr json in
    let init = Brr_io.Fetch.Request.init ~body ~headers ~method' () in
    let req = Brr_io.Fetch.Request.v ~init (jstrf "/stems") in
    let* resp = Brr_io.Fetch.request req in
    let body = Brr_io.Fetch.Response.as_body resp in
    let* stems = Brr_io.Fetch.Body.json body in
    let fmt = Format.stem ~uid:Jsont.string in
    (* deserialization *)
    let stems = Jsont_brr.decode_jv (Jsont.list fmt) stems in
    Fut.return stems
  in
  let stems = bulk 50 uids (* download documents 50 at a time *) in
  let documents = List.map fn stems in
  let* documents = Fut.of_list documents |> Fut.map Result.ok in
  let documents = List.filter_map Result.to_option documents in
  Fut.return (Ok (List.flatten documents))

Results

And that's how we shift the entire memory load of emails to the client side! The metric we have is that a unikernel with 256MB of memory can handle ~30k emails (and a 176M archive) using only ~140MB memory!

In our previous article, we suggested pre-calculating email occurrences and saving this calculation in the archive. We decided to do this to further reduce the unikernel's memory usage (and its boot time at the same time).

Once the script has downloaded everything, it just needs to make one last request so that the unikernel stems the request and the script can score the documents according to the stemmed query. This calculation is not that difficult, it's just a List.map on the ~20k documents that the web browser has and can do quite easily.

Here is a small GIF to show the result on our ~30k e-mails!

A simple view of our webapp

The caml-list archive

One of our goals is to eventually offer blame for the caml-list. The most interesting source I could find was carefully proposed by Nicolås Ojeda Bär here.

We therefore tried to transform this repository into an archive, and the metrics are interesting. The first thing to note is that the archive contains a total of 65837 emails that Mr.MIME was able to parse. Only 15 emails could not be processed for various reasons (but legitimate ones, as these emails are completely invalid according to the RFCs and even your email client would not be able to display them correctly!).

It should be noted that these errors did not occur when we ran our tests on LKML (the French did not know, at the time of ISO-8859, how to write valid emails...).

The archive is 217M for a total of 427M emails. The ratio is therefore 2, but, and this is the big change since our last article on our archiving system, the archive contains not only emails but also dictionaries associated with the documents (which contain the frequency of words) so that these calculations do not have to be redone when searching for an email.

As for documents, we have 69094 (yes, more than emails, because an email can contain several parts!). These are only text documents (we ignore PGP keys, images, etc.).

That's pretty much an overview of our caml-list archive!

Searching with our unikernel

We then tried running blame with this archive. The unikernel seems to require only ~350M of memory, which was to be expected. The JavaScript script starts to slow down quite a bit when handling these ~65k emails, but it's still "fairly" acceptable (though there are surely ways to improve the user experience). So we tested it a little and searched for a few emails. Among them, we found this email from Julian Assange talking about the old OCaml licence:

All this legalise is very interesting, but in the end, as with all legalise there is only one important issue:

a) Who has the motivation, resources and standing to fuck with you?

Hint: it's not INRIA.

But going further, we also found this email from Xavier Leroy regarding www.ocaml.org:

Julian Assange, the creator and maintainer of the www.ocaml.org website, is looking for a new maintainer to take care of that site.

As I understand it, this is a light job: just update the "News" section and add or update some links when appropriate. However, in the interest of continuity, it would be good if the new maintainer could commit for, say, a couple of years.

We (the OCaml team) could do the maintenance if we had to, but I like the idea of a Caml site that is run by enthusiastic users...

If there are any volunteers, please contact me and I'll put you in touch with Julian.

And, of course, many thanks to Julian for his efforts and support.

- Xavier Leroy

In short, this is perhaps the most important thing and what we wanted to achieve in the first place: to preserve a certain heritage full of knowledge such as this:

I was constructing a binary heap of tuples the other day. After pondering these options, I just used Obj.magic 0 as the null value in the array. The heap was in a module, so nothing else could see the array, and I could prove that the code never accessed the null elements, so the use of Obj.magic seemed justified.

In other terms:

" I was walking in the city the other day. I saw a syringe lying on the sidewalk. I stuck the needle in my forearm. That was a classy neighborhood, so the use of the syringe seemed justified. "

Sorry for being sarcastic, but I strongly feel that any suggestion to use Obj functions should be avoided on this list. The OCaml compiler performs some type-dependent optimizations that can result in incorrect code (w.r.t. GC invariants) if wrong types are given using Obj.magic.

- Xavier Leroy

Conclusion

blame is still experimental because it relies on libraries such as mnet, utcp, vifu, and mkernel, which are not yet fully finalized. However, at this stage, it has been very enjoyable to develop a website in OCaml in the form of a unikernel, to transmit information from one side of the web app to the other, and to avoid Out_of_memory errors even when handling large amounts of information.

It is also the culmination of a long process in which we achieved several objectives, such as:

It is also interesting to look beyond the limits in terms of memory usage. From our observations, we can see that the different stacks (ip, tcp, or http) do not have any memory leaks (unlike mirage-tcpip). Instead, we observe "stable" memory usage, where our unikernel allocates everything at once and does not end up using more memory than it has. From what we know about GC, it attempts to maintain a ratio between live data and the space available for new allocations.

Of course, GC is not aware of the unikernel's limits (and this may be the major flaw in OCaml 5 at present), but these experiments allow us to go further and with greater confidence in the development of services in the form of unikernels.

So, happy hacking "discriminating hackers"!