A webapp to search emails as an unikernel
2025-12-04We are pleased to announce the development of our new unikernel:
blame. It is a unikernel that allows you to launch a webapp for
searching emails from an archive.
If you want to follow this article, we recommend reading two other articles:
blame is therefore the synthesis of these two works, but in the form of and
unikernel. We will therefore elaborate on this part a little more and show what
the new era of unikernels in OCaml could be.
How to compile & use blame?
blame is part of a series of projects that are trying out a new workflow in
unikernel development. This project no longer requires the mirage tool and
requires OCaml 5, using effects and Miou as a scheduler. In this sense,
it is quite different from what is currently available in terms of unikernels
in OCaml. We will therefore take a detailed look at how to build such a
unikernel and how it was developed.
There are three main constraints for building a unikernel:
- All C sources must be compiled with certain options (notably the
-ffreestandingoption and include Solo5 headers). - The POSIX interface is not available (and therefore neither is the
unix.cmxamodule). - the linking for object files is different from the one used to produce a simple executable (we notably do a static link)
Given these constraints, producing a unikernel from OCaml code requires a
compilation context (or a toolchain). Fortunately, dune allows you to
describe such a context, and we have been offering such a toolchain for quite
some time: ocaml-solo5. This means that building a unikernel
essentially involves compiling a project (always bearing in mind that the
Unix module does not exist) with our modified OCaml compiler (so that the
compilation of C files has the expected options and the linking corresponds to
the construction of a unikernel).
If you are interested, slides on this topic are available here.
The problem is that there are projects (such as mirage-crypto)
that contain C files. If we try to link mirage_crypto.cmxa available in our
OPAM switch with our unikernel, there would be an error (since the C files
would not have been compiled with our modified OCaml compiler but instead with
the host compiler).
This is where dune offers us another solution: we can "vendor" the
dependencies. So, the first step in building blame is to download the sources
of the dependencies.
The fundamental difference with the mirage tool
At this point, experts might say that this is what the mirage tool does.
However, there is a subtle (yet very problematic) step that the mirage tool
performs that we no longer do.
blame is a unikernel. It is not an executable and is a unikernel specific to
Solo5. In this case, the dependencies needed to build blame are known "in
advance": we know that we will use mirage-crypto-rng-mkernel or vifu.
The mirage tool had a step that consists of choosing what we wanted to
produce and, based on that choice, resolving the dependencies needed for the
unikernel. For blame, that is no longer the case: there is no longer a
resolution phase.
In reality, we leave this phase to opam, which means that attempting to
install blame means letting opam resolve the dependencies needed for
blame (as it can do for any software available in opam-repository). The
sources of these dependencies (installed) can be re-obtained using opam source (because, let's not forget, we cannot use the mirage_crypto.cmxa from
our switch but recompile it with our compiler).
Ultimately, the idea is essentially to remove the opam monorepo tool used by
mirage, which never really gained consensus and caused problems in the
unikernel development workflow.
Fetch and dune build
Ultimately, there is a small script in blame that simply calls opam source.
Finally, it just consists of calling dune build (as we do for any dune
project):
$ git clone https://git.robur.coop/robur/blame.git
$ cd blame
$ opam pin add --yes --no-action .
$ opam install --deps-only blame
$ ./source.sh
Successfully extracted to /home/dinosaure/dev/blame/vendors/bstr
Successfully extracted to /home/dinosaure/dev/blame/vendors/digestif
Successfully extracted to /home/dinosaure/dev/blame/vendors/vifu
...
$ dune build
$ file _build/solo5/main.exe
_build/solo5/main.exe: ELF 64-bit LSB executable, x86-64, version 1 (SYSV),
statically linked, interpreter /nonexistent/solo5/, for OpenBSD,
with debug_info, not stripped
Run an unikernel
This is where most people (myself included) have always struggled: running a unikernel. A unikernel is not an executable and needs to be allocated a CPU and memory to run (just like a real operating system). However, our unikernel should be able to:
- communicate via protocols (such as http)
- manipulate an email archive
We are talking here about devices, and Solo5 offers two types: net devices
(which can be thought of as Ethernet ports) and block devices (which can be
thought of as hard drives).
The second is easy to instantiate; it can be a simple file, the only constraint
being that it must be block aligned. In this case, since our article on our
archiving system, blaze allows you to create an archive from a list
of emails:
$ opam install blaze
$ blaze pack make -o pack.pack --align 512 <<EOF
001.eml
002.eml
003.eml
...
EOF
As for the net device, this is a bit more complicated; we need to allocate a
tap interface tap0 and a bridge service to which we would connect our
tap0 interface:
sudo ip link add name service type bridge
sudo ip addr add 10.0.0.1/24 dev service
sudo ip tuntap add name tap0 mode tap
sudo ip link set tap0 master service
sudo ip link set service up
sudo ip link set tap0 up
In the second line we also setup a local network 10.0.0.x/24, and configure the
host system to have 10.0.0.1. Our unikernel will connect to the tap0
interface (to be connected to our local network).
Then launch our unikernel using our Solo5 tender:
$ solo5-hvt --mem=512 --net:service=tap0 --block:archive=pack.pack -- \
_build/solo5/main.exe --ipv4=10.0.0.2/24 --color=always
As such, we can recommend installing and using Albatross to deploy unikernels, as well as builds.robur.coop, which offers a series of downloadable unikernels.
Vif: a web framework for unikernels in OCaml
This is where we can introduce vifu: our OCaml web framework for unikernels.
It is simply a specialisation of our vif web
framework, currently used by our website
builds.robur.coop! vif{,u} focuses on three points:
- Providing a web framework that works with our Miou schedulerâthere are
benchmarks on this subject, and our latest release of
Miou improves them, notably by now using
poll(2)/ppoll(2). You can see more details here. - Offering typed routes that can also be paths based on regular expressions. Thanks to the work of Gabriel Radanne and our feedback, we have managed to create a very expressive DSL for building routes.
- Type the content of requests and handle the serialisation/deserialisation of
application/jsonormultipart/form-datacontent (which allows us to offer typed forms). The idea here is essentially to manipulate OCaml values rather than letting the user manage this not-so-interesting part.
A tutorial showing all these aspects is available here and allows you to code a chat room in less than 300 lines of code!
We have gone even further by offering mnet, which implements the IPv4
layer required for unikernels. You can find more details on this in this
article, as well as utcp, which is a formalized
implementation of the TCP layer, and mhttp, which is simply a
specialisation of httpcats but for unikernels!
Thus, launching a web server in the form of a unikernel essentially consists of:
let ( let@ ) finally fn = Fun.protect ~finally fn
let run _ cidr gateway port =
let devices =
let open Mkernel in
[ Mnet.stackv4 ~name:"service" ?gateway cidr
(* We instantiate a TCP/IPv4 stack *) ] in
Mkernel.run (* like [Miou.run] *) devices @@ fun (daemon, tcpv4, _udpv4) () ->
let rng = Mirage_crypto_rng_mkernel.initialize (module RNG) in
let@ () = fun () ->
Mirage_crypto_rng_mkernel.kill rng;
Mnet.kill daemon in
let cfg = Vifu.Config.v port in
(* We create a configuration to listen on a specific [port]. *)
let routes = [ ... ] (* Routes like [vif]. *) in
Vifu.run ~cfg tcpv4 routes ()
By launching the unikernel on 10.0.0.2 (as above) and with port = 80, you
should be able to access your website via http://10.0.0.2/!
backend & frontend
When it comes to developing a webapp, we often talk about the backend (the http
server) and frontend (JavaScript scripts). As you can imagine, our profiles are
more "backend" (even though we can boast of being true full-stack engineers!),
but that doesn't mean we can't develop a "frontend", particularly thanks to
js_of_ocaml. In this case, blame is a project for which we
had to develop both sides.
The question remains: how can the frontend be served by the backend (when the latter is a unikernel)? This is where the compilation pipeline becomes interesting.
This is where the magic of dune comes into play, which consists of:
- specifying that one of our modules must be compiled with
js_of_ocaml - specifying that the result of this compilation must be handled by a program that serialises the JavaScript script obtained into something that can be statically linked to the unikernel
- obtaining the content of this script on the unikernel side
In our ecosystem, there is an old piece of software called
ocaml-crunch that allows us to serialise the contents of a
folder into an OCaml module. This module can then be linked with an executable
(which can access the content without even performing I/O).
ocaml-crunch was developed in the context of mirage to still obtain the
contents of files even though a unikernel has no concept of what a file system
is (remember that a unikernel can only manipulate net devices or block
devices!). But let's put our separation from the mirage tool behind us and
just offer a tool that serialises the files given as arguments to an OCaml
module:
let contents_of_filename filename =
let ic = open_in_bin filename in
let finally () = close_in ic in
Fun.protect ~finally @@ fun () ->
let tmp = Bytes.create 0x7ff in
let buf = Buffer.create 0x7ff in
let rec go () =
match input ic tmp 0 (Bytes.length tmp) with
| 0 -> Buffer.contents buf
| len ->
Buffer.add_subbytes buf tmp 0 len;
go ()
| exception End_of_file -> Buffer.contents buf
in
go ()
let run _quiet cfg filenames output =
let ppf, finally =
match output with
| None -> (Fmt.stdout, ignore)
| Some filename ->
let oc = open_out_bin filename in
let ppf = Format.formatter_of_out_channel oc in
let finally () = close_out oc in
(ppf, finally)
in
Fun.protect ~finally @@ fun () ->
let fn (filename, name) =
let str = contents_of_filename filename in
Fmt.pf ppf "let %s = @[<hov>%a@]\n%!" name (Hxd_string.pp cfg) str
in
List.iter fn filenames
Here, we use hxd, which produces a very nice hexadecimal output
(aligned with our formatter box). This program essentially consists of producing
an OCaml module that we could reuse in our unikernel:
$ dune exec ./crunch.exe -- --with-comments -f script_js:_build/default/script.bc.js
let script_js = [| "\x2f\x2f\x20\x47\x65\x6e\x65\x72\x61\x74\x65\x64\x20\x62\x79\x20" (* // Generated by *)
; "\x6a\x73\x5f\x6f\x66\x5f\x6f\x63\x61\x6d\x6c\x0a\x2f\x2f\x23\x20" (* js_of_ocaml.//# *)
; "\x62\x75\x69\x6c\x64\x49\x6e\x66\x6f\x3a\x65\x66\x66\x65\x63\x74" (* buildInfo:effect *)
; "\x73\x3d\x64\x69\x73\x61\x62\x6c\x65\x64\x2c\x20\x6b\x69\x6e\x64" (* s=disabled, kind *)
; "\x3d\x65\x78\x65\x2c\x20\x75\x73\x65\x2d\x6a\x73\x2d\x73\x74\x72" (* =exe, use-js-str *)
; "\x69\x6e\x67\x3d\x74\x72\x75\x65\x2c\x20\x76\x65\x72\x73\x69\x6f" (* ing=true, versio *)
; "\x6e\x3d\x36\x2e\x32\x2e\x30\x0a\x28\x66\x75\x6e\x63\x74\x69\x6f" (* n=6.2.0.(functio *)
...
; "\x29\x29\x3b\x0a" |] (* ));. *)
We just need to add two dune rules and add our new module to the modules field
of our unikernel.
(executable
(name script)
(modules script)
(modes js))
(rule
(targets documents.ml)
(deps script.bc.js)
(action
(with-stdout-to
documents.ml
(run
%{exe:crunch.exe}
--with-comments
--file script_js:script.bc.js))))
(executable
(name main)
(modules main documents)
(link_flags :standard -cclib "-z solo5-abi=hvt")
...)
On the unikernel side, we can finally directly access our Documents.script_js
and serve it to our clients via http:
let from_documents ~mime contents =
(* we compute an Etag *)
let hash =
let rec go ctx idx =
if idx >= Array.length contents then Digestif.SHA1.(to_hex (get ctx))
else go (Digestif.SHA1.feed_string ctx contents.(idx)) (succ idx)
in
go Digestif.SHA1.empty 0
in
fun req _server () ->
let open Vifu.Response.Syntax in
let* () = Vifu.Response.add ~field:"content-type" mime in
let hdrs = Vifu.Request.headers req in
let if_none_match =
match Vifu.Headers.get hdrs "if-none-match" with
| Some hash' -> String.equal hash' hash
| None -> false
in
if if_none_match then
(* we use the cache of the webbrowser *)
let* () = Vifu.Response.empty in
Vifu.Response.respond `Not_modified
else
(* we just transmit the contents of our document *)
let from = Flux.Source.array contents in
let* () = Vifu.Response.add ~field:"Etag" hash in
let* () = Vifu.Response.with_source req ~compression:`DEFLATE from in
Vifu.Response.respond `OK
(* we do that for our script.js generated by js_of_ocaml *)
let script = from_documents ~mime:"application/javascript" Documents.script_js
let run _ cidr gateway port =
...
let routes =
let open Vifu.Route in
let open Vifu.Uri in
(* and we just need add our new route *)
[ get (rel / "script.js" /?? any) --> script ] in
Vifu.run ~cfg tcpv4 routes ()
Archive and memory consumption
This is perhaps where it matters most. After playing around with dune and
js_of_ocaml and setting up a compilation pipeline to produce a unikernel that
acts as a backend but can also serve the scripts needed for the frontend, we can
talk about another major constraint for unikernels: memory!
OCaml has a garbage collector, but it can be easily overwhelmed, ultimately exceeding its authorised usage. In general, we prefer unikernels that use as little memory as possible. This can range from 32MB to 512MB (which is starting to be enormous for a unikernel).
In our article on archiving, we handled a large number of emails (around 30k) and archives of around 150MB. It is fine for these archives to be in the form of a block device, as we do not necessarily need to load our entire archive into memory in order to read our emails.
If we thought this way, our unikernel would struggle to handle archives of this size, and we need to be smart about what we really want to get out of this archive without putting pressure on the GC. So loading all the emails into memory in order to analyse them is clearly not an option.
However, in our previous article on our search engine (and this is what this unikernel is about), it is still necessary to have at least the frequencies of the root words for each document in order to calculate their scores for each search query.
The first test of such memory usage for an archive of 20k documents was impossible with a unikernel having 512MB of memory... So let's rethink things a little. Since it seems so simple to be consistent in the construction of our unikernel between the frontend and the backend, could we not move the problem to the client side? A unikernel should use at most 512MB of memory, while a client has a web browser such as Firefox or Chromium that already uses GB of memory đ!
Transmitting informations between backend and frontend
As we said about Vif, the web framework attempts to manage the serialization
and deserialization of information for youâin addition, via the JSON format.
With regard to blame, the server will essentially try to transmit documents
to the JavaScript script, and since we have a consistent
compilation pipeline, blame is structured so that:
- there is a library describing the format of our information using
jsont - this library is used on the frontend to deserialize the information
- this library is used on the backend to serialize the information from what we can retrieve from our email archive
Our common format:
type 'uid document =
{ length : int
; mail : 'uid
; blob : 'uid
; tokens : (string * int) list }
let token =
let open Jsont in
let stem = Object.mem "stem" ~enc:(fun (a, _) -> a) string in
let count = Object.mem "count" ~enc:(fun (_, b) -> b) int in
let fn stem count = (stem, count) in
Object.map fn |> stem |> count |> Object.finish
let stem ~uid =
let open Jsont in
let length = Object.mem "length" ~enc:(fun t -> t.length) int in
let mail = Object.mem "mail" ~enc:(fun t -> t.mail) uid in
let blob = Object.mem "blob" ~enc:(fun t -> t.blob) uid in
let tokens = Object.mem "tokens" ~enc:(fun t -> t.tokens) (list token) in
let fn length mail blob tokens = { length; mail; blob; tokens } in
Object.map fn |> length |> mail |> blob |> tokens |> Object.finish
The request handlers to download a certain number of stem documents:
let stem_of_uid pack uid =
let size = Carton.size_of_uid pack ~uid Carton.Size.zero in
let blob = Carton.Blob.make ~size in
let value = Carton.of_uid pack blob ~uid in
match Carton.Value.kind value with
| `A | `B | `D -> Fmt.invalid_arg "Invalid stem object"
| `C ->
let str = Carton.Value.string value in
let stem = Stem.of_string str in
let mail, blob, length, tbl = Result.get_ok stem in
let tokens = List.of_seq (Hashtbl.to_seq tbl) in
let mail = Carton.Uid.unsafe_of_string mail
and blob = Carton.Uid.unsafe_of_string blob in
{ Format.mail; blob; length; tokens }
let pstem pack req _server () =
let open Vifu.Response.Syntax in
try
match Vifu.Request.of_json req with
| Ok uids ->
let pack = Carton.copy pack in
let ts = List.map (stem_of_uid pack) uids in
let fmt = Format.stem ~uid:juid in
(* serialization *)
let* () = Vifu.Response.with_json req (Jsont.list fmt) ts in
Vifu.Response.respond `OK
| Error _ ->
let* () = Vifu.Response.with_text req "Invalid JSON object!\n" in
Vifu.Response.respond `Bad_request
with _exn ->
let* () = Vifu.Response.with_text req str in
Vifu.Response.respond `Not_found
The frontend which download stem documents according a list of unique ID:
let download uids =
let fn uids =
let method' = jstrf "POST"
and json = Jsont_brr.encode Jsont.(list string) uids
and headers =
Brr_io.Fetch.Headers.of_assoc
[ (jstrf "Content-Type", jstrf "application/json") ]
in
let json = Result.get_ok json in
let body = Brr_io.Fetch.Body.of_jstr json in
let init = Brr_io.Fetch.Request.init ~body ~headers ~method' () in
let req = Brr_io.Fetch.Request.v ~init (jstrf "/stems") in
let* resp = Brr_io.Fetch.request req in
let body = Brr_io.Fetch.Response.as_body resp in
let* stems = Brr_io.Fetch.Body.json body in
let fmt = Format.stem ~uid:Jsont.string in
(* deserialization *)
let stems = Jsont_brr.decode_jv (Jsont.list fmt) stems in
Fut.return stems
in
let stems = bulk 50 uids (* download documents 50 at a time *) in
let documents = List.map fn stems in
let* documents = Fut.of_list documents |> Fut.map Result.ok in
let documents = List.filter_map Result.to_option documents in
Fut.return (Ok (List.flatten documents))
Results
And that's how we shift the entire memory load of emails to the client side! The metric we have is that a unikernel with 256MB of memory can handle ~30k emails (and a 176M archive) using only ~140MB memory!
In our previous article, we suggested pre-calculating email occurrences and saving this calculation in the archive. We decided to do this to further reduce the unikernel's memory usage (and its boot time at the same time).
Once the script has downloaded everything, it just needs to make one last
request so that the unikernel stems the request and the script can score the
documents according to the stemmed query. This calculation is not that
difficult, it's just a List.map on the ~20k documents that the web browser
has and can do quite easily.
Here is a small GIF to show the result on our ~30k e-mails!

The caml-list archive
One of our goals is to eventually offer blame for the caml-list. The most
interesting source I could find was carefully proposed by Nicolås Ojeda Bär
here.
We therefore tried to transform this repository into an archive, and the metrics are interesting. The first thing to note is that the archive contains a total of 65837 emails that Mr.MIME was able to parse. Only 15 emails could not be processed for various reasons (but legitimate ones, as these emails are completely invalid according to the RFCs and even your email client would not be able to display them correctly!).
It should be noted that these errors did not occur when we ran our tests on LKML (the French did not know, at the time of ISO-8859, how to write valid emails...).
The archive is 217M for a total of 427M emails. The ratio is therefore 2, but, and this is the big change since our last article on our archiving system, the archive contains not only emails but also dictionaries associated with the documents (which contain the frequency of words) so that these calculations do not have to be redone when searching for an email.
As for documents, we have 69094 (yes, more than emails, because an email can contain several parts!). These are only text documents (we ignore PGP keys, images, etc.).
That's pretty much an overview of our caml-list archive!
Searching with our unikernel
We then tried running blame with this archive. The unikernel seems to require
only ~350M of memory, which was to be expected. The JavaScript script starts to
slow down quite a bit when handling these ~65k emails, but it's still "fairly"
acceptable (though there are surely ways to improve the user experience). So we
tested it a little and searched for a few emails. Among them, we found this
email from Julian Assange talking about the old OCaml licence:
All this legalise is very interesting, but in the end, as with all legalise there is only one important issue:
a) Who has the motivation, resources and standing to fuck with you?
Hint: it's not INRIA.
But going further, we also found this email from Xavier Leroy regarding www.ocaml.org:
Julian Assange, the creator and maintainer of the www.ocaml.org website, is looking for a new maintainer to take care of that site.
As I understand it, this is a light job: just update the "News" section and add or update some links when appropriate. However, in the interest of continuity, it would be good if the new maintainer could commit for, say, a couple of years.
We (the OCaml team) could do the maintenance if we had to, but I like the idea of a Caml site that is run by enthusiastic users...
If there are any volunteers, please contact me and I'll put you in touch with Julian.
And, of course, many thanks to Julian for his efforts and support.
- Xavier Leroy
In short, this is perhaps the most important thing and what we wanted to achieve in the first place: to preserve a certain heritage full of knowledge such as this:
I was constructing a binary heap of tuples the other day. After pondering these options, I just used Obj.magic 0 as the null value in the array. The heap was in a module, so nothing else could see the array, and I could prove that the code never accessed the null elements, so the use of Obj.magic seemed justified.
In other terms:
" I was walking in the city the other day. I saw a syringe lying on the sidewalk. I stuck the needle in my forearm. That was a classy neighborhood, so the use of the syringe seemed justified. "
Sorry for being sarcastic, but I strongly feel that any suggestion to use Obj functions should be avoided on this list. The OCaml compiler performs some type-dependent optimizations that can result in incorrect code (w.r.t. GC invariants) if wrong types are given using Obj.magic.
- Xavier Leroy
Conclusion
blame is still experimental because it relies on libraries such as
mnet, utcp, vifu, and mkernel, which
are not yet fully finalized. However, at this stage, it has been very enjoyable
to develop a website in OCaml in the form of a unikernel, to transmit
information from one side of the web app to the other, and to avoid
Out_of_memory errors even when handling large amounts of information.
It is also the culmination of a long process in which we achieved several objectives, such as:
- making unikernel development more enjoyable
- having the necessary tools (like
memtrace) to monitor the memory used by the unikernel and avoid leaks (which we observed withmirage-tcpip) - using
cstructless and less (and replacing it with strings) - using fewer and fewer functors
It is also interesting to look beyond the limits in terms of memory usage. From
our observations, we can see that the different stacks (ip, tcp, or http) do
not have any memory leaks (unlike mirage-tcpip). Instead, we observe "stable"
memory usage, where our unikernel allocates everything at once and does not end
up using more memory than it has. From what we know about GC, it attempts to
maintain a ratio between live data and the space available for new allocations.
Of course, GC is not aware of the unikernel's limits (and this may be the major flaw in OCaml 5 at present), but these experiments allow us to go further and with greater confidence in the development of services in the form of unikernels.
So, happy hacking "discriminating hackers"!