blog.robur.coop

The Robur cooperative blog.

A stem engine and a search engine for OCaml

2025-10-29

In our previous article about our email archiving system, we floated the idea of building an email search engine. And now we've done it! In this article, we'll look back at the development process and use it to rummage through the emails related to the development of kvm (a fairly large and public corpus). The ultimate goal is a unikernel that provides, on demand, a search engine for a given archive.

Fetch the beast

The first objective is to obtain a public-inbox archive, which amounts to running git clone on a repository with a specific layout. My experience with Git allowed me to write a small git clone in OCaml over SSH in just a few steps. With this, I could extend my blaze fetch tool, which could already retrieve emails from a POP3 server. It can now retrieve emails from a public-inbox archive:

$ mkdir /tmp/kvm.1
$ blaze fetch --format="/tmp/kvm.1" git@localhost:kvm.1 > emails.txt
$ wc -l emails.txt
35684

Our /tmp/kvm.1 folder is now full of emails! We have ~35k emails, which seems sufficient for a good search corpus. For our search engine, we would like to use the archive system we presented in our previous article. To do this, we simply pack our emails into a single file:

$ blaze pack make -o pack.pack < emails.txt
...
$ ls -sh pack.pack
156M pack.pack
$ ls -sh ~/kvm.1/.git/objects/pack/pack-*.pack
172M ~/kvm.1/.git/objects/pack/pack-*.pack

Our compression ratio remains slightly better than public-inbox's. We had added a post-scriptum on this subject to our previous article, and it is confirmed once again here.

However, even though our format focuses on content by splitting the parts of an email into several objects (rather than storing just the email itself), the structure we currently use is mostly concerned with the isomorphism between the serialisation of our emails into a pack file and their deserialisation.

As far as the search engine is concerned, we need another view of our emails, one that ignores the parts that do not interest us: we would not want to run our search engine over a part containing a base64-encoded image, for example...

To formalise this other view a little more, we first need to understand what a search engine needs and how it works. So we'll put our archive aside for now and start by introducing you to stemming.

Stemming algorithm

A very long time ago, I had the opportunity to do an internship with LIRMM, where a research team was working on natural language analysis. This was well before AI, but it allowed me to read a whole bunch of research papers on the subject.

Tokenisation, morphosyntactic tagging and Chomsky's hierarchy held no secrets for me, but I ended up working with OCaml. However, even back then, I understood the difficulties involved, particularly in balancing the relevance of a result against the cost of the process (in terms of computation and memory). I loved the example of "un rouge", which in French can mean a glass of red wine even though "rouge" primarily refers to a colour. There is a whole field concerning the contextualisation of words, but that's not what we're going to talk about!

Let's remain a little more pragmatic in our approach. Our use of a search engine is more about searching for concepts: concepts that can be expressed with words. The difficulty is that there may be derivations of these words: "du vin rouge", "des champignons rouges". If we search for the concept "rouge", we would want to find documents with the word "rouge" as well as the word "rouges".

We would need a somewhat unique representation of the concept of "rouge", ignoring the subtleties of the language, which may or may not add an "s" to a word. We therefore need a process that gives us a unique representation of the concept embodied by a word. This process necessarily depends on the language: the famous "s" is specific to French, but what about English?

This process is called stemming because some[1] of our languages are based on the idea that a word is fundamentally composed of a root that could be the purest material form of our concept.


1: Stemming remains a fairly Western-centric approach. Other languages, such as Chinese or Japanese, are not constructed in the same way. This always reminds me that the internet, its protocols and its formats are always the product of a social reality. YUSCII could have saved us!

Snowball

Other people, more competent than me, have obviously already asked themselves this question. A project called Snowball exists and aggregates this type of process for a whole range of languages, such as French and English.

Snowball is based on an algorithm, Porter's algorithm, which attempts to refine words to obtain their root form. Here too, let's be pragmatic: the objective is not to be accurate and actually obtain the root form of a word (in the linguistic sense), but rather to obtain a unique form for each concept so that we can differentiate between them.

Based on Porter's algorithm, at least as Snowball presents it, we can imagine a language that describes a process for refining words: the Snowball language.

Our first attempt was to recreate a compiler for this language, but in OCaml (instead of the one coded in C): the intuition being that OCaml is good enough for this.

But my years of OCaml caught up with me, and I decided to build the compiler with GADTs (as I had done with mirage-lambda). To be honest, that was shooting myself in the foot from the start, especially after discovering some rather bizarre grammatical constructions in the Snowball language.

Snowball & unikernels

So let's remain pragmatic. Snowball is a fairly good piece of software that is easy to distribute across platforms. The only thing that would have made us reconsider Snowball is if the stemming algorithms it generates in C required the POSIX API (such as open(2)) to work. And that's not the case!

In other words, we can easily use these C implementations in a unikernel!

Stem, an OCaml wrapper for libstemmer

So we'll be quite direct: we simply offer a library that allows these algorithms (even though they are written in C) to be used from OCaml. We have therefore developed a small stem library that reduces words to their stems:

$ opam install stem
$ stem.stemmer --language french -
rouge
roug
rouges
roug

As we can see, our word "rouge" is reduced to its root "roug" (the concept). We can derive other words from this root, such as "rouges", which brings us back to the same root "roug".

"roug" is not a real word, nor is it a root word in the linguistic sense, but once again, the aim is not to be correct but to be able to differentiate the concepts from each other.

Useless words

We can now obtain concepts; however, certain words have no meaning of their own. In French, some of them are called coordinating conjunctions ("mais ou et donc or ni car"). All these words exist only to link concepts but carry no meaning on their own.

Xapian, a search engine library that allows searching by words, has a list of words that can simply be deleted: it calls these words stopwords. It's a fairly arbitrary list, but our languages are just as arbitrary...
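
As an illustration, here is a minimal sketch of such a filter in OCaml; the word list below is a tiny, hand-picked subset of French stopwords, not Xapian's actual list.

(* A minimal stopword filter. The list below is a tiny, hand-picked
   subset of French stopwords, not Xapian's actual list. *)
let stopwords =
  [ "mais"; "ou"; "et"; "donc"; "or"; "ni"; "car"
  ; "le"; "la"; "les"; "de"; "des"; "du"; "un"; "une" ]

let is_stopword word = List.mem (String.lowercase_ascii word) stopwords

(* Keep only the words that carry some meaning. *)
let filter_stopwords words = List.filter (fun w -> not (is_stopword w)) words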

Tokenisation

We can now obtain the root of a word and filter out the words that are useless to us, but we still need a way to extract these words from a document. This is called tokenisation.

Here too, approaches range from very complex to very simple. In this case, we partially follow a pre-tokenisation implementation used for the BERT model: tokenisers. I say pre-tokenisation because the BERT model goes much further, but in our case we only need to split the text into words. This is therefore mainly done on spaces and punctuation.
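
To give an idea, here is a minimal pre-tokeniser sketched in OCaml; it is not the implementation used in blaze, and it ignores accented (non-ASCII) characters for the sake of brevity.

(* A minimal pre-tokeniser: split on anything that is neither an ASCII
   letter nor a digit. A sketch only; accented characters such as the
   one in "entrecôte" would need proper UTF-8 handling. *)
let tokenize str =
  let is_word_char = function
    | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' -> true
    | _ -> false in
  let buf = Buffer.create 0x10 in
  let words = ref [] in
  let flush () =
    if Buffer.length buf > 0 then begin
      words := String.lowercase_ascii (Buffer.contents buf) :: !words;
      Buffer.clear buf end in
  String.iter
    (fun chr -> if is_word_char chr then Buffer.add_char buf chr else flush ())
    str;
  flush ();
  List.rev !words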

This is also where we can see the differences with other languages. With regard to Japanese, we can mention the CJK tokeniser.

It is now possible to obtain the frequency of occurrence of a concept, according to the language, in a given document:

$ cat >file.txt<<EOF
Une bonne entrecôte avec des champignons et du bon vin rouge.
EOF
$ stem.ts --language french file.txt
"bon",2
"roug",1
"vin",1
"champignon",1
"entrec\195\180t",1

Search engine

@art-w mentioned to me over a drink an interesting algorithm that I had already come across in my research: TF-IDF. In his words: it's an equation that can be intimidating, but it works. There is a whole bibliography on the subject, and the equation may have evolved since the 1970s. One variant that came up during my research is the Okapi BM25 ranking function.

Once again, the equation may seem daunting, but it is only an aggregate of term frequencies, weighted by an inverse frequency that depends on the size of our corpus.
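
For reference, here is the usual form of the Okapi BM25 score for a single term, sketched in OCaml with the common default parameters (k1 = 1.2, b = 0.75); our actual implementation may differ in its details.

(* Okapi BM25 for a single term (sketch). [n] is the number of documents
   containing the term, [total] the number of documents in the corpus,
   [freq] the frequency of the term in the document, [len] the document
   length and [avgdl] the average document length. *)
let k1 = 1.2 and b = 0.75

let idf ~n ~total =
  let n = float_of_int n and total = float_of_int total in
  log (((total -. n +. 0.5) /. (n +. 0.5)) +. 1.0)

let bm25 ~n ~total ~freq ~len ~avgdl =
  let freq = float_of_int freq in
  idf ~n ~total
  *. (freq *. (k1 +. 1.0))
  /. (freq +. k1 *. (1.0 -. b +. (b *. float_of_int len /. avgdl)))

(* The score of a document for a query is the sum of [bm25] over the
   stemmed terms of the query. *)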

We have therefore implemented an algorithm that takes advantage of parallelisation with OCaml 5 and Miou to calculate the term frequencies of each document. We then merge the results to obtain a "bag of inverse frequencies" and the average document size.
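
As a rough (and sequential) sketch of that merging step, assuming a small doc record per document; the real implementation distributes the per-document counting over the available cores with Miou:

(* Per-document term frequencies and the corpus-wide statistics needed
   by BM25: in how many documents each stem appears, and the average
   document length. Sequential sketch; the real implementation counts
   the per-document frequencies in parallel. *)
type doc = { freqs : (string, int) Hashtbl.t; len : int }

type corpus =
  { docs : doc array
  ; doc_freqs : (string, int) Hashtbl.t (* stem -> number of documents *)
  ; avgdl : float }

let corpus_of_docs docs =
  let doc_freqs = Hashtbl.create 0x100 in
  Array.iter
    (fun doc ->
      Hashtbl.iter
        (fun stem _ ->
          let n = Option.value ~default:0 (Hashtbl.find_opt doc_freqs stem) in
          Hashtbl.replace doc_freqs stem (succ n))
        doc.freqs)
    docs;
  let total = Array.fold_left (fun acc doc -> acc + doc.len) 0 docs in
  { docs; doc_freqs
  ; avgdl = float_of_int total /. float_of_int (Array.length docs) }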

With this information, we can calculate the relevance of a document in relation to a series of "concepts" (our stemmed words). Pretty soon, we have our little search engine!

$ opam install bm25
$ echo "Le chat noir sort d'un bar" > file0.txt                                                           
$ echo "Un chat est coincé dans un arbre" > file1.txt                                                     
$ echo "Je dois sortir le chien" > file2.txt                                                             
$ echo "Ils ont un drapeau noir, en berne sur l'espoir, les anarchistes" > file3.txt
$ okapi -d file0.txt -d file1.txt -d file2.txt -d file3.txt "chat noir"
file0.txt: 1.405460
file1.txt: 0.790116
file3.txt: 0.527655
file2.txt: 0.000000

Conclusion about the search engine

At this stage, I do have one concern. If we take a step back and look at our "pipeline" (tokenisation, stemming, search engine), we can legitimately find cases where the results will not match what we expect.

In this case, acronyms (such as... ptt) are not handled. Punctuation can also be a real problem (does "Robur's project" become ["robur"; "s"; "project"] or ["robur"; "project"]?). Finally, Okapi BM25 remains an estimate of relevance...

Above all, this shows that it is still (14 years after my internship at LIRMM...) a fairly active area of research. AI has, of course, changed the game since then.

But there is certainly (and this was the starting point) a question of practicality: it would be difficult to get it right every time. The only remaining question is: what margin of error do we ultimately allow ourselves in our results? The advantage of our pipeline is that it remains fairly simple, and we can understand why it doesn't work in a particular case.

I think that's the big difference with AI models, where reverse-engineering is quite complex. In short, we don't need to be that relevant, and we can already have interesting software without following trends.

Emails and semantic view

Let's go back to our corpus and try to apply our search engine to it. Formally, we need a language supported by our stem library, and we need a view of our emails that only includes what might be interesting to give to our search engine: text!

So we'll use Content-Type and possibly Content-Language! With these two elements, we can obtain a new view of our email that only contains text that can be processed by our search engine.

type 'octet document =
  | Leaf of   { mime : (* text/ *) string
              ; lang : Snowball.Language.t
              ; contents : 'octet }
  | Choose of { mime : (* multipart/ *) string
              ; parts : 'octet document list }

and 'octet t = 'octet document option

The idea is to aggregate only those parts that have a MIME type text/* (such as text/plain or text/html). Then, we must necessarily keep the "structure" of our email, especially if it contains multipart/* parts.
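
As a small sketch of how this view is meant to be consumed (not the actual code of blaze), a fold can collect only the textual leaves, together with their language, which is all the search engine needs:

(* Collect the textual leaves of the semantic view, together with their
   language. A sketch of how the view is consumed, not the actual code. *)
let rec collect acc = function
  | Leaf { lang; contents; _ } -> (lang, contents) :: acc
  | Choose { parts; _ } -> List.fold_left collect acc parts

let leaves : 'octet t -> (Snowball.Language.t * 'octet) list = function
  | None -> []
  | Some document -> List.rev (collect [] document)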

This is more or less similar to the type we described for archiving our emails. However, this structure focuses primarily on what can be extracted from an email for our search engine. We will do this recognition work upstream and save it in our archive to prepare the ground so that we do not have to "analyse" the structure of our emails during the search stage.

To understand the subtle difference, we have created a tool that describes the structure of an email as well as its "semantic" view:

$ blaze pack get pack.idx c1b9bdff6fc84f827093c8d1c57a0c6b259be630 > foo.eml
$ blaze descr foo.eml
┌── signed     
│   ├── text/plain
│   └── application/pgp-signature
$ blaze pack descr pack.idx c1b9bdff6fc84f827093c8d1c57a0c6b259be630
skeleton:                             
┌── <multipart>─┐
│   ├── 9b767fad337f1f787d0199cfd44e90b8efae3239
│   └── ba5fda5b17f43accdc31bfeb5657ce2c8efc3083
semantic:
┌── <signed>─┐
│   └── (plain) 9b767fad337f1f787d0199cfd44e90b8efae3239

In this example, we can clearly see that the "semantic view" does not take the PGP signature of the email into account. Remember? We wouldn't want to search through parts such as base64-encoded images!

In this section, the main focus is on expanding our archiving system to incorporate this new view. We have tested its impact, particularly on the size of the PACK file, and it seems that its impact is really marginal (around 1-2MB extra for our ~35k emails).

What is interesting is that we ultimately pre-calculate the corpus for our search engine, which is not quite our email corpus: we exclude everything that is not text, such as base64-encoded images and PGP signatures.

Mix them all!

So now we're going to test all of this. Our program essentially consists of aggregating the documents (according to our new "semantic" view), applying libstemmer and calculating the frequencies of occurrence of the stemmed words. Finally, we need to calculate the IDF.

At this stage, there is an opportunity we have not yet seized (but should soon). The archiving system is essentially read-only, and the calculation described above can be quite expensive. It can of course be sped up through parallelisation, but we could also very well serialise the result of these calculations into a "pack.stem" file!

Finally, the last step is to process the user's query and score the documents against it.
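
Putting the pieces together, a sketch of that last step could look like the following, reusing the hypothetical tokenize, bm25 and corpus pieces sketched above (stem standing for the libstemmer-based stemming function):

(* Rank the documents of the corpus against a query, reusing the
   [tokenize], [bm25] and [corpus] sketches above. [stem] stands for the
   libstemmer-based stemming function. *)
let search ~stem corpus query =
  let terms = List.map stem (tokenize query) in
  let total = Array.length corpus.docs in
  let score doc =
    List.fold_left
      (fun acc term ->
        match Hashtbl.find_opt doc.freqs term with
        | None -> acc
        | Some freq ->
          let n =
            Option.value ~default:0 (Hashtbl.find_opt corpus.doc_freqs term) in
          acc +. bm25 ~n ~total ~freq ~len:doc.len ~avgdl:corpus.avgdl)
      0.0 terms in
  Array.to_list corpus.docs
  |> List.map (fun doc -> (doc, score doc))
  |> List.sort (fun (_, a) (_, b) -> Float.compare b a)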

To demonstrate the result, we will search for a very specific occurrence, sorry for my language, "fuck". It is said that we should find this type of word quite easily. Let's see who to blame!

$ blaze search pack.idx "fuck"
1bfe69512ee7de5f6acc60f15d8951cf5eae2938: 12.098630
7a534344b98c8e398963dd662e1bd2e08af81213: 11.184771
eed0059577d7825b3565e3df5ae68c438102a455: 8.681113
6ef35b0995e30c47b1f392d317fbaff8e7743c31: 8.368890
4a26b585318c9761db45b313eff5edfb8cd36456: 7.416276
fffaa96708ece505a8f1247b9d537e9845708f23: 0.000000
...
$ blaze pack get pack.idx 1bfe69512ee7de5f6acc60f15d8951cf5eae2938 | tail -n 4
                                      
No, and can people please stop suggesting it?  That macro is so fucking
pointless that it's revolting,

Well, we don't have many emails with that word. But this example is all the more interesting because we can see the value of libstemmer here: the author writes "fucking" and we searched for the word "fuck".

But this shows that at this stage, we do have a small search engine for our emails! For your information, our ~35k emails contain approximately ~20k documents. The search algorithm will therefore be applied to all of these documents (which is quite a large number).

Conclusion & next steps

At this stage, we are beginning to see the pieces of the puzzle fall into place. We have two unikernels that allow us to create a mailing list with all the necessary security measures to exchange with other services (such as Gmail).

We have our email archiving system, which will allow us to issue an archive of our mailing list.

Finally, we have a search engine that works with our archiving system!

As for ptt, all that's missing is the mailing list administration (adding and removing users). But we're not far from achieving that.

Tools such as blaze and ptt will soon be available after extensive testing. This piece of the puzzle was all the more important because it will allow us to go further with IMAP (whose protocol includes a SEARCH command).

In short, happy hacking!