Git, Carton and emails
We are pleased to announce the release of Carton 1.0.0 and Cachet. You can find an overview of these libraries in our announcement on the OCaml forum. This article goes into more detail about the PACK format and its use for archiving your emails.
Back to Git and patches
In our Carton announcement, we talked about the two levels of compression applied to Git objects: zlib compression, and compression between objects using patches.
For example, if we have 2 blobs (2 versions of a file), one containing 'A' and the other containing 'A+B', the second blob will probably be saved as a patch that requires the contents of the first blob and adds '+B'. At a higher level, and given how we use Git, we can see why this second level of compression is very interesting: in a project, we generally just add a few lines to our files (such as introducing a new function) or delete a few (removing code).
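The 'A'/'A+B' example can be made concrete. In the spirit of the PACK delta format, a patch is a sequence of instructions: copy a byte range from the source object, or insert literal bytes carried by the patch itself. The names below are illustrative, not Carton's actual API:

```ocaml
(* A patch: copy ranges from the source object, or insert literal
   bytes. Illustrative sketch, not Carton's actual API. *)
type instr =
  | Copy of int * int (* offset and length in the source object *)
  | Insert of string  (* literal bytes carried by the patch *)

let apply ~source patch =
  let buf = Buffer.create (String.length source) in
  List.iter
    (function
      | Copy (off, len) -> Buffer.add_substring buf source off len
      | Insert lit -> Buffer.add_string buf lit)
    patch;
  Buffer.contents buf
```

Rebuilding the second blob from the first would then look like `apply ~source:"A" [ Copy (0, 1); Insert "+B" ]`: only the two new bytes travel in the patch, the rest is referenced from the source.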
However, there is a bias between what Git does and what we perceive. When it comes to patching in Git, we often think of the patience diff or the Eugene Myers diff. While these offer the advantage of readability (it is easy to see what was added or deleted between two files), they are not necessarily optimal for producing a small patch.
In reality, when storing these patches or transmitting them over the network, what interests us is not their readability but how optimally they capture what is common between two files and what is not. This is where duff comes in.
This is a small library that generates a patch between two files from the series of bytes common to both. We talk about 'series of bytes' here because the elements common to our two files are not necessarily human-readable. To find these common byte series, we use Rabin's fingerprint algorithm: a rolling hash used since time immemorial.
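To illustrate the idea of a rolling hash (without claiming to reproduce duff's actual implementation), here is a minimal Rabin–Karp-style fingerprint: the hash of a window can be updated in constant time when the window slides one byte to the right, which is what makes scanning a whole file for common byte series cheap:

```ocaml
(* Minimal polynomial rolling hash, in the spirit of Rabin's
   fingerprint. Illustration only; duff's real implementation differs. *)
let base = 257
let modulus = 1_000_000_007

(* base^e mod modulus; base^(w-1) is needed to remove the outgoing byte. *)
let rec pow b e = if e = 0 then 1 else b * pow b (e - 1) mod modulus

(* Hash of the w-byte window of s starting at i, computed from scratch. *)
let hash_window s i w =
  let h = ref 0 in
  for j = i to i + w - 1 do
    h := (!h * base + Char.code s.[j]) mod modulus
  done;
  !h

(* Slide the window one byte to the right in O(1): remove the
   outgoing byte's contribution, then append the incoming byte. *)
let roll ~pow_w h ~outgoing ~incoming =
  let h = (h - (Char.code outgoing * pow_w) mod modulus + modulus) mod modulus in
  (h * base + Char.code incoming) mod modulus
```

The point of the rolling property is that hashing every window of a file costs O(n) rather than O(n × w), so candidate matches between two files can be found by comparing fingerprints.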
Patches and emails
So, as far as emails are concerned, it's fairly obvious that there are many "words" common to all of them. The simple word From: should exist in all your emails.
From this simple idea, we can understand the impact: the headers of your emails are more or less similar and have more or less the same content. The idea of duff, applied to your emails, is to consider the other emails as "slightly" different versions of your first email:
- we can store a single raw email
- and we build patches of all your other emails from this first one
A fairly concrete example of this compression through patches is 2 notification emails from GitHub: they are quite similar, particularly in the header. Even the content is just as similar: the HTML remains the same, only the comment differs.
$ carton diff github.01.eml github.02.eml -o patch.diff
$ du -sb github.01.eml github.02.eml patch.diff
9239 github.01.eml
9288 github.02.eml
5136 patch.diff
This example shows that our patch for rebuilding github.02.eml from github.01.eml is almost 2 times smaller than the email itself. In the PACK format, this patch will also be compressed with zlib (we can reach ~2900 bytes, so 3 times smaller).
Compress and compress!
To put this into perspective, a compression algorithm like zlib alone can also reach such a ratio (3 times smaller), but it must also serialise the Huffman tree it uses (in the general case). What we observe is that concatenating separately compressed emails makes it difficult to maintain such a ratio. Better still, concatenating all the emails first and compressing the whole gives us a better ratio!
That's what the PACK file is all about: the aim is to concatenate these compressed emails while keeping an interesting overall compression ratio. This is the reason for the patches: to reduce the objects even further so that the impact of the zlib header on all our objects is minimal and, above all, so that we can access objects without having to decompress the previous ones (as we would have to do with a *.tar.gz archive, for example).
The initial intuition about the emails was right: they do indeed share quite a few elements, and in the end we were able to save ~4000 bytes in our GitHub notification example.
Isomorphism, DKIM and ARC
One attribute that we wanted to pay close attention to throughout our experimentation was "isomorphism". This property is very simple: imagine a function that takes an email as input and transforms it into another value using a method (such as compression). Isomorphism ensures that we can 'undo' this method and obtain exactly the same result again:
decode(encode(x)) == x
This property is very important for emails because emails contain signatures, and these signatures are computed from the email's content. If the email changes, the signatures change too.
For instance, the DKIM signature allows you to sign an email and check its integrity on receipt. ARC (which will be our next objective) also signs your emails, but goes one step further: all the relays that receive your email and send it back to the real destination must add a new ARC signature, just like adding a new block to the Bitcoin blockchain.
So you need to make sure that the way you serialise your emails (into a PACK file) doesn't alter their content, in order to keep these signatures valid! It just so happens that here too we have a lot of experience from Git, which has the same constraint with its Merkle tree. For our part, we've developed a library that generates an encoder and a decoder from a single description, and that respects the isomorphism property by construction: the encore library.
We could then store our emails as they are in the PACK file. However, the advantage of duff really comes into play when several objects are similar. In the case of Git, tree objects are similar to each other but not to commits, for example. For emails, there is a similar distinction: email headers are similar to each other, but they are not similar to email contents.
We can therefore try to "split" emails into 2 parts, the header on one side and the content on the other. We would then have a third value telling us how to reconstruct the complete email (i.e. where the header is and where the content is).
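As a deliberately naive first approximation of this split (real emails need the full multipart skeleton described just after), splitting at the first blank line and keeping the separator with the header is enough to guarantee that concatenation restores the original bytes exactly, which is the isomorphism property we need:

```ocaml
(* Naive split of an email into header and content at the first blank
   line (CRLF CRLF). Keeping the separator with the header guarantees
   that [join] restores the original bytes exactly. Sketch only; real
   emails need the full multipart skeleton. *)
let find_sub s sub =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else go (i + 1)
  in
  go 0

let split email =
  match find_sub email "\r\n\r\n" with
  | Some i ->
      let cut = i + 4 in
      ( String.sub email 0 cut,
        String.sub email cut (String.length email - cut) )
  | None -> (email, "")

let join (header, content) = header ^ content
```

Here the "third value" is implicit (the separator stays attached to the header); the real format must record it explicitly, since signatures are computed over every byte.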
However, after years of reading email RFCs, we know that things are much more complex. Above all, this experience has enabled me to synthesise a skeleton that all emails share:
(* multipart-body :=
     [preamble CRLF]
     --boundary transport-padding CRLF
     part
     ( CRLF --boundary transport-padding CRLF part )*
     CRLF
     --boundary-- transport-padding
     [CRLF epilogue]

   part := headers ( CRLF body )? *)
type transport_padding = string

type 'octet body =
  | Multipart of 'octet multipart
  | Single of 'octet option
  | Message of 'octet t

and 'octet part = { headers : 'octet; body : 'octet body }

and 'octet multipart =
  { preamble : string
  ; epilogue : string * transport_padding
  ; boundary : string
  ; parts : (transport_padding * 'octet part) list }

and 'octet t = 'octet part
As you can see, the distinction is not only between the header and the content but also between the parts of an email as soon as it has an attachment. You can also have an email inside an email (and I'm always surprised to see how frequent this particular case is). Finally, there's the annoying preamble and epilogue of a multipart email, which are often empty but necessary: isomorphism must hold even for "useless" bytes, since they count for signatures.
We'll therefore need to serialise this structure, and all we have to do is transform a string t into a SHA1.t t so that our structure no longer contains the actual contents of our emails but unique identifiers referring to those contents, which will be available in our PACK file.
module Format : sig
  val t : SHA1.t Encore.t
end

let decode str =
  let parser = Encore.to_angstrom Format.t in
  Angstrom.parse_string ~consume:All parser str

let encode t =
  let emitter = Encore.to_lavoisier Format.t in
  Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
However, we need to check that the isomorphism is respected. Note that work has already been done on this subject for Mr. MIME with the afl fuzzer: checking our assertion x == decode(encode(x)). This ability to check isomorphism with afl also enabled us to use it to generate valid random emails. This allows me to reintroduce the hamlet project, perhaps the biggest database of valid (but incomprehensible) emails. So we've checked that our encoder/decoder for "splitting" emails respects isomorphism on this million emails.
Carton, POP3 & mbox, some metrics
We can therefore split an email into several parts and calculate an optimal patch between two similar pieces of content. So now you can start packaging! This is where I'm going to reintroduce a tool that hasn't been released yet, but which allows me to go even further with emails: blaze.
This little tool is my Swiss army knife for emails! And it's in this tool that we're going to have fun deriving Carton so that it can manipulate emails rather than Git objects. So we've implemented the very basic POP3 protocol (and thanks to ocaml-tls for offering a free encrypted connection) as well as the mbox format.
Neither is recommended. The first is an old protocol, and interacting with Gmail, for example, is very slow. The second is an old, non-standardised format for storing your emails, which unfortunately may be the format used by your email client. After resolving a few bugs, such as the unspecified behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox format, you'll end up with lots of emails that you'll have fun packaging!
$ mkdir mailbox
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
-u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
$ blaze.pack make -o mailbox.pack mails.lst
$ tar czf mailbox.tar.gz mailbox
$ du -sh mailbox mailbox.pack mailbox.tar.gz
97M mailbox
28M mailbox.pack
23M mailbox.tar.gz
In this example, we download the emails from the last 30 days via POP3 and store them in the mailbox/ folder. This folder weighs 97M; if we compress it with tar and gzip, we end up with 23M. The problem is that we need to decompress the mailbox.tar.gz archive to extract the emails.
This is where the PACK file comes in handy: it weighs only 28M (so we're very close to what tar and gzip can do), but we can rebuild our emails without unpacking everything:
$ blaze.pack index mailbox.pack
$ blaze.pack list mailbox.pack | head -n1
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
Delivered-To: romain.calascibetta@gmail.com
...
Like Git, we now associate a hash with each email and can retrieve it using this hash. Like Git, we also calculate the *.idx file that maps a hash to the position of the email in our PACK file. Just like Git (with git show or git cat-file), we can now access our emails very quickly. So we now have a (read-only) database system for our emails: we can archive them!
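Conceptually, the *.idx file is just a map from an object's hash to its offset in the PACK file (on disk it is a sorted table that can be searched by dichotomy). A minimal in-memory model, with hypothetical names, looks like this:

```ocaml
(* In-memory model of a *.idx file: hash -> offset into the PACK file.
   Hypothetical names; the real on-disk format is a sorted table. *)
let index : (string, int) Hashtbl.t = Hashtbl.create 0x100

let add ~hash ~offset = Hashtbl.replace index hash offset
let offset_of_hash hash = Hashtbl.find_opt index hash
```

For the first object listed above, the lookup would return offset 12 (the 0000000c shown by blaze.pack list); the reader then seeks to that offset and inflates the object, without touching the rest of the file.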
Let's have a closer look at this PACK file. We've developed a tool more or less similar to git verify-pack, which lists all the objects in our PACK file and, above all, gives us information such as the number of bytes needed to store each object:
$ blaze.pack verify mailbox.pack
4e9795e268313245f493d9cef1b5ccf30cc92c33 a 12 186 6257b7d4
...
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b 666027 223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
It shows the hash of each object, its type (a for the structure of an email, b for a piece of content), its position in the PACK file, the number of bytes used to store the object (!) and finally the depth of the patch, the checksum, and the source of the patch needed to rebuild the object.
Here, our first object is not patched, but the next object is. Note that it only needs 223 bytes in the PACK file. But what is the real size of this object?
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
--raw --without-metadata | wc -c
2014
So we've gone from 2014 bytes to 223 bytes! That's almost a compression ratio of 10! In this case, the object is the content of an email. Guess which one? A GitHub notification! If we go back to our very first example, we saw that we could compress with a ratio close to 2. We could go further with zlib: we compress the patch too. This example allows us to introduce one last feature of PACK files: the depth.
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
kind: b
length: 2014 byte(s)
depth: 10
cache misses: 586
cache hits: 0
tree: 000026ab
Δ 00007f78
...
Δ 0009ef74
Δ 000a29ab
...
In our example, our object requires a source which, in turn, is a patch requiring another source, and so on (you can see this chain in the tree). The length of this patch chain corresponds to the depth of our object. There is therefore a succession of patches between objects. What Carton tries to do is find the best patch from a window of candidates and keep the best ones for reuse. If we unroll this chain of patches, we find a "base" object (at 0x000026ab) that is simply compressed with zlib, and this base is itself the content of another GitHub notification email. This shows that Carton does find the best candidate for the patch: similar content, in this case another GitHub notification.
The idea is to sacrifice a little computing time (rebuilding objects via their patches) to gain in compression ratio. It's fair to say that a very long patch chain can degrade performance; this is why Git and Carton limit a chain to at most 50 patches. Another point is the search for the candidate source of a patch, which is often physically close to it (within a few bytes): reading the PACK file by page (thanks to Cachet) sometimes gives access to 3 or 4 objects at once, which have a good chance of being patched together.
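The chain of patches can be modelled very simply: an object is either a base (only zlib-compressed) or a patch applied to a source object. Rebuilding walks the chain down to the base, and the number of steps is exactly the depth reported above. This is an illustrative model, not Carton's API:

```ocaml
(* An object is a base, or a patch over a source object. Rebuilding
   unrolls the chain; its length is the object's depth, capped at 50
   in both Git and Carton. Illustrative model, not Carton's API. *)
type obj =
  | Base of string
  | Patch of obj * (string -> string) (* source, patch application *)

let max_depth = 50

let rec rebuild ?(depth = 0) obj =
  assert (depth <= max_depth);
  match obj with
  | Base data -> data
  | Patch (source, apply) -> apply (rebuild ~depth:(depth + 1) source)
```

This makes the trade-off visible: a deeper chain means more patch applications per extraction, which is why the depth limit exists.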
Let's take the example of Carton and a Git object:
$ carton get pack-*.idx eaafd737886011ebc28e6208e03767860c22e77d
...
cache misses: 62
cache hits: 758
tree: 160720bb
Δ 160ae4bc
Δ 160ae506
Δ 160ae575
Δ 160ae5be
Δ 160ae5fc
Δ 160ae62f
Δ 160ae667
Δ 160ae6a5
Δ 160ae6db
Δ 160ae72a
Δ 160ae766
Δ 160ae799
Δ 160ae81e
Δ 160ae858
Δ 16289943
We can see here that we had to load 62 pages, but that we reused already-read pages 758 times. We can also see that the offsets of the patches (visible in the tree) are always close: the objects often follow each other.
Mbox and real emails
In a way, the concrete cases used here are my own emails. There may be a fairly simple bias: all these emails have the same destination, romain.calascibetta@gmail.com. This common point can have a significant impact on compression with duff. We will therefore try another corpus, the archives of certain mailing lists related to OCaml: lists.ocaml.org-archive
$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
-o opam-devel.pack
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
> opam-devel.mbox.gzip
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
3.9M opam-devel.pack
2.0M opam-devel.mbox.gzip
10M lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
The compression ratio is a bit worse than before, but we're still onto something interesting. Here again we can take an object from our PACK file and see how the compression between objects behaves:
$ blaze.pack index opam-devel.pack
...
09bbd28303c8aafafd996b56f9c071a3add7bd92 b 362504 271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
--without-metadata --raw | wc -c
2098
Once again, we see a ratio of 10! Here the object corresponds to the header of an email, compressed against other email headers. This is the situation where fields are common to all your emails (From, Subject, etc.). Carton successfully patches headers together and email contents together.
Next steps
All the work done on email archiving is aimed at producing a unikernel (void) that could archive all incoming emails. This unikernel could then send the archive back (via an email!) to whoever wants it. This is one of our goals in implementing a mailing list in OCaml with unikernels.
Another objective is to create a database system for emails in order to offer two features that we consider important to the user:
- quick and easy access to emails
- disk space savings through compression
With this system, we can extend the method of indexing emails with other information such as the keywords found in the emails. This will enable us, among other things, to create an email search engine!
Conclusion
This milestone in our PTT project took quite a long time, as we were very interested in metrics such as compression ratio and execution speed.
The experience we'd gained with emails (in particular with Mr. MIME) enabled us to move a little faster, especially in terms of serializing our emails. Our experience with ocaml-git also enabled us to identify the benefits of the PACK file for emails.
But the development of Miou was particularly helpful with regard to program execution time, thanks to the ability to parallelize certain computations quite easily.
The format is still a little rough and not quite ready for the development of a keyword-based email indexing system, but it provides a good basis for the rest of our project.
So, if you like what we're doing and want to help, you can make a donation via GitHub or using our IBAN.
Postscript
This little note is an extension of the feedback we got on the Carton release. In this case, nojb pointed to the public-inbox software, the archiver of the various Linux kernel mailing lists. It is based on the same intuition we had, namely using the PACK format to archive emails.
The question then arises: are we reinventing the wheel?
In truth, the devil is in the details. As it happens, you can download the LKML mailing list archives with Git in this way:
$ git clone --mirror http://lore.kernel.org/lkml/15 lkml/git/15.git
$ cd lkml/git/15.git
$ du -sh objects/pack/pack-*.pack
981M objects/pack/pack-*.pack
$ cd objects/pack/
$ mkdir loose
$ carton explode 'loose/%s/%s' pack-*.pack
$ du -sh loose/c/
2.7G loose/c
public-inbox is based not only on the PACK format for email archiving, but also on Git concepts. Such a Git repository actually contains only a single file, m, corresponding to the last email received on the mailing list. The other emails are "old versions of this email": public-inbox considers a certain form of versioning between emails, where each commit is a new email that "replaces" the previous one.
Heuristics to patch
public-inbox then relies on the heuristics implemented by Git to find the best candidate for patching emails. These heuristics are explained here. The idea is to take as the base object (the source of several patches) the last version of your file (in the case of public-inbox, the last email received) and to build the previous versions as patches against this base object. The heuristic comes from the observation that software files tend to grow over time: the latest version is therefore the most likely to contain all the similarities with previous versions.
Once again, when it comes to code, we tend to add code. So we should be able to use all the occurrences available in the latest version of a file to produce patches for earlier versions.
Comparison
Let's have some fun comparing public-inbox and the blaze tool:
+-----------+-------+--------------+------+
|           | blaze | public-inbox | raw  |
+-----------+-------+--------------+------+
| caml-list | 160M  | 154M         | 425M |
+-----------+-------+--------------+------+
| lkml.15   | 1.1G  | 981M         | 2.7G |
+-----------+-------+--------------+------+
| kvm.0     | 1.2G  | 1.1G         | 3.1G |
+-----------+-------+--------------+------+
The first thing you'll notice is that blaze produces PACK files a little larger than those produced by Git. The thing is, blaze doesn't store exactly the same bytes! The emails it stores have lines ending in \r\n, whereas public-inbox stores emails with \n. It may be just one small character, but multiplied by the number of lines in an email and the number of emails in the archive, it adds up.
It's also true that decompress, the OCaml implementation of zlib, is not as efficient as its C competitor in terms of ratio. So this is a disadvantage we have which is not linked to the way we generate the PACK file (we could replace decompress with zlib!).
However, there's another interesting metric between what we produce and what public-inbox does. It's important to understand that we maintain "some compatibility" with the Git PACK file: the objects aren't the same and don't have the same meaning, but it's still a PACK file. As such, we can use git verify-pack on our archive as well as on the public-inbox archive:
+-----------+-----------------+------------------------+
|           | PACK from blaze | PACK from public-inbox |
+-----------+-----------------+------------------------+
| caml-list | ~2.5s           | ~4.1s                  |
+-----------+-----------------+------------------------+
| lkml.15   | ~14.7s          | ~16.3s                 |
+-----------+-----------------+------------------------+
| kvm.0     | ~18s            | ~21s                   |
+-----------+-----------------+------------------------+
The analysis of our PACK files is faster than that of public-inbox's. This is where we need to understand what we're trying to store and how. When it comes to finding a candidate for a patch, blaze relies solely on the similarities between two objects/emails, whereas public-inbox, via Git's heuristics, will still prioritize a patch between emails that follow each other in time, via "versioning".
The implication is that the last 2 emails may have no similarity at all, but Git/public-inbox will still try to patch them together, as one is the previous version (in time) of the other.
Another aspect is that Git sometimes breaks the patch chain so that, when extracting an object that is a patch, its source isn't very far away in the PACK file. Git prefers to patch an object against a source that may be less good but close to it, rather than keeping the best candidate as the source of all patches. Here too, blaze behaves differently: we try to keep and reuse the best candidate as much as possible.
A final difference, which may also be important, is the way emails are stored. As mentioned above, our emails are stored "split", whereas public-inbox stores them as they are. One implication concerns the extraction of an attachment. With blaze, we just have to extract the skeleton of the email, search the various headers for the desired attachment, and extract it as is, since it is a full-fledged object in our PACK file. With public-inbox, we'd have to extract the email, parse it, search for the part containing the attachment according to its headers, and finally extract the attachment.
Conclusion
If we had to draw a "meta" conclusion from the differences between blaze and public-inbox, it's that our tool focuses on the content of your emails, whereas public-inbox focuses on their history. As such, and in the hope of making an OCaml-based email client, we believe our approach remains interesting.
But these experiments have shown us 2 important things:
- we're capable of handling millions of emails, parsing and storing them. It's pretty impressive to see our tool handle almost a million emails (kvm.0) without any bugs!
- our initial intuition remains valid. Even if the path seems subtly different from what public-inbox does, our approach is clearly the right one and keeps us going.