Git, Carton and emails
We are pleased to announce the release of Carton 1.0.0 and Cachet. You can find an overview of these libraries in our announcement on the OCaml forum. This article goes into more detail about the PACK format and its use for archiving your emails.
Back to Git and patches
In our Carton announcement, we talked about the two levels of compression applied to Git objects: zlib compression, and compression between objects using patches.
For example, if we have 2 blobs (2 versions of a file), one containing 'A' and the other containing 'A+B', the second blob will probably be saved as a patch that requires the contents of the first blob and adds '+B'. At a higher level, and given how we use Git, we can see why this second level of compression is very interesting: in a project, we generally just add a few lines to our files (such as introducing a new function) or delete a few (removing code).
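The 'A'/'A+B' example can be made concrete. In the spirit of the PACK delta format, a patch is a sequence of instructions: copy a byte range from the source object, or insert literal bytes carried by the patch itself. The names below are illustrative, not Carton's actual API:

```ocaml
(* A patch: copy ranges from the source object, or insert literal
   bytes. Illustrative sketch, not Carton's actual API. *)
type instr =
  | Copy of int * int (* offset and length in the source object *)
  | Insert of string  (* literal bytes carried by the patch *)

let apply ~source patch =
  let buf = Buffer.create (String.length source) in
  List.iter
    (function
      | Copy (off, len) -> Buffer.add_substring buf source off len
      | Insert lit -> Buffer.add_string buf lit)
    patch;
  Buffer.contents buf
```

Rebuilding the second blob from the first would then look like `apply ~source:"A" [ Copy (0, 1); Insert "+B" ]`: only the two new bytes travel in the patch, the rest is referenced from the source.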
However, there is a bias between what Git does and what we perceive. When it comes to patching in Git, we often think of the patience diff or the Eugene Myers diff. While these offer the advantage of readability (it is easy to see what was added or deleted between two files), they are not necessarily optimal for producing a small patch.
In reality, when storing these patches or transmitting them over the network, what interests us is not their readability but how optimally they capture what is common between two files and what is not. This is where duff comes in.
This is a small library that generates a patch between two files from the series of bytes common to both. We talk about 'series of bytes' here because the elements common to our two files are not necessarily human-readable. To find these common byte series, we use Rabin's fingerprint algorithm: a rolling hash used since time immemorial.
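To illustrate the idea of a rolling hash (without claiming to reproduce duff's actual implementation), here is a minimal Rabin–Karp-style fingerprint: the hash of a window can be updated in constant time when the window slides one byte to the right, which is what makes scanning a whole file for common byte series cheap:

```ocaml
(* Minimal polynomial rolling hash, in the spirit of Rabin's
   fingerprint. Illustration only; duff's real implementation differs. *)
let base = 257
let modulus = 1_000_000_007

(* base^e mod modulus; base^(w-1) is needed to remove the outgoing byte. *)
let rec pow b e = if e = 0 then 1 else b * pow b (e - 1) mod modulus

(* Hash of the w-byte window of s starting at i, computed from scratch. *)
let hash_window s i w =
  let h = ref 0 in
  for j = i to i + w - 1 do
    h := (!h * base + Char.code s.[j]) mod modulus
  done;
  !h

(* Slide the window one byte to the right in O(1): remove the
   outgoing byte's contribution, then append the incoming byte. *)
let roll ~pow_w h ~outgoing ~incoming =
  let h = (h - (Char.code outgoing * pow_w) mod modulus + modulus) mod modulus in
  (h * base + Char.code incoming) mod modulus
```

The point of the rolling property is that hashing every window of a file costs O(n) rather than O(n × w), so candidate matches between two files can be found by comparing fingerprints.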
Patches and emails
So, as far as emails are concerned, it's fairly obvious that there are many "words" common to all of them. The simple word From: should exist in all your emails.
From this simple idea, we can understand the impact: the headers of your emails are more or less similar and have more or less the same content. The idea of duff, applied to your emails, is to consider the other emails as "slightly" different versions of your first email:
- we can store a single raw email
- and we build patches of all your other emails from this first one
A fairly concrete example of this compression through patches is 2 notification emails from GitHub: they are quite similar, particularly in the header. Even the content is just as similar: the HTML remains the same, only the comment differs.
$ carton diff github.01.eml github.02.eml -o patch.diff
$ du -sb github.01.eml github.02.eml patch.diff
9239 github.01.eml
9288 github.02.eml
5136 patch.diff
This example shows that our patch for rebuilding github.02.eml from github.01.eml is almost 2 times smaller than the email itself. In the PACK format, this patch will also be compressed with zlib (we can reach ~2900 bytes, so 3 times smaller).
Compress and compress!
To put this into perspective, a compression algorithm like zlib alone can also reach such a ratio (3 times smaller), but it must also serialise the Huffman tree it uses (in the general case). What we observe is that concatenating separately compressed emails makes it difficult to maintain such a ratio. Better still, concatenating all the emails first and compressing the whole gives us a better ratio!
That's what the PACK file is all about: the aim is to concatenate these compressed emails while keeping an interesting overall compression ratio. This is the reason for the patches: to reduce the objects even further so that the impact of the zlib header on all our objects is minimal and, above all, so that we can access objects without having to decompress the previous ones (as we would have to do with a *.tar.gz archive, for example).
The initial intuition about the emails was right: they do indeed share quite a few elements, and in the end we were able to save ~4000 bytes in our GitHub notification example.
Isomorphism, DKIM and ARC
One attribute that we wanted to pay close attention to throughout our experimentation was "isomorphism". This property is very simple: imagine a function that takes an email as input and transforms it into another value using a method (such as compression). Isomorphism ensures that we can 'undo' this method and obtain exactly the same result again:
decode(encode(x)) == x
This property is very important for emails because emails contain signatures, and these signatures are computed from the email's content. If the email changes, the signatures change too.
For instance, the DKIM signature allows you to sign an email and check its integrity on receipt. ARC (which will be our next objective) also signs your emails, but goes one step further: all the relays that receive your email and send it back to the real destination must add a new ARC signature, just like adding a new block to the Bitcoin blockchain.
So you need to make sure that the way you serialise your emails (into a PACK file) doesn't alter their content, in order to keep these signatures valid! It just so happens that here too we have a lot of experience from Git, which has the same constraint with its Merkle tree. For our part, we've developed a library that generates an encoder and a decoder from a single description, and that respects the isomorphism property by construction: the encore library.
We could then store our emails as they are in the PACK file. However, the advantage of duff really comes into play when several objects are similar. In the case of Git, tree objects are similar to each other but not to commits, for example. For emails, there is a similar distinction: email headers are similar to each other, but they are not similar to email contents.
We can therefore try to "split" emails into 2 parts, the header on one side and the content on the other. We would then have a third value telling us how to reconstruct the complete email (i.e. where the header is and where the content is).
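As a deliberately naive first approximation of this split (real emails need the full multipart skeleton described just after), splitting at the first blank line and keeping the separator with the header is enough to guarantee that concatenation restores the original bytes exactly, which is the isomorphism property we need:

```ocaml
(* Naive split of an email into header and content at the first blank
   line (CRLF CRLF). Keeping the separator with the header guarantees
   that [join] restores the original bytes exactly. Sketch only; real
   emails need the full multipart skeleton. *)
let find_sub s sub =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else go (i + 1)
  in
  go 0

let split email =
  match find_sub email "\r\n\r\n" with
  | Some i ->
      let cut = i + 4 in
      ( String.sub email 0 cut,
        String.sub email cut (String.length email - cut) )
  | None -> (email, "")

let join (header, content) = header ^ content
```

Here the "third value" is implicit (the separator stays attached to the header); the real format must record it explicitly, since signatures are computed over every byte.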
However, after years of reading email RFCs, we know that things are much more complex. Above all, this experience has enabled me to synthesise a skeleton that all emails share:
(* multipart-body :=
     [preamble CRLF]
     --boundary transport-padding CRLF
     part
     ( CRLF --boundary transport-padding CRLF part )*
     CRLF
     --boundary-- transport-padding
     [CRLF epilogue]

   part := headers ( CRLF body )? *)
type transport_padding = string

type 'octet body =
  | Multipart of 'octet multipart
  | Single of 'octet option
  | Message of 'octet t

and 'octet part = { headers : 'octet; body : 'octet body }

and 'octet multipart =
  { preamble : string
  ; epilogue : string * transport_padding
  ; boundary : string
  ; parts : (transport_padding * 'octet part) list }

and 'octet t = 'octet part
As you can see, the distinction is not only between the header and the content but also between the parts of an email as soon as it has an attachment. You can also have an email inside an email (and I'm always surprised to see how frequent this particular case is). Finally, there's the annoying preamble and epilogue of a multipart email, which are often empty but necessary: isomorphism must hold even for "useless" bytes, since they count for signatures.
We'll therefore need to serialise this structure, and all we have to do is transform a string t into a SHA1.t t so that our structure no longer contains the actual contents of our emails but unique identifiers referring to those contents, which will be available in our PACK file.
module Format : sig
  val t : SHA1.t Encore.t
end

let decode str =
  let parser = Encore.to_angstrom Format.t in
  Angstrom.parse_string ~consume:All parser str

let encode t =
  let emitter = Encore.to_lavoisier Format.t in
  Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
However, we need to check that the isomorphism is respected. Note that work has already been done on this subject for Mr. MIME with the afl fuzzer: checking our assertion x == decode(encode(x)). This ability to check isomorphism with afl also enabled us to use it to generate valid random emails. This allows me to reintroduce the hamlet project, perhaps the biggest database of valid (but incomprehensible) emails. So we've checked that our encoder/decoder for "splitting" emails respects isomorphism on this million emails.
Carton, POP3 & mbox, some metrics
We can therefore split an email into several parts and calculate an optimal patch between two similar pieces of content. So now you can start packaging! This is where I'm going to reintroduce a tool that hasn't been released yet, but which allows me to go even further with emails: blaze.
This little tool is my Swiss army knife for emails! And it's in this tool that we're going to have fun deriving Carton so that it can manipulate emails rather than Git objects. So we've implemented the very basic POP3 protocol (and thanks to ocaml-tls for offering a free encrypted connection) as well as the mbox format.
Neither is recommended. The first is an old protocol, and interacting with Gmail, for example, is very slow. The second is an old, non-standardised format for storing your emails, which unfortunately may be the format used by your email client. After resolving a few bugs, such as the unspecified behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox format, you'll end up with lots of emails that you'll have fun packaging!
$ mkdir mailbox
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
-u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
$ blaze.pack make -o mailbox.pack mails.lst
$ tar czf mailbox.tar.gz mailbox
$ du -sh mailbox mailbox.pack mailbox.tar.gz
97M mailbox
28M mailbox.pack
23M mailbox.tar.gz
In this example, we download the emails from the last 30 days via POP3 and store them in the mailbox/ folder. This folder weighs 97M; if we compress it with tar and gzip, we end up with 23M. The problem is that we need to decompress the mailbox.tar.gz archive to extract the emails.
This is where the PACK file comes in handy: it weighs only 28M (so we're very close to what tar and gzip can do), but we can rebuild our emails without unpacking everything:
$ blaze.pack index mailbox.pack
$ blaze.pack list mailbox.pack | head -n1
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
Delivered-To: romain.calascibetta@gmail.com
...
Like Git, we now associate a hash with each email and can retrieve it using this hash. Like Git, we also calculate the *.idx file that maps a hash to the position of the email in our PACK file. Just like Git (with git show or git cat-file), we can now access our emails very quickly. So we now have a (read-only) database system for our emails: we can archive them!
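Conceptually, the *.idx file is just a map from an object's hash to its offset in the PACK file (on disk it is a sorted table that can be searched by dichotomy). A minimal in-memory model, with hypothetical names, looks like this:

```ocaml
(* In-memory model of a *.idx file: hash -> offset into the PACK file.
   Hypothetical names; the real on-disk format is a sorted table. *)
let index : (string, int) Hashtbl.t = Hashtbl.create 0x100

let add ~hash ~offset = Hashtbl.replace index hash offset
let offset_of_hash hash = Hashtbl.find_opt index hash
```

For the first object listed above, the lookup would return offset 12 (the 0000000c shown by blaze.pack list); the reader then seeks to that offset and inflates the object, without touching the rest of the file.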
Let's have a closer look at this PACK file. We've developed a tool more or less similar to git verify-pack, which lists all the objects in our PACK file and, above all, gives us information such as the number of bytes needed to store each object:
$ blaze.pack verify mailbox.pack
4e9795e268313245f493d9cef1b5ccf30cc92c33 a 12 186 6257b7d4
...
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b 666027 223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
It shows the hash of each object, its type (a for the structure of an email, b for a piece of content), its position in the PACK file, the number of bytes used to store the object (!) and finally the depth of the patch, the checksum, and the source of the patch needed to rebuild the object.
Here, our first object is not patched, but the next object is. Note that it only needs 223 bytes in the PACK file. But what is the real size of this object?
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
--raw --without-metadata | wc -c
2014
So we've gone from 2014 bytes to 223 bytes! That's almost a compression ratio of 10! In this case, the object is the content of an email. Guess which one? A GitHub notification! If we go back to our very first example, we saw that we could compress with a ratio close to 2. We could go further with zlib: we compress the patch too. This example allows us to introduce one last feature of PACK files: the depth.
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
kind: b
length: 2014 byte(s)
depth: 10
cache misses: 586
cache hits: 0
tree: 000026ab
Δ 00007f78
...
Δ 0009ef74
Δ 000a29ab
...
In our example, our object requires a source which, in turn, is a patch requiring another source, and so on (you can see this chain in the tree). The length of this patch chain corresponds to the depth of our object. There is therefore a succession of patches between objects. What Carton tries to do is find the best patch from a window of candidates and keep the best ones for reuse. If we unroll this chain of patches, we find a "base" object (at 0x000026ab) that is simply compressed with zlib, and this base is itself the content of another GitHub notification email. This shows that Carton does find the best candidate for the patch: similar content, in this case another GitHub notification.
The idea is to sacrifice a little computing time (rebuilding objects via their patches) to gain in compression ratio. It's fair to say that a very long patch chain can degrade performance; this is why Git and Carton limit a chain to at most 50 patches. Another point is the search for the candidate source of a patch, which is often physically close to it (within a few bytes): reading the PACK file by page (thanks to Cachet) sometimes gives access to 3 or 4 objects at once, which have a good chance of being patched together.
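The chain of patches can be modelled very simply: an object is either a base (only zlib-compressed) or a patch applied to a source object. Rebuilding walks the chain down to the base, and the number of steps is exactly the depth reported above. This is an illustrative model, not Carton's API:

```ocaml
(* An object is a base, or a patch over a source object. Rebuilding
   unrolls the chain; its length is the object's depth, capped at 50
   in both Git and Carton. Illustrative model, not Carton's API. *)
type obj =
  | Base of string
  | Patch of obj * (string -> string) (* source, patch application *)

let max_depth = 50

let rec rebuild ?(depth = 0) obj =
  assert (depth <= max_depth);
  match obj with
  | Base data -> data
  | Patch (source, apply) -> apply (rebuild ~depth:(depth + 1) source)
```

This makes the trade-off visible: a deeper chain means more patch applications per extraction, which is why the depth limit exists.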
Let's take the example of Carton and a Git object:
$ carton get pack-*.idx eaafd737886011ebc28e6208e03767860c22e77d
...
cache misses: 62
cache hits: 758
tree: 160720bb
Δ 160ae4bc
Δ 160ae506
Δ 160ae575
Δ 160ae5be
Δ 160ae5fc
Δ 160ae62f
Δ 160ae667
Δ 160ae6a5
Δ 160ae6db
Δ 160ae72a
Δ 160ae766
Δ 160ae799
Δ 160ae81e
Δ 160ae858
Δ 16289943
We can see here that we had to load 62 pages, but that we reused already-read pages 758 times. We can also see that the offsets of the patches (visible in the tree) are always close: the objects often follow each other.
Mbox and real emails
In a way, the concrete cases used here are my own emails. There may be a fairly simple bias: all these emails have the same destination, romain.calascibetta@gmail.com. This common point can have a significant impact on compression with duff. We will therefore try another corpus, the archives of certain mailing lists related to OCaml: lists.ocaml.org-archive
$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
-o opam-devel.pack
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
> opam-devel.mbox.gzip
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
3.9M opam-devel.pack
2.0M opam-devel.mbox.gzip
10M lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
The compression ratio is a bit worse than before, but we're still onto something interesting. Here again we can take an object from our PACK file and see how the compression between objects behaves:
$ blaze.pack index opam-devel.pack
...
09bbd28303c8aafafd996b56f9c071a3add7bd92 b 362504 271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
--without-metadata --raw | wc -c
2098
Once again, we see a ratio of 10! Here the object corresponds to the header of an email, compressed against other email headers. This is the situation where fields are common to all your emails (From, Subject, etc.). Carton successfully patches headers together and email contents together.
Next steps
All the work done on email archiving is aimed at producing a unikernel (void) that could archive all incoming emails. This unikernel could then send the archive back (via an email!) to whoever wants it. This is one of our goals in implementing a mailing list in OCaml with unikernels.
Another objective is to create a database system for emails in order to offer two features that we consider important to the user:
- quick and easy access to emails
- disk space savings through compression
With this system, we can extend the method of indexing emails with other information such as the keywords found in the emails. This will enable us, among other things, to create an email search engine!
Conclusion
This milestone in our PTT project took quite a long time, as we were very interested in metrics such as compression ratio and execution speed.
The experience we'd gained with emails (in particular with Mr. MIME) enabled us to move a little faster, especially in terms of serializing our emails. Our experience with ocaml-git also enabled us to identify the benefits of the PACK file for emails.
But the development of Miou was particularly helpful with regard to program execution time, thanks to the ability to parallelize certain computations quite easily.
The format is still a little rough and not quite ready for the development of a keyword-based email indexing system, but it provides a good basis for the rest of our project.
So, if you like what we're doing and want to help, you can make a donation via GitHub or using our IBAN.
Postscript
This little note is an extension of the feedback we got on the Carton release. In this case, nojb pointed to the public-inbox software, the archiver of the various Linux kernel mailing lists. It is based on the same intuition we had, namely using the PACK format to archive emails.
The question then arises: are we reinventing the wheel?
In truth, the devil is in the details. As it happens, you can download the LKML mailing list archives with Git in this way:
$ git clone --mirror http://lore.kernel.org/lkml/15 lkml/git/15.git
$ cd lkml/git/15.git
$ du -sh objects/pack/pack-*.pack
981M objects/pack/pack-*.pack
$ cd objects/pack/
$ mkdir loose
$ carton explode 'loose/%s/%s' pack-*.pack
$ du -sh loose/c/
2.7G loose/c
public-inbox is based not only on the PACK format for email archiving, but also on Git concepts. Such a Git repository actually contains only a single file, m, corresponding to the last email received on the mailing list. The other emails are "old versions of this email": public-inbox considers a certain form of versioning between emails, where each commit is a new email that "replaces" the previous one.
Heuristics to patch
public-inbox then relies on the heuristics implemented by Git to find the best candidate for patching emails. These heuristics are explained here. The idea is to take as the base object (the source of several patches) the last version of your file (in the case of public-inbox, the last email received) and to build the previous versions as patches against this base object. The heuristic comes from the observation that software files tend to grow over time: the latest version is therefore the most likely to contain all the similarities with previous versions.
Once again, when it comes to code, we tend to add code. So we should be able to use all the occurrences available in the latest version of a file to produce patches for earlier versions.
Comparison
Let's have some fun comparing public-inbox and the blaze tool:
+-----------+-------+--------------+------+
|           | blaze | public-inbox | raw  |
+-----------+-------+--------------+------+
| caml-list | 160M  | 154M         | 425M |
+-----------+-------+--------------+------+
| lkml.15   | 1.1G  | 981M         | 2.7G |
+-----------+-------+--------------+------+
| kvm.0     | 1.2G  | 1.1G         | 3.1G |
+-----------+-------+--------------+------+
The first thing you'll notice is that blaze produces PACK files a little larger than those produced by Git. The thing is, blaze doesn't store exactly the same bytes! The emails it stores have lines ending in \r\n, whereas public-inbox stores emails with \n. It may be just one small character, but multiplied by the number of lines in an email and the number of emails in the archive, it adds up.
It's also true that decompress, the OCaml implementation of zlib, is not as efficient as its C competitor in terms of ratio. So this is a disadvantage we have which is not linked to the way we generate the PACK file (we could replace decompress with zlib!).
However, there's another interesting metric between what we produce and what public-inbox does. It's important to understand that we maintain "some compatibility" with the Git PACK file: the objects aren't the same and don't have the same meaning, but it's still a PACK file. As such, we can use git verify-pack on our archive as well as on the public-inbox archive:
+-----------+-----------------+------------------------+
|           | PACK from blaze | PACK from public-inbox |
+-----------+-----------------+------------------------+
| caml-list | ~2.5s           | ~4.1s                  |
+-----------+-----------------+------------------------+
| lkml.15   | ~14.7s          | ~16.3s                 |
+-----------+-----------------+------------------------+
| kvm.0     | ~18s            | ~21s                   |
+-----------+-----------------+------------------------+
The analysis of our PACK files is faster than that of public-inbox's. This is where we need to understand what we're trying to store and how. When it comes to finding a candidate for a patch, blaze relies solely on the similarities between two objects/emails, whereas public-inbox, via Git's heuristics, will still prioritize a patch between emails that follow each other in time, via "versioning".
The implication is that the last 2 emails may have no similarity at all, but Git/public-inbox will still try to patch them together, as one is the previous version (in time) of the other.
Another aspect is that Git sometimes breaks the patch chain so that, when extracting an object that is a patch, its source isn't very far away in the PACK file. Git prefers to patch an object against a source that may be less good but close to it, rather than keeping the best candidate as the source of all patches. Here too, blaze behaves differently: we try to keep and reuse the best candidate as much as possible.
A final difference, which may also be important, is the way emails are stored. As mentioned above, our emails are stored "split", whereas public-inbox stores them as they are. One implication concerns the extraction of an attachment. With blaze, we just have to extract the skeleton of the email, search the various headers for the desired attachment, and extract it as is, since it is a full-fledged object in our PACK file. With public-inbox, we'd have to extract the email, parse it, search for the part containing the attachment according to its headers, and finally extract the attachment.
Conclusion
If we had to draw a "meta" conclusion from the differences between blaze and public-inbox, it's that our tool focuses on the content of your emails, whereas public-inbox focuses on their history. As such, and in the hope of making an OCaml-based email client, we believe our approach remains interesting.
But these experiments have shown us 2 important things:
- we're capable of handling millions of emails, parsing and storing them. It's pretty impressive to see our tool handle almost a million emails (kvm.0) without any bugs!
- our initial intuition remains valid. Even if the path seems subtly different from what public-inbox does, our approach is clearly the right one and keeps us going.