MirageVPN & tls-crypt-v2

2023-11-14

In 2019 Robur started working on a OpenVPN™-compatible implementation in OCaml. The project was funded for 6 months in 2019 by prototypefund. In late 2022 we applied again for funding this time to the NGI Assure open call, and our application was eventually accepted. In this blog post I will explain why reimplementing the OpenVPN™ protocol in OCaml is a worthwhile effort, and describe the Miragevpn implementation and in particular the tls-crypt-v2 mechanism.

What even is OpenVPN™?

OpenVPN™ is a protocol and software implementation to provide virtual private networks: computer networks that do not exist in hardware and are encrypted and tunnelled through existing networks. Common use cases for this is to provide access to internal networks for remote workers, and for routing internet traffic through another machine for various reasons e.g. when using untrusted wifi, privacy from a snooping ISP, circumventing geoblock etc.

It is a protocol that has been worked on and evolved over the decades. OpenVPN™ has a number of modes of operations as well as a number of options in the order of hundreds. The modes can be categorized into two main categories: static mode and TLS mode. The former mode uses static symmetric keys, and will be removed in the upcoming OpenVPN™ 2.7 (community edition). I will not focus on static mode in this post. The latter uses separate data & control channels where the control channel uses TLS - more on that later.

Why reimplement it? And why in OCaml?

Before diving into TLS mode and eventually tls-crypt-v2 it's worth to briefly discuss why we spend time reimplementing the OpenVPN™ protocol. You may ask yourself: why not just use the existing tried and tested implementation?

OpenVPN™ community edition is implemented in the C programming language. It heavily uses the OpenSSL library^[1] which is as well written in C and has in the past had some notable security vulnerabilities. Many vulnerabilities and bugs in C can be easily avoided in other languages due to bounds checking and stricter and more expressive type systems. The state machine of the protocol can be more easily be expressed in OCaml, and some properties of the protocol can be encoded in the type system.

Another reason is Mirage OS, a library operating system implemented in OCaml. We work on the Mirage project and write applications (unikernels) using Mirage. In many cases it would be desirable to be able to connect to an existing VPN network^[2], or be able to offer a VPN network to clients using OpenVPN™.

Consider a VPN provider: The VPN provider runs many machines that run an operating system in order to run the user-space OpenVPN™ service. There are no real users on the system, and a lot of unrelated processes and legacy layers are around that are not needed. With a Mirage OS unikernel, which is basically a statically linked binary and operating system such a setup becomes simpler with fewer layers. With reproducible builds deployment and updates will be straightforward.

Another very interesting example is a unikernel for Qubes OS that we have planned. Qubes OS is an operating system with a high focus on security. It offers an almost seamless experience of running applications in different virtual machines on the same machine. The networking provided to a application (virtual machine) can be restricted to only go through the VPN. It is possible to use OpenVPN™ for such a setup, but that requires running OpenVPN™ in a full Linux virtual machine. With Mirage OS the resource footprint is typically much smaller than an equivalent application running in a Linux virtual machine; often the memory footprint is smaller by an order.

Finally, while it's not an explicit goal of ours, reimplementing a protocol without an explicit specification can help uncover bugs and things that need better documentation in the original implementation.

TLS mode

There are different variants of TLS mode, but what they share is separate "control" channel and "data" channel. The control channel is used to do a TLS handshake, and with the established TLS session data channel encryption keys, username/password authentication, etc. is negotiated. Once this dance has been performed and data channel encryption keys have been negotiated the peers can exchange IP packets over the data channel.

Over the years a number of mechanisms has been implemented to protect the TLS stack from being exposed to third parties, protect against denial of service attacks and to hide information exchanged during a TLS handshake such as certificates (which was an isue before TLS 1.3). These are known as tls-auth, tls-crypt and tls-crypt-v2. The tls-auth mechanism adds a pre-shared key for hmac authentication on the control channel. This makes it possible for an OpenVPN™ server to reject early clients that don't know the shared key before any TLS handshakes are performed. In tls-crypt the control channel is encrypted as well as hmac authenticated using a pre-shared key. Common to both is that the pre-shared key is shared between the server and all clients. For large deployments this significantly reduces the usefulness - the key is more likely to be leaked the greate the number of clients who share this key.

tls-crypt-v2

To improve on tls-crypt, tls-crypt-v2 uses one pre-shared key per client. This could be a lot of keys for the server to keep track of, so instead of storing all the client keys on the server the server has a special tls-crypt-v2 server key that is used to wrap the client keys. That is, each client has their own client key as well as the client key wrapped using the server key. The protocol is then extended so the client in the first message appends the wrapped key unencrypted. The server can then decrypt and verify the client key and decrypt the rest of the packet. Then the client and server use the client key just as in tls-crypt.

This is great! Each client can have their own key, and the server doesn't need to keep a potentially large database of client keys. What if the client's key is leaked? A detail I didn't mention is that the wrapped key contains metadata. By default this is the current timestamp, but it is possible on creation to put any (relative short) binary data in there as the metadata. The server can then be configured to check the metadata by calling a script.

An issue exists that an initial packet takes up resources on the server because the server needs to

decrypt the wrapped key, and
keep the unwrapped key and other data in memory while waiting for the handshake to complete.

This can be abused in an attack very similar to a TCP SYN flood. Without tls-crypt-v2 OpenVPN uses a specially crafted session ID (a 64 bit identifier) to avoid this issue similar to SYN cookies. To address this in OpenVPN 2.6 the protocol for tls-crypt-v2 was extended yet further with a 'HMAC cookie' mechanism. The client sends the same packet as before, but uses a sequence number 0x0f000001 instead of 1 to signal support of this mechanism. The server responds in a similar manner with a sequence number of 0x0f000001 and the packet is appended with a tag-length-value encoded list of flags. At the moment only one tag and one value is defined which signifies the server supports HMAC cookies - this seems unnecessarily complex, but is done to allow future extensibility. Finally, if the server supports HMAC cookies, the client sends a packet where the wrapped key is appended in cleartext. The server is now able to decrypt the third packet without having to keep the key from the first packet around and can verify the session id.

Cool! Let's deploy it!

Great! We build on a daily basis unikernels in our reproducible builds setup. At the time of writing we have published a Miragevpn router unikernel acting as a client. For general instructions on running Mirage unikernels see our reproducible builds blog post. The unikernel will need a block device containing the OpenVPN™ configuration and a network device. More detailed instructions Will Follow Soon™! Don't hesitate to reach out to us on GitHub, by mail or me personally on Mastodon if you're stuck.

It is possible to compile OpenVPN™ community edition with Mbed TLS instead of OpenSSL which is written in C as well.
↩︎︎
I use the term "VPN network" to mean the virtual private network itself. It is a bit odd because the 'N' in 'VPN' is 'Network', but without disambiguation 'VPN' could refer to the network itself, the software or the service.
↩︎︎