blog.robur.coop

The Robur cooperative blog.

GPTar

At Robur we developed a piece of software for mirroring or exposing an opam repository. We have it deployed at opam.robur.coop, and you can use it as an alternative to opam.ocaml.org. It is usually more up to date with the git opam-repository than opam.ocaml.org, although in the past it suffered from occasional availability issues. I can recommend reading Hannes' post about opam-mirror. This article is about adding a partition table to the disk used by opam-mirror. For background, see the previously linked subsection of the opam-mirror article.

The opam-mirror persistent storage scheme

Opam-mirror uses a single block device for its persistent storage. On the block device it stores cached source code archives from the opam repository. These are stored in a tar archive consisting of files whose file name is the sha256 checksum of the file contents. Furthermore, at the end of the block device some space is allocated for dumping the cloned git state of the upstream (git) opam repository as well as caches storing maps from md5 and sha512 checksums respectively to the sha256 checksums. The partitioning scheme is entirely decided by command line arguments. In other words, there is no partition table on the disk image.
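To make the naming scheme concrete, here is a minimal Python sketch. It is not opam-mirror's actual OCaml code. It stores a file in a tar archive under the sha256 of its contents and computes the md5 and sha512 digests that would serve as lookup keys. The names `add_by_sha256`, `alias_entries` and `demo` are made up for illustration.

```python
import hashlib
import io
import tarfile


def add_by_sha256(archive, data):
    """Append data to an open tar archive, named by its sha256 hex digest."""
    name = hashlib.sha256(data).hexdigest()
    info = tarfile.TarInfo(name)
    info.size = len(data)
    archive.addfile(info, io.BytesIO(data))
    return name


def alias_entries(data):
    """Return (md5, sha512, sha256) hex digests; the first two act as lookup keys."""
    return (
        hashlib.md5(data).hexdigest(),
        hashlib.sha512(data).hexdigest(),
        hashlib.sha256(data).hexdigest(),
    )


def demo():
    """Write one file into an in-memory archive and list the archive again."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:") as archive:
        name = add_by_sha256(archive, b"Hello, World!\n")
    with tarfile.open(fileobj=io.BytesIO(buffer.getvalue()), mode="r:") as archive:
        return name, archive.getnames()
```

Because the file name is derived from the contents, a reader can verify any entry by hashing it again and comparing the result with the name.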

This scheme has the nice property that the contents of the tar archive can be inspected by regular tar utilities in the host system. Due to the append-only nature of tar, and in the presence of concurrent downloads, a file written to the archive may be partial or corrupt. Opam-mirror handles this by prefixing the file name with a pending/ directory for partial downloads and a to-delete/ directory for corrupt downloads. If there are no files after the failed download in the tar archive, the file can be removed without any issues. Otherwise a delete would involve moving all subsequent files in the archive, which is too error prone to do robustly. So using the tar utilities in the host we can inspect how much garbage has accumulated in the tar file system.
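As an illustration of that kind of inspection, the following Python sketch counts the entries under pending/ and to-delete/ and the bytes they occupy. It works on an in-memory archive; `garbage_summary` and `build_demo_archive` are hypothetical helper names, and the demo entries are invented.

```python
import io
import tarfile

GARBAGE_PREFIXES = ("pending/", "to-delete/")


def garbage_summary(tar_bytes):
    """Return (entry count, payload bytes) for partial or corrupt downloads."""
    count = 0
    size = 0
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive:
            if member.name.startswith(GARBAGE_PREFIXES):
                count += 1
                size += member.size
    return count, size


def build_demo_archive():
    """Build a small archive with one good entry and two garbage entries."""
    entries = [
        ("aaaa", b"complete"),
        ("pending/bbbb", b"part"),
        ("to-delete/cccc", b"corrupt!"),
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:") as archive:
        for name, data in entries:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

The reported size counts file contents only. Each entry also costs a 512-byte header plus padding up to the next 512-byte boundary.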

The big downside to this scheme is that, since the disk partitioning is not stored on the disk, the contents can easily become corrupt if the wrong offsets are passed on the command line. Therefore I have for a long time wanted to use an on-disk partition table. The problem is that both MBR and GPT (GUID Partition Table) store the table at the beginning of the disk. If we write a partition table at the beginning, the disk is suddenly not a valid tar archive anymore. Of course, in Mirage we can just write and read the table at the end if we please, but then we lose the ability to inspect the partition table in the host system.

GPT header as tar file name

My first approach, which turned out to be a dead end, began when I realized that a GPT header consists of 92 bytes at the beginning followed by reserved space for the remainder of the LBA. The reserved space should be all zeroes, but it seems no one bothers to enforce this. What's more, a tar header starts with the file name in the first 100 bytes. This got me thinking we could embed a GPT header inside a tar header by using the GPT header as the tar header file name!
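The size argument can be checked with a short Python sketch that packs the GPT header fields and places them where a tar header keeps its file name. The field order follows the GPT header layout; the LBA values and entry counts are illustrative assumptions rather than what the author's tool writes, and both CRC fields are left at zero here.

```python
import struct
import uuid

# Little-endian GPT header fields; the packed size is 92 bytes.
GPT_HEADER = struct.Struct("<8sIIIIQQQQ16sQIII")
TAR_NAME_FIELD = 100  # the tar file name occupies bytes 0..99
BLOCK_SIZE = 512


def pack_gpt_header(disk_sectors, disk_guid):
    """Pack a GPT header with both CRC fields left at zero."""
    return GPT_HEADER.pack(
        b"EFI PART",         # signature
        0x00010000,          # revision 1.0
        GPT_HEADER.size,     # header size (92)
        0,                   # header CRC32, filled in later
        0,                   # reserved
        1,                   # current LBA (illustrative)
        disk_sectors - 1,    # backup LBA (illustrative)
        34,                  # first usable LBA (illustrative)
        disk_sectors - 34,   # last usable LBA (illustrative)
        disk_guid.bytes_le,  # GUIDs are stored mixed-endian
        2,                   # LBA of the partition entries (illustrative)
        128,                 # number of partition entries (illustrative)
        128,                 # size of one partition entry
        0,                   # CRC32 of the partition entries, filled in later
    )


def as_tar_name_block(header):
    """Place the header at the start of an otherwise empty 512-byte block."""
    assert len(header) <= TAR_NAME_FIELD
    return header.ljust(BLOCK_SIZE, b"\x00")
```

With 92 bytes of header and a 100-byte name field, 8 bytes of the name field remain unused before the next tar field begins.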

I started working on implementing this, but I quickly realized that 1) the tar header has a checksum, and 2) the GPT header has a checksum as well. Having two checksums that cover each other is tricky: updating one checksum affects the other. So I started reading a paper written by Martin Stigge et al. about reversing CRC, as the GPT header uses a CRC32 checksum. I ended up writing something that I knew was incorrect.

Next, I realized the GPT header's checksum only covers the first 92 bytes. That is, the reserved space is not checksummed! I find this odd about GPT, along with the fact that the reserved space should be all zeroes but no one checks. This simplified things a lot as we don't have to reverse any checksums! Then I implemented a test binary that produces a half megabyte disk image with a hybrid GPT and tar header followed by a tar archive with a file test.txt whose content is Hello, World!. I had used the byte G as the link indicator. In POSIX.1-1988 the link indicators A-Z are reserved for vendor specific extensions, and it seemed G was unused. A mistake I made was to not update the tar header checksum: the ocaml-tar library doesn't support this link indicator value, so I had manually updated the byte value in the serialized header but forgot to update the checksum. This was easily remedied as the checksum is a simple sum of the bytes in the header. The changes made are viewable on GitHub. I also had to work around a bug in ocaml-tar. GNU tar was successfully able to list the archive. A quirk is that the archive will start with a dummy file GPTAR, which consists of any remaining space in the first LBA if the sector size is greater than 512 bytes, followed by the partition table.
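The two checksums can be sketched in Python as follows. This is an illustration using the standard library, not the ocaml-tar or ocaml-gpt code; the helper names are hypothetical. The tar checksum is the sum of all 512 header bytes with the checksum field (bytes 148 to 155) counted as spaces. The GPT checksum is a CRC32 over the first 92 bytes with its own field (bytes 16 to 19) zeroed.

```python
import tarfile
import zlib

GPT_HEADER_SIZE = 92


def tar_checksum(block):
    """Sum of all 512 header bytes, with the checksum field counted as spaces."""
    return sum(block[:148]) + 8 * ord(" ") + sum(block[156:512])


def gpt_header_crc(block):
    """CRC32 of the first 92 bytes, with the CRC field itself zeroed."""
    head = bytearray(block[:GPT_HEADER_SIZE])
    head[16:20] = bytes(4)
    return zlib.crc32(bytes(head))


def set_typeflag(block, flag):
    """Change the link indicator (byte 156) and refresh the tar checksum."""
    patched = bytearray(block)
    patched[156:157] = flag
    checksum = tar_checksum(bytes(patched))
    # Six octal digits, a NUL and a space fill the 8-byte checksum field.
    patched[148:156] = ("%06o\0 " % checksum).encode("ascii")
    return bytes(patched)


def demo():
    """Build a header with Python's tarfile, then patch in the G indicator."""
    info = tarfile.TarInfo("GPTAR")
    info.size = 16896
    header = info.tobuf(format=tarfile.USTAR_FORMAT)
    return header, set_typeflag(header, b"G")
```

Both the link indicator and the tar checksum live beyond byte 92, so changing them leaves the GPT CRC untouched. The GPT CRC field sits inside the first 100 bytes, so it has to be written first and the tar checksum computed last.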

Protective MBR

Unfortunately, neither fdisk nor parted recognized the GPT partition table. I was able to successfully read the partition table using ocaml-gpt however. This puzzled me. Then I got a hunch: I had read about protective MBRs on the Wikipedia page on GPT. I had always thought it was optional and not needed in a new system such as Mirage that doesn't have to care too much about legacy code and operating systems.

So I started comparing the layout of MBR and tar. The V7 tar format only uses the first 257 bytes of the 512 byte block. The V7 format is differentiated from the UStar, POSIX/pax and old GNU tar formats by not having the string ustar at byte offset 257[1]. The master boot record format starts with the bootstrap code area. In the classic format it is the first 446 bytes. In the modern standard MBR format the first 446 bytes are mostly bootstrap code too, with the exception of a handful of bytes around offset 218 which are used for a timestamp. This section overlaps with the tar V7 linked file name field. In both formats these bytes can be zero without any issues, thankfully.
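The format check can be sketched in Python like this. As the footnote says, the real distinction has more nuances; `tar_flavour` is a hypothetical helper that only looks at the magic bytes at offset 257.

```python
import tarfile

MAGIC_OFFSET = 257         # V7 headers only use bytes 0..256
MBR_PARTITION_TABLE = 446  # the bootstrap code area ends here
BLOCK_SIZE = 512


def tar_flavour(block):
    """Classify a 512-byte tar header by the magic at offset 257."""
    magic = block[MAGIC_OFFSET:MAGIC_OFFSET + 8]
    if magic == b"ustar\x0000":
        return "ustar/pax"
    if magic == b"ustar  \x00":
        return "old GNU"
    return "v7"


def demo():
    """Classify two headers produced by tarfile and one bare V7-style block."""
    info = tarfile.TarInfo("test.txt")
    ustar = info.tobuf(format=tarfile.USTAR_FORMAT)
    gnu = info.tobuf(format=tarfile.GNU_FORMAT)
    bare = b"GPTAR".ljust(BLOCK_SIZE, b"\x00")
    return tar_flavour(ustar), tar_flavour(gnu), tar_flavour(bare)
```

Since a V7 header ends before byte 257 and the MBR partition table starts at byte 446, the two never claim the same bytes.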

This is great! This means we can put a tar header in the bootstrap code area of the MBR and have it be a valid tar header and MBR at the same time. The protective MBR has one partition of type 0xEE whose LBA starts at sector 1, and the number of LBAs should cover the whole disk or be 0xFFFFFFFF (the maximum value of an unsigned 32-bit integer). In practice this means we can get away with only touching byte offsets 446-453 and 510-511 for the protective MBR. The MBR does not have a checksum, which also makes things easier. Using this I could create a disk image that parted and fdisk recognized as a GPT partitioned disk, with the caveat that they both reported that the backup GPT header was corrupt. I had just copied the primary GPT header to the end of the disk. It turns out that the alternate, or backup, GPT header should have the current LBA and backup LBA fields swapped (and the header crc32 recomputed). I updated the ocaml-gpt code so that it can marshal alternate GPT headers.
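Both steps can be sketched in Python. This is an illustration under stated assumptions, not the ocaml-gpt implementation: the function names are hypothetical, and the sketch writes the whole 16-byte partition entry (bytes 446 to 461) per the standard MBR entry layout, with a boot indicator of zero.

```python
import struct
import zlib

BLOCK_SIZE = 512
GPT_HEADER_SIZE = 92


def add_protective_mbr(block, disk_sectors):
    """Overlay one 0xEE partition entry and the boot signature on a block."""
    assert len(block) == BLOCK_SIZE
    size = min(disk_sectors - 1, 0xFFFFFFFF)
    entry = struct.pack(
        "<B3sB3sII",
        0x00,             # boot indicator
        b"\x00\x02\x00",  # starting CHS
        0xEE,             # partition type: GPT protective
        b"\xff\xff\xff",  # ending CHS
        1,                # starting LBA
        size,             # number of LBAs
    )
    out = bytearray(block)
    out[446:462] = entry
    out[510:512] = b"\x55\xaa"
    return bytes(out)


def backup_header(primary):
    """Swap current and backup LBA, then recompute the header CRC32."""
    header = bytearray(primary[:GPT_HEADER_SIZE])
    current, backup = struct.unpack_from("<QQ", header, 24)
    struct.pack_into("<QQ", header, 24, backup, current)
    struct.pack_into("<I", header, 16, 0)  # the CRC is computed over a zeroed field
    struct.pack_into("<I", header, 16, zlib.crc32(bytes(header)))
    return bytes(header)
```

The sketch performs only the two changes to the backup header named above. A complete implementation may also need to point the partition entries LBA at the backup copy of the table, which is outside the scope of this example.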

Finally we can produce GPT partitioned disks that can be inspected with tar utilities!

$ /usr/sbin/parted disk.img print
WARNING: You are not superuser.  Watch out for permissions.
Model:  (file)
Disk /home/reynir/workspace/gptar/disk.img: 524kB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: pmbr_boot

Number  Start   End     Size   File system  Name              Flags
 1      17.4kB  19.5kB  2048B               Real tar archive  hidden

$ tar -tvf disk.img
?r-------- 0/0           16896 1970-01-01 01:00 GPTAR unknown file type ‘G’
-r-------- 0/0              14 1970-01-01 01:00 test.txt

The code is freely available on GitHub.

Future work

One thing that bothers me a bit is the dummy file GPTAR. Because of the G link indicator, GNU tar will print a warning about the unknown file type G, but it will still extract the dummy file when extracting the archive. I have been thinking about what tar header I could put in the MBR so that tar utilities skip the partition table but do not try to extract the dummy file. Ideas I've had are to:

If you have other ideas about what I could do, please reach out!

  1. This is somewhat simplified. There are some more nuances between the different formats, but for this purpose they don't matter much.
