Pushing the opam-repository into a sustainable repository
2025-03-26
TL;DR: While growing numbers are great for marketing, in practice, keeping package versions that are not maintained, have been superseded, and are unlikely to be selected puts a burden (in terms of network bandwidth and CPU consumption) on all opam clients (and CI systems, automated documentation builds, ...).
Opam and the opam-repository
Opam is the OCaml package management system. Since its inception - 1.0 was released in 2013 - it has been an amazing community service that engages lots of people from all over the world to contribute to thousands of OCaml libraries.
Opam is mainly a solver for package dependencies: it specifies a file format for the package metadata, and comes with a default repository of packages which is community-maintained. Since it is a git repository, publishing a package is done by a "pull request". Nowadays, when opening such a pull request, a lot of builds are performed: the new package is tested on various OCaml versions, different platforms, and different configurations. Additionally, all reverse dependencies are checked to be compatible with the new release. This takes quite a lot of time, especially for packages that are widely used (such as cmdliner, dune, ...). The advantage is that, very likely, the new package release will be a smooth upgrade on all clients. Note that in the opam-repository the dependencies usually do not specify upper bounds, but they are added lazily when needed (i.e. when a new release of a dependency breaks the package).
The not-really documented policy in the opam-repository was to never remove a package, or even a version, but to only ever grow the repository. In the early years - let's say from 2013 until 2019 - the opam-repository contained 10,000 packages (around 2,000 unique packages - so each package had on average 5 releases). This was fine - even for small computers such as a Raspberry Pi with limited CPU and memory. But as OCaml and opam became more popular, the number of packages increased to 20,000 between 2019 and autumn 2021. Each package is a separate file (text-based, no database involved), and the opam utility didn't scale linearly: even basic operations took more than twice as long, and they required quite a lot of memory. By spring 2024 the number of packages had grown to 30,000, peaking at 33,000 in early 2025.
How can we move forward, and prevent clients from needing more and more CPU time? The CI system building the opam repository index easily takes 3 hours, and smaller computers are no longer able to use the opam-repository due to the excessive memory consumption.
Additionally, opam is packaged in lots of distributions - and opam is a great starting tool for OCaml developers. So upgrading opam (and the opam-repository) to use a different format (binary data) or a database would take a lot of time (years in the release cycle of Ubuntu or Debian), during which both the old and the new thing need to be maintained.
After a lot of discussions (started in spring 2023 in this opam-repository issue, continued e.g. on discuss) and several video-meetings with up to 25 participants - we concluded: (I) we need to have a backwards-compatible solution, (II) we will never remove a package that is still required by some other package (so we won't make packages uninstallable), (III) we rely on package authors to state their intention of maintenance of their packages (i.e. whether they maintain the latest version, or all major versions, ...), and (IV) we won't remove anything, but create a separate git repository, which is an archive for packages that are no longer in the opam-repository.
The backwards-compatible solution means we do not require any changes to the opam utility, and do not have to wait for an opam release. Our solution - to archive opam packages into a separate repository - works out of the box with the currently published opam versions. Of course, moving opam to a database and/or to a different format can be done at the same time - it is just not in the scope of this project, nor of this article.
Policy work
Over 2024, a lot of work has been put into the archiving policy and the archiving plan. It was announced in December 2024. To speed up the process, the OCaml Software Foundation put in a grant (10 PD, 4500 €) to pay us for communication, pushing the plan into reality, and doing the necessary software development.
The reality of archiving
The plan included three phases: (1) create the archive repository and move all unavailable packages (marked with available: false) there, (2) archive all packages that use OCaml < 4.08, and finally (3) archive packages that are marked as not intended to be maintained and are not required by any other still maintained package.
We set an ambitious plan: do phase 1 by January 1st 2025, phase 2 by February 1st 2025, and phase 3 by March 1st 2025.
Phase 1
For phase 1 we developed archive-opam, which serves multiple purposes: detecting the packages marked as unavailable, finding the packages that are no longer installable, and adding metadata to the opam files that records when (at which git commit of the opam-repository) and why a package was archived. Note that this tool uses opam as a library, but doesn't invoke the solver at all. It uses the utilities for reading the opam files and figuring out the dependencies, and then checks whether all dependency formulas are satisfied (see the implementation).
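To illustrate the idea of checking dependency formulas without a solver, here is a minimal sketch in Python. It is not the archive-opam implementation (which is written in OCaml on top of the opam libraries). It assumes a heavily simplified model in which every dependency is a package name plus a version predicate, all dependencies are conjunctive, and conflicts are ignored. The names `find_uninstallable`, `repo` and `unavailable` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Version = Tuple[int, ...]
PackageId = Tuple[str, Version]
# Simplified dependency: a package name plus a predicate over candidate versions.
Dependency = Tuple[str, Callable[[Version], bool]]


def find_uninstallable(
    repo: Dict[PackageId, List[Dependency]],
    unavailable: Set[PackageId],
) -> Set[PackageId]:
    """Return every package that is unavailable or has an unsatisfiable dependency."""
    dead = set(unavailable)
    changed = True
    # Repeat until a fixed point: removing one package may break others.
    while changed:
        changed = False
        for pkg, deps in repo.items():
            if pkg in dead:
                continue
            for dep_name, accepts in deps:
                satisfiable = any(
                    name == dep_name and accepts(version)
                    for (name, version) in repo
                    if (name, version) not in dead
                )
                if not satisfiable:
                    dead.add(pkg)
                    changed = True
                    break
    return dead


# Toy repository: a.1.0 is marked unavailable, b needs a < 2.0, c needs any b,
# d needs a >= 2.0.
repo = {
    ("a", (1, 0)): [],
    ("a", (2, 0)): [],
    ("b", (1, 0)): [("a", lambda v: v < (2, 0))],
    ("c", (1, 0)): [("b", lambda v: True)],
    ("d", (1, 0)): [("a", lambda v: v >= (2, 0))],
}
archived = find_uninstallable(repo, unavailable={("a", (1, 0))})
```

In this toy repository, archiving a.1.0 makes b.1.0 uninstallable, which in turn makes c.1.0 uninstallable, while d.1.0 stays installable through a.2.0.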
We finished just in time, and announced the list of packages to be archived by December 15th, and archived them on January 1st. This included 4170 packages, roughly 12.7% of the opam-repository.
Please note that one package (qinap) has already been brought back to the main opam-repository, since its source tarball has been recovered.
Phase 2
We extended the archive-opam utility to spot OCaml bounds, and again looked for packages that are then no longer installable. We announced the archival of 5855 packages on January 15th, and archived them on February 1st. We looked for "ocaml" {< "4.08"}, and only later (on February 1st) discovered that "ocaml" {< "4.08.0"} was more precise, and found another 915 packages that we archived on February 15th.
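The difference between the two spellings of the bound can be sketched as follows. This is only an illustration in Python, not how archive-opam detects bounds (it uses opam as a library). The regular expression, the function names, and the simplified version comparison (numeric components, trailing zeros ignored) are assumptions made for this example. In particular, this is not opam's real version ordering, and only strict `<` bounds are handled.

```python
import re
from typing import Optional, Tuple

# Matches a strict upper bound on "ocaml", e.g. "ocaml" {< "4.08"} or
# "ocaml" {>= "4.03.0" & < "4.08.0"}. Bounds written with <= are not matched.
OCAML_UPPER_BOUND = re.compile(r'"ocaml"\s*\{[^}]*?<\s*"([0-9.]+)"[^}]*\}')


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "4.08.0" into (4, 8); trailing zeros are dropped (a simplification)."""
    parts = tuple(int(p) for p in text.split("."))
    while len(parts) > 1 and parts[-1] == 0:
        parts = parts[:-1]
    return parts


def ocaml_strict_upper_bound(depends: str) -> Optional[Tuple[int, ...]]:
    """Return the strict upper bound on ocaml found in a depends field, if any."""
    match = OCAML_UPPER_BOUND.search(depends)
    return parse_version(match.group(1)) if match else None


def requires_ocaml_before(depends: str, limit: str = "4.08") -> bool:
    """True if the depends field only allows OCaml versions older than `limit`."""
    bound = ocaml_strict_upper_bound(depends)
    return bound is not None and bound <= parse_version(limit)


short_form = '"ocaml" {< "4.08"}'
long_form = '"ocaml" {>= "4.03.0" & < "4.08.0"}'

# A literal text search for the short form misses the long form ...
literal_hit = short_form in long_form
# ... whereas comparing the parsed bound catches both spellings.
parsed_hits = (requires_ocaml_before(short_form), requires_ocaml_before(long_form))
```

Comparing the parsed bound instead of the literal text treats both spellings alike, which is the kind of gap that led to the additional 915 packages.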
We brought back three packages (base-num, base-ocamlbuild, and base-bytes) to the main opam-repository.
Phase 3
Phase 3 was scheduled for March, but we had to delay it: the tool was not ready, and the discussions about the exact semantics had not yet settled. We clarified the semantics, and decided that the opam-repository will keep packages so that each package retains installation candidates for each supported OCaml version (from 4.08 on, the latest patch version of each release: 4.08.1, 4.09.1, 4.10.2, 4.11.2, 4.12.1, 4.13.1, 4.14.2, 5.0.0, 5.1.1, 5.2.1, 5.3.0).
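The rule "latest patch version of each release, from 4.08 on" can be expressed in a few lines. The following Python sketch is only an illustration: the function name `latest_patch_per_release` is hypothetical, and the input is a small hand-picked subset of OCaml releases rather than the full release list.

```python
from typing import Dict, Iterable, List, Tuple


def latest_patch_per_release(
    versions: Iterable[str], minimum: Tuple[int, int] = (4, 8)
) -> List[str]:
    """Keep, per major.minor release >= minimum, the highest patch version."""
    best: Dict[Tuple[int, int], Tuple[int, str]] = {}
    for text in versions:
        major, minor, patch = (int(part) for part in text.split("."))
        key = (major, minor)
        if key < minimum:
            continue  # releases before 4.08 are no longer supported
        if key not in best or patch > best[key][0]:
            best[key] = (patch, text)
    return [best[key][1] for key in sorted(best)]


# A subset of OCaml releases, for illustration only.
releases = [
    "4.07.1", "4.08.0", "4.08.1",
    "4.14.0", "4.14.1", "4.14.2",
    "5.2.0", "5.2.1", "5.3.0",
]
supported = latest_patch_per_release(releases)
```

For this subset, `supported` contains 4.08.1, 4.14.2, 5.2.1 and 5.3.0, which agrees with the corresponding entries in the list above.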
We finished the phase 3 tool by March 6th. It invokes the opam-0install solver nearly 100,000 times. It works by first figuring out the maintenance intent (opam field x-maintenance-intent) for each opam package, and if it is (latest), using the solver (with all above mentioned OCaml versions) to produce a set of candidates to be archived. In a second step, all remaining opam packages are scanned for missing dependencies. The third step is to use the vanilla opam-repository and the solver to find, for the packages that have missing dependencies, the packages and versions they need on the supported OCaml versions. Finally, we have the set of candidates to be removed from the first step, and the set of packages to retain from the third step, and compute the set difference.
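The first and the final step can be sketched in Python as follows. This is an illustration under strong assumptions, not the actual phase 3 tool: the solver is not invoked here. Instead, its results are passed in as the precomputed set `selected`, and the outcome of steps two and three is passed in as the set `retain`. All function and variable names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

PackageId = Tuple[str, str]  # (name, version)


def archival_candidates(
    versions: Dict[str, List[str]],
    intent: Dict[str, str],
    selected: Set[PackageId],
) -> Set[PackageId]:
    """Step 1: versions of "(latest)" packages that no solver run selected.

    `selected` stands in for the union of the solver results over all
    supported OCaml versions. Packages without an intent are kept entirely.
    """
    return {
        (name, version)
        for name, package_versions in versions.items()
        if intent.get(name) == "(latest)"
        for version in package_versions
        if (name, version) not in selected
    }


def final_archive_set(
    candidates: Set[PackageId], retain: Set[PackageId]
) -> Set[PackageId]:
    """Final step: candidates minus everything another package still needs."""
    return candidates - retain


versions = {"foo": ["1.0", "2.0", "3.0"], "bar": ["0.9", "1.0"]}
intent = {"foo": "(latest)"}   # bar states no intent, so it is retained
selected = {("foo", "3.0")}    # what the (stubbed) solver picked
retain = {("foo", "2.0")}      # assumed: bar still depends on foo.2.0

candidates = archival_candidates(versions, intent, selected)
to_archive = final_archive_set(candidates, retain)
```

Here foo.1.0 and foo.2.0 are candidates, but only foo.1.0 ends up being archived, because foo.2.0 is still needed by a package that remains in the repository.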
From our run on March 6th (opam-repository at 8707d628f2beb80e7f60b89d60c33bdf2ffd9026), we have 5687 candidates and, in the end, 4735 packages to be archived. The tool easily uses 120GB of memory (likely some memory leak related to the opam-0install library) and took around 30h of CPU time (Intel(R) Xeon(R) CPU E5-2630L v4 @ 1.80GHz). We worked on parallelizing the utility, since there are lots of computations that can be done in parallel. But we ran into some data races - luckily several have been fixed in opam-file-format by using menhir instead of ocamllex/ocamlyacc - but more remain due to the usage of Lazy.t.
Due to some issue in "setup-ocaml" and large diffs when fetching the opam-repository (related to GNU patch and macOS), we have been asked to delay phase 3 until May 2025 when opam 2.4 will be available (which uses an OCaml implementation of patch and does not need many open file descriptors).
Conclusion
We already archived 10940 packages (minus 4 which were brought back), which is around 1/3 of all opam files in the opam-repository. We are looking forward to archiving more packages and making opam a slim and fast utility again.
We are very honoured and happy to have received the small OCSF grant for this project. Without the opam-repository maintainers and the involvement of the larger community, though, we could not have arrived at the policy and the tooling, or pushed them into reality.
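The totals can be checked with a few lines of arithmetic. The per-phase numbers are taken from the sections above; the repository size is not stated exactly in this article, so the sketch estimates it from the phase 1 figure (4170 packages being roughly 12.7%), which is an assumption.

```python
# Numbers reported for each archival round in this article.
phase1 = 4170
phase2 = 5855
phase2_followup = 915
restored = 4  # qinap, base-num, base-ocamlbuild, base-bytes

archived_total = phase1 + phase2 + phase2_followup
still_archived = archived_total - restored

# Estimate of the repository size before phase 1, derived from "roughly 12.7%".
repo_size_estimate = round(phase1 / 0.127)
archived_fraction = archived_total / repo_size_estimate

print(archived_total, still_archived, repo_size_estimate, round(archived_fraction, 3))
```

The sum of the three rounds is 10940, and the resulting fraction is close to one third of the estimated repository size.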
We finished this project - of course we are happy to see improvements in the tooling, so please don't hesitate to open pull requests. Also, please add your desired x-maintenance-intent to your packages so they can be considered (the default value is to retain all packages).
Our work is only partially funded; we cross-fund our work through commercial contracts and public (EU) funding. We are part of a non-profit company. You can make a donation (tax-deductible in the EU; select "DONATION robur" in the dropdown menu), or sponsor us via the GitHub sponsor button.