Archivematica @ Wellcome Collection

Why we forked Archivematica


Last updated 1 year ago

We fork Archivematica to add support for our storage service. We've considered adding support to the upstream code (and deleting our forks), but this is non-trivial:

  • It means adding a new dependency to Archivematica (our storage service client library), which Artefactual are understandably reluctant to do.

  • Archivematica is designed to work with a variety of storage backends (e.g. S3, DuraCloud, Fedora), and our storage service is a bit of an "odd one out".

    Most of the storage backends can store packages very quickly, whereas our storage service is asynchronous and can sometimes take multiple hours to successfully store a package. We've had to change some of the code around timeouts and waiting for the storage backend.
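To illustrate the kind of waiting logic an asynchronous backend needs, here is a minimal sketch of polling with a generous timeout. It is not the code in our overlay: the function name, the status strings ("succeeded", "failed") and the default timings are all assumptions made for the example.

```python
import time
from typing import Callable


class StorageTimeout(Exception):
    """Raised when the backend has not finished within the allowed time."""


def wait_until_stored(
    check_status: Callable[[], str],       # hypothetical callback returning a status string
    timeout_seconds: float = 6 * 60 * 60,  # assumed: allow several hours, not seconds
    poll_interval: float = 60.0,           # assumed: check once a minute
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the package is stored, the backend fails, or we time out."""
    deadline = clock() + timeout_seconds
    while True:
        status = check_status()
        if status == "succeeded":
            return
        if status == "failed":
            raise RuntimeError("storage backend reported a failure")
        if clock() >= deadline:
            raise StorageTimeout(f"package not stored after {timeout_seconds} seconds")
        # Still in progress: wait before asking again.
        sleep(poll_interval)
```

The clock and sleep functions are parameters so the loop can be exercised in tests without waiting for real time to pass.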

How our forks work / how overlays work

Previously we maintained completely separate forks of the two Archivematica repositories (artefactual/archivematica and artefactual/archivematica-storage-service). Because we only modify a handful of files, we've replaced those forks with "overlays" that live in this repository.

The overlay works as follows:

  1. Clone the upstream Artefactual repository

  2. Copy our "overlay" files into the clone

  3. Run the docker build command inside the clone-plus-overlay
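The three steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the actual build script: the function names, the `.wellcome.` file-naming rule, and the arguments (repository URL, version, image tag) are placeholders chosen for the example.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def target_name(overlay_name: str) -> str:
    """Map e.g. 'storageService.wellcome.py' to 'storageService.py' (assumed naming rule)."""
    return overlay_name.replace(".wellcome.", ".", 1)


def copy_overlay(overlay_root: Path, clone_root: Path) -> List[Path]:
    """Copy every *.wellcome.* file into the clone at the same relative path."""
    copied = []
    for src in sorted(overlay_root.rglob("*.wellcome.*")):
        if not src.is_file():
            continue
        rel = src.relative_to(overlay_root)
        dst = clone_root / rel.with_name(target_name(src.name))
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)  # overwrites the upstream file of the same name
        copied.append(dst)
    return copied


def build_image(repo_url: str, version: str, overlay_root: Path, workdir: Path, tag: str) -> None:
    """Clone upstream, apply the overlay, then build the Docker image."""
    clone_root = workdir / "clone"
    # 1. Clone the upstream Artefactual repository at the chosen branch or tag.
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", version, repo_url, str(clone_root)],
        check=True,
    )
    # 2. Copy our overlay files into the clone.
    copy_overlay(overlay_root, clone_root)
    # 3. Run docker build inside the clone-plus-overlay.
    subprocess.run(["docker", "build", "--tag", tag, "."], cwd=clone_root, check=True)
```

Keeping the copy step in its own function means it can be tested against temporary directories without needing git or Docker.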

The overlay is designed to balance a few competing concerns:

  • We only want to diverge from the upstream Artefactual code in a handful of places

  • We don't want the overhead of a separate Archivematica fork

  • We want to be able to update to new versions of Archivematica

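The overlay is best explained with an example. In this repository, the folder vendor/src/archivematicaCommon/lib contains two files: storageService.artefactual.py and storageService.wellcome.py.

storageService.wellcome.py is our Wellcome-specific version of the file src/archivematicaCommon/lib/storageService.py in the core Archivematica repo, and storageService.artefactual.py is the unmodified upstream copy. When we build the Docker image, the Wellcome-specific files replace the upstream versions.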

We keep both the upstream and Wellcome-specific copy in the tree so that we can easily see how we've diverged. This also allows us to maintain the divergence if the upstream code changes, because we can see what our changes from the original were.

Updating to newer versions of Archivematica

Because we only fork in a handful of places, we should be able to update to newer Archivematica versions relatively easily.

It should be sufficient to bump the version of the Artefactual repo that we clone.

When you bump the version, you may get errors from the copy_overlay_files.py script warning that the artefactual copy of a file in our repo no longer matches the upstream version. This means that there have been changes in Archivematica that need to be mirrored to our repo.
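A check like this can be sketched as below. This is not the real copy_overlay_files.py, whose contents are not shown here. It is a hypothetical version that assumes each `*.artefactual.*` file in the overlay should be byte-for-byte identical to the matching file in the fresh upstream clone.

```python
from pathlib import Path
from typing import List


def upstream_path(artefactual_copy: Path, overlay_root: Path, clone_root: Path) -> Path:
    """Find the upstream file that an overlay '*.artefactual.*' copy mirrors (assumed naming rule)."""
    rel = artefactual_copy.relative_to(overlay_root)
    return clone_root / rel.with_name(rel.name.replace(".artefactual.", ".", 1))


def find_mismatches(overlay_root: Path, clone_root: Path) -> List[Path]:
    """Return the artefactual copies that no longer match the cloned upstream code."""
    mismatched = []
    for ours in sorted(overlay_root.rglob("*.artefactual.*")):
        theirs = upstream_path(ours, overlay_root, clone_root)
        # A missing upstream file (renamed or deleted) also counts as a mismatch.
        if not theirs.is_file() or ours.read_bytes() != theirs.read_bytes():
            mismatched.append(ours)
    return mismatched
```

A build script could refuse to continue if `find_mismatches` returns anything, so that an out-of-date overlay never silently overwrites newer upstream code.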

To fix these errors:

  1. Diff the artefactual/wellcome copies of the file, to determine what changes we've made.

  2. Copy the latest file from the artefactual repo into our codebase, replacing both the artefactual/wellcome copies of the file.

  3. Reapply any changes from the wellcome copy which you saw in step 1.
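For step 1, any diff tool will do. As one option, here is a small sketch using Python's standard difflib module. The helper name `overlay_diff` is made up for this example.

```python
import difflib
from pathlib import Path


def overlay_diff(artefactual_copy: Path, wellcome_copy: Path) -> str:
    """Return a unified diff showing how the Wellcome copy differs from upstream."""
    upstream_lines = artefactual_copy.read_text().splitlines(keepends=True)
    wellcome_lines = wellcome_copy.read_text().splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            upstream_lines,
            wellcome_lines,
            fromfile=artefactual_copy.name,
            tofile=wellcome_copy.name,
        )
    )
```

Saving this output before step 2 gives you a record of our changes to reapply in step 3, since step 2 overwrites both copies.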

Figure (for the example in "How our forks work / how overlays work"): screenshot of a file tree. The folder vendor/src/archivematicaCommon/lib contains storageService.artefactual.py and storageService.wellcome.py.