RFC 002: Archival Storage Service
This RFC proposes a service for storing archival and access copies of digital assets, ensuring long-term preservation and compliance with industry standards.
Last modified: 2020-06-01T08:46:31+01:00
Problem statement
We need to provide a service for storing archival and access copies of digital assets.
This service should:
Ensure the safe, long-term (i.e. decades) storage of our digital assets
Provide a scalable mechanism for identifying, retrieving, and storing content
Follow industry best-practices around file integrity and audit trails
Enable us to meet NDSA Level 4 for both digitised and "born-digital" assets
Suggested solution
We will build a storage service based on Amazon S3 and DynamoDB.

Assets are first uploaded to an ingest bucket in S3
These assets are packaged in .tar.gz files in the BagIt format, a Library of Congress standard for storing collections of digital files
The supplying system then initiates an ingest using an API, which:
Retrieves a copy of the bag from the ingest bucket
Unpacks and validates the bag, checking that the contents match those described by the BagIt metadata
Stores the bag in long-term storage and verifies it has been stored correctly
Creates a description of the stored bag and saves it to the Versioned Hybrid Store (a transactional store for large objects using S3 and DynamoDB)
Ingest
We'll need to integrate with other services such as:
Goobi - for digitisation workflow
Archivematica - for born-digital archives workflow
These services will need to provide assets in the BagIt format, compressed and uploaded to an S3 bucket. They should then call an ingest API and provide a callback URL that will be notified when the ingest has succeeded or failed.
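The callback mechanism is not specified in this RFC; one plausible shape, shown purely as an illustration, is that the storage service POSTs a small status document to the supplied URL when the ingest completes. The URL, identifier, and field names below are all assumptions:

```
POST /callback/b0000000 HTTP/1.1
Host: workflow.example
Content-Type: application/json

{
  "type": "Ingest",
  "id": "5b61c13c-1b2d-4c52-9e3a-8f1f0a4d2b6e",
  "status": {"id": "succeeded", "type": "Status"}
}
```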
When there is a distinction between archival and access assets, these should be submitted as separate bags. This allows storing archival assets and access assets in different kinds of storage.
Storage
Two copies of every bag will be stored in S3, one using the Glacier storage class and the other using the Infrequent Access storage class. A copy of every bag will also be stored in Azure Blob Storage using the Archive storage class.
Bags will be versioned in storage and all previous versions will be kept indefinitely. We will adopt a forward delta versioning model, where files in more recent versions of bags can refer to files in earlier versions.
In conjunction with workflow systems that provide only changed files, this model will enable us to reduce our storage costs and the amount of unnecessary reprocessing of unchanged files.
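As a sketch of how a forward delta might look in practice: per the BagIt specification (RFC 8493), each line of a fetch.txt gives a URL, a length, and a path within the bag. In v2 of a bag, unchanged files could point back at their v1 locations. The bucket name and paths here are illustrative:

```
s3://wc-storage/digitised/b0000000/v1/data/objects/0001.jp2 1870199 data/objects/0001.jp2
s3://wc-storage/digitised/b0000000/v1/data/objects/0002.jp2 1912662 data/objects/0002.jp2
```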
Locations
The storage service will use two AWS S3 buckets and one Azure Blob Storage container:
Warm primary storage: AWS S3 IA, Dublin
Cold replica storage, same provider: AWS S3 Glacier, Dublin
Cold replica storage, different provider: Azure Blob Storage Archive, Netherlands
Within each location, assets will be grouped into related spaces of content and identified by a source identifier, e.g.:
/digitised/b0000000/{bag contents}
/born-digital/0000-0000-0000-0000/{bag contents}
Assets
Assets will be stored in the above spaces inside the BagIt bags that were transferred for ingest. Unlike during transfer, bags will be stored uncompressed. BagIt is a standard archival file format: https://tools.ietf.org/html/rfc8493
The BagIt specification is organized around the notion of a “bag”. A bag is a named file system directory that minimally contains:
a “data” directory that includes the payload, or data files that comprise the digital content being preserved. Files can also be placed in subdirectories, but empty directories are not supported
at least one manifest file that itemizes the filenames present in the “data” directory, as well as their checksums. The particular checksum algorithm is included as part of the manifest filename. For instance a manifest file with MD5 checksums is named “manifest-md5.txt”
a “bagit.txt” file that identifies the directory as a bag, the version of the BagIt specification that it adheres to, and the character encoding used for tag files
From: BagIt on Wikipedia
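For illustration, a minimal bag satisfying these requirements might look like the following (the payload filenames are hypothetical; bag-info.txt is optional in the specification but is read by this service, as described below):

```
example-bag/
  bagit.txt
  bag-info.txt
  manifest-sha256.txt
  data/
    objects/
      0001.jp2
```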
Any additional preservation formats created during the ingest workflow will be treated in the same way as any other asset and stored alongside the original files. Workflow systems are expected to record the link between original and derivative assets in the METS files provided as part of the bag.
Bag description
The bag description created by the storage service provides a pointer to the stored bag and enough other metadata to provide a consumer with a comprehensive view of the contents of the bag. It is defined using types from a new Storage ontology and serialised using JSON-LD. We will use this to provide resources that describe stored bags, as part of the authenticated storage API.
This description does not contain metadata from the METS files within a bag, it is purely a storage level index. It will contain data from the bag-info.txt file and information about where the assets have been stored. METS files will be separately ingested in the catalogue and reporting pipelines.
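As a rough sketch of what such a description might contain (the ontology terms, field names, and values are illustrative assumptions, not the final vocabulary):

```
{
  "@context": "https://api.wellcomecollection.org/storage/v1/context.json",
  "type": "Bag",
  "id": "digitised/b0000000",
  "version": "v2",
  "info": {
    "externalIdentifier": "b0000000",
    "payloadOxum": "3782861.2",
    "baggingDate": "2020-06-01"
  },
  "locations": [
    {"provider": "aws-s3-ia", "bucket": "wc-storage", "path": "digitised/b0000000"},
    {"provider": "aws-s3-glacier", "bucket": "wc-storage-replica", "path": "digitised/b0000000"},
    {"provider": "azure-blob-archive", "container": "wc-storage-replica", "path": "digitised/b0000000"}
  ]
}
```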
Onward processing
The Versioned Hybrid Store which holds the bag descriptions provides an event stream of updates.
This event stream can be used to trigger downstream tasks, for example:
Sending a file for processing in our catalogue pipeline
Feeding other indexes (e.g. Elasticsearch) for reporting
The Versioned Hybrid Store also includes the ability to "reindex" the entire data store. This triggers an update event for every item in the data store, allowing you to re-run a downstream pipeline.
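The exact event shape will depend on the Versioned Hybrid Store implementation (DynamoDB streams impose their own envelope), but a downstream consumer might receive something like the following, shown purely as an assumption:

```
{
  "event": "update",
  "id": "digitised/b0000000",
  "version": "v2",
  "location": {"bucket": "wc-vhs", "key": "digitised/b0000000/v2.json"}
}
```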
API
The storage service will provide an API that can be used to ingest bags and retrieve information about stored bags. This API will be available publicly but will require authentication using OAuth. Only trusted applications will be granted access to this API.
API base path: https://api.wellcomecollection.org/storage/v1
Authentication
All API endpoints must require authentication using OAuth 2.0. In the first instance, the only supported OAuth grant type will be client credentials.
Clients must first request a time-limited token from the auth service, using a client ID and secret that we will provide:
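A sketch of the token request, using the standard client credentials flow (the token endpoint URL is a placeholder, not a confirmed address):

```
POST /oauth2/token HTTP/1.1
Host: auth.wellcomecollection.org
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials&client_id=CLIENT_ID&client_secret=CLIENT_SECRET
```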
This will return an access token:
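In the standard client credentials flow, the response takes this form (the token value is a placeholder):

```
{
  "access_token": "eyJraWQiOi...",
  "token_type": "Bearer",
  "expires_in": 3600
}
```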
This token must be provided on all subsequent requests in the Authorization header:
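For example, using the illustrative token from above:

```
GET /storage/v1/bags/digitised/b0000000 HTTP/1.1
Host: api.wellcomecollection.org
Authorization: Bearer eyJraWQiOi...
```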
Ingests
Storing a new bag
Request:
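A sketch of what this request might look like; the field names and structure are illustrative assumptions, not a final schema:

```
POST /storage/v1/ingests HTTP/1.1
Host: api.wellcomecollection.org
Authorization: Bearer TOKEN
Content-Type: application/json

{
  "type": "Ingest",
  "ingestType": {"id": "create", "type": "IngestType"},
  "space": {"id": "digitised", "type": "Space"},
  "sourceLocation": {
    "type": "Location",
    "provider": {"id": "aws-s3-standard", "type": "Provider"},
    "bucket": "wc-ingest",
    "path": "digitised/b0000000.tar.gz"
  },
  "callback": {"type": "Callback", "url": "https://workflow.example/callback/b0000000"}
}
```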
Response:
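A plausible response, again illustrative only: the ingest is accepted for asynchronous processing, and the Location header points at a resource that can be polled:

```
HTTP/1.1 201 Created
Location: https://api.wellcomecollection.org/storage/v1/ingests/5b61c13c-1b2d-4c52-9e3a-8f1f0a4d2b6e
Content-Type: application/json

{
  "type": "Ingest",
  "id": "5b61c13c-1b2d-4c52-9e3a-8f1f0a4d2b6e",
  "status": {"id": "accepted", "type": "Status"}
}
```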
Request:
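Presumably this second exchange polls the ingest for progress, e.g.:

```
GET /storage/v1/ingests/5b61c13c-1b2d-4c52-9e3a-8f1f0a4d2b6e HTTP/1.1
Host: api.wellcomecollection.org
Authorization: Bearer TOKEN
```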
Response:
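With an illustrative response reporting the status and processing events (field names are assumptions):

```
{
  "type": "Ingest",
  "id": "5b61c13c-1b2d-4c52-9e3a-8f1f0a4d2b6e",
  "status": {"id": "succeeded", "type": "Status"},
  "events": [
    {"type": "IngestEvent", "description": "Ingest accepted"},
    {"type": "IngestEvent", "description": "Bag verified and stored"}
  ]
}
```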
Updating an existing bag
As above, but use an ingestType of update. You must also supply the id and version of the bag being updated.
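A sketch of an update request, reusing the illustrative schema from above; how the id and version are carried (here as a bag object) is an assumption:

```
{
  "type": "Ingest",
  "ingestType": {"id": "update", "type": "IngestType"},
  "space": {"id": "digitised", "type": "Space"},
  "bag": {"id": "b0000000", "version": "v1", "type": "Bag"},
  "sourceLocation": {
    "type": "Location",
    "provider": {"id": "aws-s3-standard", "type": "Provider"},
    "bucket": "wc-ingest",
    "path": "digitised/b0000000-update.tar.gz"
  }
}
```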
When storing an update, the service will:
Check that the supplied version matches the current version
Unpack the supplied bag
Store the supplied bag as a new version
Register the new version of the bag as the current version
Partial updates, where files that are not changed are not resupplied, are supported through the use of fetch.txt in the supplied bag. File references must specify the full storage location of the previously supplied file, including the version number of the bag in which it was last supplied in the path.
Updates with fetch files should be processed as follows:
Check that files in fetch.txt reference files in the correct bag
Check that files in fetch.txt exist most recently at the specified version
Process as for a complete update
An example of a bag that uses fetch.txt for updating digitised content is provided later in this document.
Bags
Request:
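A sketch of the request, following the space/identifier scheme from the Locations section:

```
GET /storage/v1/bags/{space}/{id} HTTP/1.1
Host: api.wellcomecollection.org
Authorization: Bearer TOKEN
```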
Response:
See examples below
Examples
Digitised content
Digitised content will be ingested using Goobi, which should provide the bag layout defined below.
Complete bag
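The exact layout will be agreed with Goobi; as an illustration only, a complete digitised bag might look like:

```
b0000000/
  bagit.txt
  bag-info.txt
  manifest-sha256.txt
  tagmanifest-sha256.txt
  data/
    b0000000.xml          (METS file)
    objects/
      0001.jp2
      0002.jp2
```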
Partial bag
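A partial bag for an update resupplies only the changed files, with fetch.txt pointing at everything else (illustrative):

```
b0000000/
  bagit.txt
  bag-info.txt
  fetch.txt               (references the unchanged files in v1)
  manifest-sha256.txt     (lists every file, supplied or fetched)
  data/
    b0000000.xml          (updated METS file)
    objects/
      0002.jp2            (the one changed image)
```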
Note that all files must be present in the manifest, and only files that are not supplied should be present in fetch.txt.
METS
The existing METS structure should be changed to reflect the following. The main change is removing data from Preservica and replacing it with PREMIS object metadata.
API
Request:
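For example, fetching the latest version of a digitised bag (illustrative):

```
GET /storage/v1/bags/digitised/b0000000 HTTP/1.1
Host: api.wellcomecollection.org
Authorization: Bearer TOKEN
```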
Response:
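An abridged sketch of the response, following the bag description format outlined earlier (field names remain assumptions):

```
{
  "type": "Bag",
  "id": "digitised/b0000000",
  "version": "v2",
  "info": {"externalIdentifier": "b0000000"},
  "locations": [
    {"provider": "aws-s3-ia", "bucket": "wc-storage", "path": "digitised/b0000000"}
  ]
}
```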
Request:
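Given the version listing described below, presumably something like the following (the /versions path is an assumption):

```
GET /storage/v1/bags/digitised/b0000000/versions HTTP/1.1
Host: api.wellcomecollection.org
Authorization: Bearer TOKEN
```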
Response:
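An illustrative response, newest first:

```
{
  "type": "ResultList",
  "results": [
    {"id": "digitised/b0000000", "version": "v2", "createdDate": "2020-06-01T08:46:31Z"},
    {"id": "digitised/b0000000", "version": "v1", "createdDate": "2020-05-01T10:00:00Z"}
  ]
}
```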
Versions will be listed in decreasing order, with newer versions listed first.
Responses may be paginated -- use ?before=vN to see versions before vN.
Born-digital archives (AIPs)
Born-digital archives will be ingested using Archivematica, which has a pre-existing bag layout for AIPs that we have to adopt.
Bag
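Archivematica's AIP layout is fixed by that system; roughly, and purely as an illustration of its shape, it is a bag whose payload contains the AIP's METS file and the original objects (the identifier and filename are placeholders):

```
0000-0000-0000-0000/
  bagit.txt
  bag-info.txt
  manifest-sha256.txt
  data/
    METS.0000-0000-0000-0000.xml
    objects/
      correspondence.pdf
```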
METS
The METS file will be as provided out of the box by Archivematica.
API
Request:
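For example, using the identifier scheme from the Locations section:

```
GET /storage/v1/bags/born-digital/0000-0000-0000-0000 HTTP/1.1
Host: api.wellcomecollection.org
Authorization: Bearer TOKEN
```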
Response:
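An abridged, illustrative response in the same bag description format:

```
{
  "type": "Bag",
  "id": "born-digital/0000-0000-0000-0000",
  "version": "v1",
  "locations": [
    {"provider": "aws-s3-ia", "bucket": "wc-storage", "path": "born-digital/0000-0000-0000-0000"}
  ]
}
```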