RFC 002: Archival Storage Service

This RFC proposes a service for storing archival and access copies of digital assets, ensuring long-term preservation and compliance with industry standards.

Last modified: 2020-06-01T08:46:31+01:00

Problem statement

We need to provide a service for storing archival and access copies of digital assets.

This service should:

  • Ensure the safe, long-term (i.e. decades) storage of our digital assets

  • Provide a scalable mechanism for identifying, retrieving, and storing content

  • Follow industry best-practices around file integrity and audit trails

  • Enable us to meet NDSA Level 4 for both digitised and "born-digital" assets

Suggested solution

We will build a storage service based on Amazon S3 and DynamoDB.

[Architecture diagram]

The proposed flow is as follows:
  • Assets are first uploaded to an ingest bucket in S3

    • These assets are packaged in .tar.gz files in the BagIt format, a Library of Congress standard for storing collections of digital files

  • The supplying system then initiates an ingest using an API, which:

    1. Retrieves a copy of the bag from the ingest bucket

    2. Unpacks and validates the bag, checking that the contents match those described by the BagIt metadata (see the sketch after this list)

    3. Stores the bag in long-term storage and verifies it has been stored correctly

    4. Creates a description of the stored bag and saves it to the Versioned Hybrid Store (a transactional store for large objects using S3 and DynamoDB)
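
As an illustration of the validation step, a minimal sketch of checking a bag's payload against its MD5 manifest might look like the following. This is not the service's implementation; the bag path and the choice of MD5 are assumptions for illustration.

    import hashlib
    from pathlib import Path

    def validate_bag(bag_dir: str) -> bool:
        """Check that every file listed in manifest-md5.txt exists and has the expected checksum."""
        bag = Path(bag_dir)
        manifest = bag / "manifest-md5.txt"  # per RFC 8493: one "<checksum> <filepath>" pair per line
        for line in manifest.read_text().splitlines():
            expected, relative_path = line.split(maxsplit=1)
            payload_file = bag / relative_path
            if not payload_file.is_file():
                return False
            actual = hashlib.md5(payload_file.read_bytes()).hexdigest()
            if actual != expected:
                return False
        return True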

Ingest

We'll need to integrate with other services, such as Goobi (for digitised content) and Archivematica (for born-digital archives).

These services will need to provide assets in the BagIt format, compressed and uploaded to an S3 bucket. They should then call an ingest API and provide a callback URL that will be notified when the ingest has succeeded or failed.
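
As a sketch of what a supplying system might do, assuming an illustrative bucket name and a hypothetical ingests endpoint and request fields:

    import boto3
    import requests

    # Upload the compressed bag to the ingest bucket (bucket and key are illustrative).
    s3 = boto3.client("s3")
    s3.upload_file("b0000001.tar.gz", "example-ingest-bucket", "digitised/b0000001.tar.gz")

    # Ask the storage service to ingest it, supplying a callback URL to be notified
    # when the ingest succeeds or fails.
    response = requests.post(
        "https://api.wellcomecollection.org/storage/v1/ingests",  # hypothetical endpoint path
        headers={"Authorization": "Bearer <access token>"},
        json={
            "uploadUrl": "s3://example-ingest-bucket/digitised/b0000001.tar.gz",  # assumed field name
            "callbackUrl": "https://goobi.example.org/callbacks/b0000001",        # assumed field name
        },
    )
    response.raise_for_status()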

When there is a distinction between archival and access assets, these should be submitted as separate bags. This allows storing archival assets and access assets in different kinds of storage.

Storage

Two copies of every bag will be stored in S3, one using the Glacier storage class and the other using the Infrequent Access storage class. A copy of every bag will also be stored in Azure Blob Storage using the Archive storage class.
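
A sketch of the replication step, with illustrative bucket and container names (the real service may copy objects differently):

    import boto3
    from azure.storage.blob import BlobServiceClient, StandardBlobTier

    s3 = boto3.client("s3")
    key = "digitised/b0000001/data/objects/image_0001.jp2"  # illustrative object key
    source = {"Bucket": "example-ingest-bucket", "Key": key}

    # Warm primary copy: S3 Infrequent Access
    s3.copy_object(CopySource=source, Bucket="example-primary-bucket", Key=key, StorageClass="STANDARD_IA")

    # Cold replica, same provider: S3 Glacier
    s3.copy_object(CopySource=source, Bucket="example-glacier-bucket", Key=key, StorageClass="GLACIER")

    # Cold replica, different provider: Azure Blob Storage, Archive tier
    blob_service = BlobServiceClient.from_connection_string("<azure connection string>")
    blob_client = blob_service.get_blob_client(container="example-replica-container", blob=key)
    body = s3.get_object(Bucket="example-primary-bucket", Key=key)["Body"]
    blob_client.upload_blob(body, standard_blob_tier=StandardBlobTier.ARCHIVE, overwrite=True)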

Bags will be versioned in storage and all previous versions will be kept indefinitely. We will adopt a forward delta versioning model, where files in more recent versions of bags can refer to files in earlier versions.

In conjunction with workflow systems that provide only changed files, this model will enable us to reduce our storage costs and avoid unnecessary reprocessing of unchanged files.

Locations

The storage service will use two AWS S3 buckets and one Azure Blob Storage container:

  • Warm primary storage: AWS S3 IA, Dublin

  • Cold replica storage, same provider: AWS S3 Glacier, Dublin

  • Cold replica storage, different provider: Azure Blob Storage Archive, Netherlands

Within each location, assets will be grouped into related spaces of content and identified by a source identifier, e.g.:

  • /digitised/b0000000/{bag contents}

  • /born-digital/0000-0000-0000-0000/{bag contents}

Assets

Assets will be stored in the above spaces inside the BagIt bags that were transferred for ingest. Unlike during transfer, bags will be stored uncompressed. BagIt is a standard archival file format: https://tools.ietf.org/html/rfc8493

The BagIt specification is organized around the notion of a “bag”. A bag is a named file system directory that minimally contains:

  • a “data” directory that includes the payload, or data files that comprise the digital content being preserved. Files can also be placed in subdirectories, but empty directories are not supported

  • at least one manifest file that itemizes the filenames present in the “data” directory, as well as their checksums. The particular checksum algorithm is included as part of the manifest filename. For instance a manifest file with MD5 checksums is named “manifest-md5.txt”

  • a “bagit.txt” file that identifies the directory as a bag, the version of the BagIt specification that it adheres to, and the character encoding used for tag files

From: BagIt on Wikipedia
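
For illustration, an uncompressed bag stored in the digitised space might be laid out like this (the file names are hypothetical):

    digitised/b0000001/
        bagit.txt
        bag-info.txt
        manifest-md5.txt
        data/
            b0000001.xml            (METS file)
            objects/
                image_0001.jp2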

Any additional preservation formats created during the ingest workflow will be treated in the same way as any other asset and stored alongside the original files. Workflow systems are expected to record the link between original and derivative assets in the METS files provided as part of the bag.

Bag description

The bag description created by the storage service provides a pointer to the stored bag and enough other metadata to provide a consumer with a comprehensive view of the contents of the bag. It is defined using types from a new Storage ontology and serialised using JSON-LD. We will use this to provide resources that describe stored bags, as part of the authenticated storage API.

This description does not contain metadata from the METS files within a bag; it is purely a storage-level index. It will contain data from the bag-info.txt file and information about where the assets have been stored. METS files will be ingested separately into the catalogue and reporting pipelines.

Onward processing

The Versioned Hybrid Store, which holds the bag descriptions, provides an event stream of updates.

This event stream can be used to trigger downstream tasks, for example:

  • Sending a file for processing in our catalogue pipeline

  • Feeding other indexes (e.g. Elasticsearch) for reporting

The Versioned Hybrid Store also includes the ability to "reindex" the entire data store. This triggers an update event for every item in the data store, allowing you to re-run a downstream pipeline.
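
A sketch of a downstream consumer of this event stream, assuming the updates are delivered as queue messages containing the bag's space and identifier (the message shape is an assumption):

    import json

    def handle_update_events(event: dict) -> None:
        """Hypothetical handler: forward each updated bag to the catalogue pipeline."""
        for record in event.get("Records", []):
            message = json.loads(record["body"])       # assumed message shape
            space, bag_id = message["space"], message["id"]
            send_to_catalogue_pipeline(space, bag_id)  # placeholder for a downstream task

    def send_to_catalogue_pipeline(space: str, bag_id: str) -> None:
        print(f"Queued {space}/{bag_id} for catalogue processing")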

API

The storage service will provide an API that can be used to ingest bags and retrieve information about stored bags. This API will be available publicly but will require authentication using OAuth. Only trusted applications will be granted access to this API.

API base path: https://api.wellcomecollection.org/storage/v1

Authentication

All API endpoints must require authentication using OAuth 2.0. In the first instance, the only supported OAuth grant type will be client credentials.

Clients must first request a time-limited token from the auth service, using a client ID and secret that we will provide. The auth service will return an access token, which must be provided on all subsequent requests in the Authorization header.
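
A sketch of this flow, assuming a hypothetical token endpoint and a hypothetical bags resource path:

    import requests

    # 1. Exchange client credentials for a time-limited access token (token URL is illustrative).
    token_response = requests.post(
        "https://auth.wellcomecollection.org/oauth2/token",  # hypothetical auth service URL
        data={
            "grant_type": "client_credentials",
            "client_id": "<client id>",
            "client_secret": "<client secret>",
        },
    )
    access_token = token_response.json()["access_token"]

    # 2. Send the token in the Authorization header on every subsequent request.
    bags = requests.get(
        "https://api.wellcomecollection.org/storage/v1/bags/digitised/b0000001",  # hypothetical path
        headers={"Authorization": f"Bearer {access_token}"},
    )
    print(bags.status_code)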

Ingests

Storing a new bag

Request:

Response:

Request:

Response:

Updating an existing bag

As above, but use an ingestType of update. You must also supply the id and version of the bag being updated.
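
A sketch of an update request, using the same hypothetical endpoint and field names as the ingest example earlier; only ingestType, id, and version come from this RFC, the rest is illustrative:

    import requests

    response = requests.post(
        "https://api.wellcomecollection.org/storage/v1/ingests",  # hypothetical endpoint path
        headers={"Authorization": "Bearer <access token>"},
        json={
            "ingestType": "update",  # rather than the default used for a new bag
            "id": "b0000001",        # identifier of the bag being updated
            "version": "v2",         # must match the current stored version
            "uploadUrl": "s3://example-ingest-bucket/digitised/b0000001.tar.gz",  # assumed field name
        },
    )
    response.raise_for_status()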

When storing an update, the service will:

  • Check that the supplied version matches the current version

  • Unpack the supplied bag

  • Store the supplied bag as a new version

  • Register the new version of the bag as the current version

Partial updates, where unchanged files are not resupplied, are supported through the use of a fetch.txt file in the supplied bag. Each file reference must specify the full storage location of the previously supplied file, including in its path the version of the bag in which the file was last supplied.
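
For illustration, a fetch.txt entry for an unchanged file last supplied in version v1 might look like this, giving the previously stored location, the size in bytes, and the path within the bag (the bucket name and size are hypothetical):

    s3://example-primary-bucket/digitised/b0000001/v1/data/objects/image_0001.jp2 1234567 data/objects/image_0001.jp2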

Updates with fetch files should be processed as follows:

  • Check that files in fetch.txt reference files in the correct bag

  • Check that each file in fetch.txt was most recently supplied at the specified version

  • Process as for a complete update

An example of a bag that uses fetch.txt for updating digitised content is provided later in this document.

Bags

Request:

Response:

See examples below

Examples

Digitised content

Digitised content will be ingested using Goobi, which should provide the bag layout defined below.

Complete bag

Partial bag

Note that all files must be present in the manifest, and only files that are not supplied should be present in fetch.txt.

METS

The existing METS structure should be changed; the main change is removing the data from Preservica and replacing it with PREMIS object metadata.

API

Request:

Response:

Request:

Response:

Versions will be listed in decreasing order, with the newest version first. Responses may be paginated; use ?before=vN to see versions before vN.

Born-digital archives (AIPs)

Born-digital archives will be ingested using Archivematica, which has a pre-existing bag layout for AIPs that we have to adopt.

Bag

METS

The METS file will be as provided out of the box by Archivematica.

API

Request:

Response:
