
Introduction



The storage service manages the storage of our digital collections, including:

  • Uploading files to cloud storage providers like Amazon S3 and Azure Blob

  • Verifying fixity information on our files (checksums, sizes, filenames)

  • Reporting on the contents of our digital archive through machine-readable APIs and search tools
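
The fixity checks mentioned above amount to comparing recorded checksums against freshly computed ones. As a minimal sketch (not the storage service's actual verifier), checking one file against a BagIt-style `manifest-sha256.txt` entry might look like this:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading in 64 KiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest_line(bag_root: Path, line: str) -> bool:
    """Check one 'checksum  relative/path' line from manifest-sha256.txt."""
    expected, rel_path = line.split(maxsplit=1)
    return sha256_of(bag_root / rel_path.strip()) == expected

# Demonstration with a throwaway file
root = Path(tempfile.mkdtemp())
(root / "data").mkdir()
(root / "data" / "hello.txt").write_bytes(b"hello world\n")
digest = sha256_of(root / "data" / "hello.txt")
print(verify_manifest_line(root, f"{digest}  data/hello.txt"))  # True
```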

Requirements

The storage service is designed to:

  • Ensure the safe, long-term (i.e. decades) storage of our digital assets

  • Provide a scalable mechanism for identifying, retrieving, and storing content

  • Support bulk processing of content, e.g. for file format migrations or batch analysis

  • Follow industry best-practices around file integrity and audit trails

  • Enable us to meet NDSA Level 4 for both digitised and "born-digital" assets

High-level design

This is the basic architecture:

Workflow systems (Goobi, Archivematica) create "bags", which are collections of files stored in the BagIt packaging format. They upload these bags to a temporary S3 bucket, and call the storage service APIs to ask it to store the bags permanently.

The storage service reads the bags, verifies their contents, and replicates the bags to our permanent storage (S3 buckets/Azure containers). It is the only thing which writes to our permanent storage; this ensures everything is stored and labelled consistently.

Delivery systems (e.g. DLCS) can then read objects back out of permanent storage, to provide access to users.

Documentation

This GitBook space includes:

  • How-to guides explaining how to do common operations, e.g. upload new files into the storage service

  • Reference material explaining how the storage service is designed, and why we made those choices

  • Notes for Wellcome developers who need to administer or debug our storage service deployment

Repo

The READMEs in the repo have instructions for specific procedures, e.g. how to create new Docker images. This GitBook is meant to be a bit higher-level.


An ingest is a record of some processing on a bag, such as creating a new bag or adding a new version of a bag.
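
The difference between "creating a new bag" and "adding a new version" shows up in the ingests API as the ingest type. The sketch below builds a request body of the general shape described in the API reference; the bucket and identifier values are made up, and the exact field names should be checked against the API reference rather than trusted from this sketch.

```python
import json

def ingest_request(space, external_identifier, bucket, path, ingest_type="create"):
    """Build an illustrative request body for POST /ingests.

    ingest_type is "create" for a brand-new bag, or "update" for a
    new version of an existing bag.  (Shape based on the public API
    reference; treat as illustrative, not authoritative.)
    """
    assert ingest_type in ("create", "update")
    return {
        "type": "Ingest",
        "ingestType": {"id": ingest_type, "type": "IngestType"},
        "space": {"id": space, "type": "Space"},
        "externalIdentifier": external_identifier,
        "sourceLocation": {
            "type": "Location",
            "provider": {"type": "Provider", "id": "amazon-s3"},
            "bucket": bucket,
            "path": path,
        },
    }

body = ingest_request("digitised", "b12345678",
                      "example-ingests-bucket", "b12345678.tar.gz")
print(json.dumps(body, indent=2))
```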

Getting started: use Terraform and AWS to run the storage service

How-to

Once you have a running instance of the storage service, you can use it to store bags. These guides walk you through some basic operations:

  • Ingest a bag into the storage service

  • Look up an already-stored bag in the storage service

  • Look up the versions of a bag in the storage service

Once you're comfortable storing individual bags, you can read about more advanced topics:

  • Storing multiple versions of the same bag

  • Sending a partial update to a bag

  • Storing preservation and access copies in different storage classes

  • Reporting on the contents of the storage service

  • Getting callback notifications from the storage service

and some information about what to do when things go wrong:

  • Why ingests fail: understanding ingest errors

  • Operational monitoring of the storage service

Reference

These topics explain how the storage service works, and why it's designed the way it is:

  • Detailed architecture: what do the different services do?

  • How bags are verified

  • How bags are versioned

Developer information

These topics are useful for a developer looking to modify or extend the storage service.

  • Adding support for another replica location (e.g. Google Cloud)

  • Locking around operations in S3 and Azure Blob

Developer workflow:

All our storage service code is in https://github.com/wellcomecollection/storage-service.

The unit of storage in the storage service is a bag. This is a collection of files packaged together in the BagIt packaging format, which are ingested and stored as a single unit.
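
For illustration, here is a minimal sketch of what a BagIt bag looks like on disk, following RFC 8493: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest. The bags the storage service actually ingests carry more metadata than this, and the file contents below are made up.

```python
import hashlib
import tempfile
from pathlib import Path

def make_minimal_bag(root: Path) -> None:
    """Write the smallest structure RFC 8493 allows: declaration,
    one payload file, and a SHA-256 manifest covering it."""
    (root / "data").mkdir(parents=True)
    payload = root / "data" / "report.txt"
    payload.write_bytes(b"digitised page scans would live here\n")

    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    digest = hashlib.sha256(payload.read_bytes()).hexdigest()
    (root / "manifest-sha256.txt").write_text(f"{digest}  data/report.txt\n")

bag = Path(tempfile.mkdtemp()) / "b12345678"
make_minimal_bag(bag)
print(sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file()))
# ['bagit.txt', 'data/report.txt', 'manifest-sha256.txt']
```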

Each bag is identified by a space (a broad category), an external identifier (a specific identifier), and a version. Read more about identifiers in "How identifiers work in the storage service".
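
As a sketch of how the three parts combine, a bag's location in permanent storage can be thought of as a space/external-identifier/version hierarchy. The function and example values below are illustrative; the authoritative details are in "How identifiers work in the storage service" and "How files are laid out in the underlying storage".

```python
def bag_prefix(space: str, external_identifier: str, version: int) -> str:
    """Illustrative key prefix for a bag in permanent storage:
    space (broad category) / external identifier / version."""
    return f"{space}/{external_identifier}/v{version}"

print(bag_prefix("digitised", "b12345678", 1))    # digitised/b12345678/v1
print(bag_prefix("born-digital", "PP/CRI/1", 2))  # born-digital/PP/CRI/1/v2
```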

We have a Terraform configuration that spins up an instance of the storage service. You can use this to try the storage service in your own AWS account.

You can read the API reference for more detailed information about how to use the storage service.

We also have the storage service RFC, the original design document; note that it isn't actively updated, and some of the details have changed in the implementation.
