
Introduction



The storage service manages the storage of our digital collections, including:

  • Uploading files to cloud storage providers like Amazon S3 and Azure Blob

  • Verifying fixity information on our files (checksums, sizes, filenames)

  • Reporting on the contents of our digital archive through machine-readable APIs and search tools
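
The fixity checks mentioned above amount to comparing recorded checksums against freshly computed ones. As a minimal sketch (not the storage service's actual verifier), checking one file against a BagIt-style `manifest-sha256.txt` entry might look like this:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading in 64 KiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest_line(bag_root: Path, line: str) -> bool:
    """Check one 'checksum  relative/path' line from manifest-sha256.txt."""
    expected, rel_path = line.split(maxsplit=1)
    return sha256_of(bag_root / rel_path.strip()) == expected

# Demonstration with a throwaway file
root = Path(tempfile.mkdtemp())
(root / "data").mkdir()
(root / "data" / "hello.txt").write_bytes(b"hello world\n")
digest = sha256_of(root / "data" / "hello.txt")
print(verify_manifest_line(root, f"{digest}  data/hello.txt"))  # True
```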

Requirements

The storage service is designed to:

  • Ensure the safe, long-term (i.e. decades) storage of our digital assets

  • Provide a scalable mechanism for identifying, retrieving, and storing content

  • Support bulk processing of content, e.g. for file format migrations or batch analysis

  • Follow industry best-practices around file integrity and audit trails

  • Enable us to meet NDSA Level 4 for both digitised and "born-digital" assets

High-level design

This is the basic architecture:

Workflow systems (Goobi, Archivematica) create "bags", which are collections of files stored in the BagIt packaging format. They upload these bags to a temporary S3 bucket, and call the storage service APIs to ask it to store the bags permanently.

The storage service reads the bags, verifies their contents, and replicates the bags to our permanent storage (S3 buckets/Azure containers). It is the only thing which writes to our permanent storage; this ensures everything is stored and labelled consistently.

Delivery systems (e.g. DLCS) can then read objects back out of permanent storage, to provide access to users.

Documentation

This GitBook space includes:

  • How-to guides explaining how to do common operations, e.g. upload new files into the storage service

  • Reference material explaining how the storage service is designed, and why we made those choices

  • Notes for Wellcome developers who need to administer or debug our storage service deployment

Repo

The READMEs in the repo have instructions for specific procedures, e.g. how to create new Docker images. This GitBook is meant to be a bit higher-level.


An ingest is a record of some processing on a bag, such as creating a new bag or adding a new version of a bag.
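
The difference between "creating a new bag" and "adding a new version" shows up in the ingests API as the ingest type. The sketch below builds a request body of the general shape described in the API reference; the bucket and identifier values are made up, and the exact field names should be checked against the API reference rather than trusted from this sketch.

```python
import json

def ingest_request(space, external_identifier, bucket, path, ingest_type="create"):
    """Build an illustrative request body for POST /ingests.

    ingest_type is "create" for a brand-new bag, or "update" for a
    new version of an existing bag.  (Shape based on the public API
    reference; treat as illustrative, not authoritative.)
    """
    assert ingest_type in ("create", "update")
    return {
        "type": "Ingest",
        "ingestType": {"id": ingest_type, "type": "IngestType"},
        "space": {"id": space, "type": "Space"},
        "externalIdentifier": external_identifier,
        "sourceLocation": {
            "type": "Location",
            "provider": {"type": "Provider", "id": "amazon-s3"},
            "bucket": bucket,
            "path": path,
        },
    }

body = ingest_request("digitised", "b12345678",
                      "example-ingests-bucket", "b12345678.tar.gz")
print(json.dumps(body, indent=2))
```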

Getting started: use Terraform and AWS to run the storage service

How-to

Once you have a running instance of the storage service, you can use it to store bags. These guides walk you through some basic operations:

  • Ingest a bag into the storage service

  • Look up an already-stored bag in the storage service

  • Look up the versions of a bag in the storage service

Once you're comfortable storing individual bags, you can read about more advanced topics:

  • Storing multiple versions of the same bag

  • Sending a partial update to a bag

  • Storing preservation and access copies in different storage classes

  • Reporting on the contents of the storage service

  • Getting callback notifications from the storage service

and some information about what to do when things go wrong:

  • Why ingests fail: understanding ingest errors

  • Operational monitoring of the storage service

Reference

These topics explain how the storage service works, and why it's designed the way it is:

  • Detailed architecture: what do the different services do?

  • How bags are verified

  • How bags are versioned

Developer information

These topics are useful for a developer looking to modify or extend the storage service.

  • Adding support for another replica location (e.g. Google Cloud)

  • Locking around operations in S3 and Azure Blob

Developer workflow:

All our storage service code is in https://github.com/wellcomecollection/storage-service.

The unit of storage in the storage service is a bag. This is a collection of files packaged together in the BagIt packaging format, which are ingested and stored as a single unit.
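
For illustration, here is a minimal sketch of what a BagIt bag looks like on disk, following RFC 8493: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest. The bags the storage service actually ingests carry more metadata than this, and the file contents below are made up.

```python
import hashlib
import tempfile
from pathlib import Path

def make_minimal_bag(root: Path) -> None:
    """Write the smallest structure RFC 8493 allows: declaration,
    one payload file, and a SHA-256 manifest covering it."""
    (root / "data").mkdir(parents=True)
    payload = root / "data" / "report.txt"
    payload.write_bytes(b"digitised page scans would live here\n")

    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    digest = hashlib.sha256(payload.read_bytes()).hexdigest()
    (root / "manifest-sha256.txt").write_text(f"{digest}  data/report.txt\n")

bag = Path(tempfile.mkdtemp()) / "b12345678"
make_minimal_bag(bag)
print(sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file()))
# ['bagit.txt', 'data/report.txt', 'manifest-sha256.txt']
```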

Each bag is identified by a space (a broad category), an external identifier (a specific identifier), and a version. Read more about identifiers in "How identifiers work in the storage service".
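
As a sketch of how the three parts combine, a bag's location in permanent storage can be thought of as a space/external-identifier/version hierarchy. The function and example values below are illustrative; the authoritative details are in "How identifiers work in the storage service" and "How files are laid out in the underlying storage".

```python
def bag_prefix(space: str, external_identifier: str, version: int) -> str:
    """Illustrative key prefix for a bag in permanent storage:
    space (broad category) / external identifier / version."""
    return f"{space}/{external_identifier}/v{version}"

print(bag_prefix("digitised", "b12345678", 1))    # digitised/b12345678/v1
print(bag_prefix("born-digital", "PP/CRI/1", 2))  # born-digital/PP/CRI/1/v2
```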

We have a Terraform configuration that spins up an instance of the storage service. You can use this to try the storage service in your own AWS account.

You can read the API reference for more detailed information about how to use the storage service.

We also have the storage service RFC, the original design document; note that it isn't actively updated, and some of the details have changed in the implementation.
