How identifiers work in the storage service

Last updated 2 years ago

Bags in the storage service have a three-part identifier:

  • Space: the broad category of a bag. Examples: digitised, born-digital.

  • External identifier: the identifier of a bag within a space. This is typically an identifier from another system, which matches this bag to that record. Examples: b31497652, PP/CRI/A/2.

  • Version: an auto-incrementing numeric value. This tracks distinct versions of a (space, external identifier) pair. Examples: v1, v2, v3.

The space and external identifier are supplied by the user; the version is automatically generated by the storage service.

These three parts can be combined into a single string that uniquely identifies a bag; for example, digitised/b31497652/v2. This identifier is also the path to the root of the bag inside our storage buckets.
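As an illustration, the combination of the three parts can be sketched in Python. This is a minimal sketch, not the storage service's own code: the names BagId and parse_bag_id are hypothetical, and it assumes that a space never contains a slash while an external identifier may (as in PP/CRI/A/2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BagId:
    """Hypothetical model of the three-part bag identifier."""

    space: str
    external_identifier: str
    version: int

    def __str__(self) -> str:
        # Combine the parts as space/external identifier/vN.
        return f"{self.space}/{self.external_identifier}/v{self.version}"


def parse_bag_id(value: str) -> BagId:
    """Split a combined identifier back into its three parts.

    Assumption: the space is the first path segment and the version is
    the last one, so everything in between is the external identifier
    (which may itself contain slashes).
    """
    space, _, rest = value.partition("/")
    external_identifier, _, version = rest.rpartition("/")
    valid_version = version.startswith("v") and version[1:].isdecimal()
    if not space or not external_identifier or not valid_version:
        raise ValueError(f"Not a valid bag identifier: {value!r}")
    return BagId(space, external_identifier, int(version[1:]))
```

For example, str(BagId("digitised", "b31497652", 2)) gives digitised/b31497652/v2, and parsing that string returns the same three parts.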

Why did we choose this approach?

  • We want identifiers that are human-readable and understandable. (As opposed to, say, UUIDs.)

  • We match bags to records in systems outside the storage service (for example, the library catalogue). This approach allows us to use the same identifier as the external system, rather than inventing another type of identifier.

  • This structure allows us to group related content by space within the underlying storage, in a way that is human-readable:

    digitised/
      ├── record1/
      │     ├── v1/
      │     │    ├── bagit.txt
      │     │    ├── bag-info.txt
      │     │    └── data/
      │     │          ├── record1_0001.jp2
      │     │          └── ...
      │     └── v2/
      │          └── ...
      │
      └── record2/
            └── v1/
                 └── ...

    The human-readable storage layout means our files are not tied to the specific software implementation of the storage service.
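Because the layout is plain directories, the versions of a bag can be found with ordinary file tools and no storage service code. The sketch below assumes a local copy of the layout under a hypothetical root directory; list_versions is an illustrative name, not part of the storage service.

```python
from pathlib import Path


def list_versions(root: Path, space: str, external_identifier: str) -> list[int]:
    """Return the version numbers stored for one (space, external identifier) pair.

    Assumption: `root` is a local directory that mirrors the layout shown
    above, with one subdirectory per version named v1, v2, and so on.
    """
    bag_dir = root / space / external_identifier
    versions = []
    for child in bag_dir.iterdir():
        name = child.name
        # Only directories named "v" followed by digits count as versions.
        if child.is_dir() and name.startswith("v") and name[1:].isdecimal():
            versions.append(int(name[1:]))
    # Sort numerically, so that v10 comes after v2.
    return sorted(versions)
```

The versions are sorted as numbers rather than as strings, because a plain alphabetical listing would put v10 before v2.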
