
Ingest a bag into the storage service


Last updated 2 years ago

This guide explains how to store a bag in the storage service.

You need:

  • A bag in the BagIt packaging format. This bag should have a single External-Identifier in the bag-info.txt file.

    There is an example bag with the External-Identifier test_bag in the same directory as this guide.
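
One way to check the External-Identifier requirement before uploading is to count the matching lines in bag-info.txt. This is a minimal sketch rather than part of the storage service tooling: count_external_identifiers is a hypothetical helper, and the throwaway bag-info.txt it inspects exists only so the example runs on its own.

```shell
# Hypothetical helper: count the External-Identifier lines in a bag's bag-info.txt.
# $1 is the path to the unpacked bag directory. Matching is case-insensitive.
count_external_identifiers() {
  grep -ci '^External-Identifier:' "$1/bag-info.txt"
}

# Build a throwaway bag-info.txt so the check can be tried without a real bag.
demo_bag=$(mktemp -d)
printf 'External-Identifier: test_bag\nBagging-Date: 2020-01-01\n' > "$demo_bag/bag-info.txt"

count=$(count_external_identifiers "$demo_bag")
echo "External-Identifier lines: $count"

# Clean up the throwaway directory.
rm -rf "$demo_bag"
```

For a real bag, pass the bag directory instead of the throwaway one; any count other than 1 means the bag does not meet the requirement above.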

You need to know:

  • The API URL for your storage service instance

  • The token URL for your storage service instance

  • A client ID and secret for the storage service

  • An upload bucket for the storage service

You need to choose:

  • A storage space. A space is an identifier that groups bags with similar content, e.g. digitised or born-digital.
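
The commands in this guide read these values from shell variables. A minimal sketch of setting them up follows; every value is an illustrative placeholder (the URLs, bucket name, key and credentials are assumptions, not real endpoints), so substitute the details of your own storage service instance.

```shell
# Placeholder values only: replace each one with the details of your own instance.
API_URL="https://api.example.org/storage/v1"       # hypothetical API URL
TOKEN_URL="https://auth.example.org/oauth2/token"  # hypothetical token URL
CLIENT_ID="my-client-id"                           # placeholder credential
CLIENT_SECRET="my-client-secret"                   # placeholder credential
UPLOADS_BUCKET="example-storage-uploads"           # hypothetical uploads bucket

# Details of the bag being stored.
BAG_DIRECTORY="./test_bag"             # assumed path to the unpacked bag
UPLOADED_BAG_KEY="test_bag.tar.gz"     # assumed key for the upload in the bucket
SPACE="digitised"                      # the storage space you chose
EXTERNAL_IDENTIFIER="test_bag"         # must match the bag's External-Identifier
INGEST_TYPE="create"                   # "create" for a first bag, "update" otherwise
```

Keeping the client secret out of shell history (for example by reading it from a secrets manager) is a sensible precaution, although how you do that depends on your environment.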

To store a bag in the storage service:

  1. If not already compressed, tar-gzip compress your BagIt bag:

    tar -czf bag.tar.gz "$BAG_DIRECTORY"
  2. Upload the bag to your uploads bucket:

    aws s3 cp "bag.tar.gz" "s3://$UPLOADS_BUCKET/$UPLOADED_BAG_KEY"
  3. Fetch an access token using the OAuth2 client credentials grant:

    curl -X POST "$TOKEN_URL" \
      --data grant_type=client_credentials \
      --data client_id="$CLIENT_ID" \
      --data client_secret="$CLIENT_SECRET"

    This will return a response like:

    {"access_token":"eyJraWQi...","expires_in":3600,"token_type":"Bearer"}

    Remember the access_token; the API requests in the following steps send it in the Authorization header.

  4. Send a POST request to the /ingests API to create an ingest. This asks the storage service to store your bag.

    If this is the first bag with this (space, external identifier) pair, use ingest type "create". If there is already a bag with this (space, external identifier) pair, use ingest type "update".

    curl -X POST "$API_URL/ingests" \
      --header "Authorization: $ACCESS_TOKEN" \
      --header "Content-Type: application/json" \
      --data "{
        \"type\": \"Ingest\",
        \"ingestType\": {\"id\": \"$INGEST_TYPE\", \"type\": \"IngestType\"},
        \"space\": {\"id\": \"$SPACE\", \"type\": \"Space\"},
        \"sourceLocation\": {
          \"provider\": {\"id\": \"amazon-s3\", \"type\": \"Provider\"},
          \"bucket\": \"$UPLOADS_BUCKET\",
          \"path\": \"$UPLOADED_BAG_KEY\",
          \"type\": \"Location\"
        },
        \"bag\": {
          \"info\": {
            \"externalIdentifier\": \"$EXTERNAL_IDENTIFIER\",
            \"type\": \"BagInfo\"
          },
          \"type\": \"Bag\"
        }
      }"

    This returns a response like:

    {"id":"ffd3c8a3-9021-47bc-a68c-75eeaff1d4bd", ...}

    Remember the id -- this is the ingest ID.

  5. Use the ingest ID to query the state of the ingest:

    curl "$API_URL/ingests/$INGEST_ID" \
      --header "Authorization: $ACCESS_TOKEN"

    This returns a description of the ingest, including its current state.

    You can poll this API repeatedly to see the state of your ingest as it moves through the storage service.
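
The steps above ask you to remember the access_token and the ingest id. In a script, those values can be captured from the JSON responses instead. The following is a minimal sketch, not part of the storage service itself: json_field is a hypothetical helper, it assumes python3 is on the PATH (jq -r '.id' would do the same job), and the two sample responses are shortened stand-ins so the block runs without network access.

```shell
# Hypothetical helper: print one top-level field from a JSON document on stdin.
# $1 is the field name. Assumes python3 is available.
json_field() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)[sys.argv[1]])' "$1"
}

# Stand-in for the token response; a real one comes from the POST to $TOKEN_URL.
token_response='{"access_token":"eyJraWQi...","expires_in":3600,"token_type":"Bearer"}'
ACCESS_TOKEN=$(printf '%s' "$token_response" | json_field access_token)

# Stand-in for the response to POST /ingests; real responses carry more fields.
ingest_response='{"id":"ffd3c8a3-9021-47bc-a68c-75eeaff1d4bd","type":"Ingest"}'
INGEST_ID=$(printf '%s' "$ingest_response" | json_field id)

echo "access token: $ACCESS_TOKEN"
echo "ingest ID: $INGEST_ID"
```

Against a live service, pipe the curl output straight into the helper, for example by appending `| json_field access_token` to the token request and adding `-s` to curl so progress output stays out of the way.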
