
Deleting files or bags from the storage service


Last updated 1 year ago

It's rare, but sometimes we do need to delete all copies of a file or an entire bag from the storage service. This page contains instructions for doing so.

Examples:

  • The Collections & Research team have asked us to delete some material

  • A bag was ingested under the wrong identifier; we've reingested it under the correct identifier and now we want to remove the incorrectly-labelled bag

This page contains instructions for deleting an entire bag.

If you want to delete or modify a file in an existing bag, you need to delete the stored bag, then re-ingest a modified version. In this case, it may be helpful to include a CHANGELOG.md file and/or add the bag to the list of awkward files and bags.

1. Get contributor access to our Azure replicas

Open a Service Desk ticket asking for contributor access to our Azure replicas with your c_ cloud account.

Our bags are stored across a mixture of S3 and Azure. Your c_ cloud AWS access should give you access to the S3 buckets, but you shouldn't have write permissions in the Azure replica. By default, nobody has these permissions -- this is a deliberate choice, to prevent accidental deletions.

Even if you're only deleting bags in prod, ask for access to our staging replica as well -- this will allow you to test the procedure on non-essential material first.

Here's a template for the Service Desk ticket (fill in the details):

Temporary write access to the wecostorage{prod,stage} Azure storage account

We keep three copies of every file in Wellcome Collection's digital collections: two copies in Amazon S3, one copy in Azure Blob. The Azure copy lives in the wecostorageprod Azure storage account.

By default, nobody has write/delete access to all three copies – this is by design, to prevent somebody inadvertently deleting part of the collections. Our storage service has write-only access, so it can store new material, but it can't delete existing material.

We need to [explanation], and for this I need to be able to delete the copies we keep in Azure.

Please give my c_ cloud account write access to the wecostorage{prod,stage} Azure storage account, so that I can remove these files. This is usually done by assigning the "Contributor" role to the c_ cloud account. Once this is done, I'll file a second request to downgrade my permissions again.

If you want approval, contact [name] – she'll confirm that we want to delete all copies of a particular set of images.

2. Install the Azure CLI and log in to your c_ cloud account with az login

This checks that everything is working. You can install the Azure CLI on macOS using Homebrew or pip.
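As a quick sanity check before going further, here's a minimal sketch that confirms the Azure CLI is on your PATH and that you have an active login. It assumes `az account show` exits with a non-zero status when you aren't logged in; the function name is made up for illustration and isn't part of the storage service repo.

```python
import shutil
import subprocess


def azure_cli_logged_in(az_binary: str = "az") -> bool:
    """Return True if the Azure CLI is installed and has an active login."""
    az_path = shutil.which(az_binary)
    if az_path is None:
        # The CLI isn't installed (or isn't on PATH).
        return False

    # Assumption: `az account show` fails when there is no logged-in account.
    result = subprocess.run(
        [az_path, "account", "show"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    if azure_cli_logged_in():
        print("Azure CLI is installed and logged in")
    else:
        print("Install the Azure CLI and run `az login` first")
```

If this reports a problem, run az login again before you continue to the next step.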

3. Identify the bag you want to delete, or the bag that contains the files you want to delete

This includes identifying:

  • the environment (prod/staging)

  • the space

  • the external identifier

  • the version

You can only delete the latest version of a bag. If you want to delete every version of a bag, you'll have to delete them one-by-one, working backwards from the latest version. This is a by-product of the way versioning works.
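To make the "working backwards" order concrete, here's a small sketch that lists the versions in the order you'd have to delete them. It assumes versions are numbered v1, v2, v3 and so on with no gaps, which may not hold for every bag; the helper name is hypothetical and not part of the storage service repo.

```python
from typing import List


def versions_to_delete(latest: str) -> List[str]:
    """Return every version from `latest` down to v1, newest first.

    Assumption: versions are contiguous (v1, v2, ..., latest).
    """
    if not latest.startswith("v") or not latest[1:].isdigit():
        raise ValueError(f"Expected a version like 'v3', got {latest!r}")

    latest_number = int(latest[1:])
    if latest_number < 1:
        raise ValueError(f"Versions start at v1, got {latest!r}")

    # Count down so the newest version is always deleted first.
    return [f"v{n}" for n in range(latest_number, 0, -1)]


if __name__ == "__main__":
    for version in versions_to_delete("v3"):
        print(version)
```

Check the real list of versions for your bag before relying on this order, because the sketch doesn't look anything up in the storage service.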

3a. How to identify the bag

To confirm you have the right details for the bag, you can either:

  • Look in S3 and examine the corresponding bag-info.txt to compare the description with the original deletion request.

  • Run ss_get_bag.py with those details. Compare the value of info.externalDescription with your expectations.
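If you go down the bag-info.txt route, the check can be sketched as below. This assumes the file follows the usual BagIt "Label: value" layout, with indented lines continuing the previous value, and that the description is stored under External-Description. The sample text is made up for illustration, and the parser keeps only the last value if a label is repeated.

```python
from typing import Dict


def parse_bag_info(text: str) -> Dict[str, str]:
    """Parse the contents of a bag-info.txt file into a dict of label -> value."""
    fields: Dict[str, str] = {}
    last_label = None

    for line in text.splitlines():
        if not line.strip():
            continue

        # Indented lines continue the value of the previous label.
        if line[0] in " \t" and last_label is not None:
            fields[last_label] += " " + line.strip()
            continue

        label, separator, value = line.partition(":")
        if not separator:
            continue  # not a "Label: value" line, so skip it

        last_label = label.strip()
        fields[last_label] = value.strip()

    return fields


if __name__ == "__main__":
    # Made-up sample; a real file would be downloaded from the S3 replica.
    sample = (
        "External-Identifier: archivematica-dev/TEST/1\n"
        "External-Description: A test bag\n"
        "  used for the deletion walkthrough\n"
    )
    info = parse_bag_info(sample)
    print(info["External-Description"])
```

Compare the printed description by eye with the wording of the deletion request; the sketch only extracts the value and doesn't decide whether it matches.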

4. Run the "delete bag" script

There's an ss_delete_bag.py script in the storage service repo.

You pass the details as command-line arguments, for example:

python3 scripts/ss_delete_bag.py \
  --environment staging \
  --space testing \
  --external-identifier archivematica-dev/TEST/1 \
  --version v30

This script will:

  • prompt you to confirm you do want to delete this bag

  • ask for a reason, which is recorded in DynamoDB

  • create a temporary copy of the bag in the wellcomecollection-storage-infra bucket (kept for 30 days)

  • remove the bag from the reporting cluster, delete all of its objects in S3 and all of its blobs in Azure, and remove it from the bags/ingests APIs

Only run one deletion at a time. This is slower, but is necessary because of the way we handle the legal holds on the Azure replica – running two deletions at once may cause a conflict.

It can take a long time for the Azure phase of deletion to complete for a large bag.
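If you have several bags to delete, one way to stick to the "one at a time" rule is a small wrapper that runs each deletion to completion before starting the next. This is a sketch rather than anything in the repo: the function names are hypothetical, it uses only the command-line flags shown above, and it assumes you run it from the root of the storage service repo. The delete script still prompts for confirmation and a reason on each run.

```python
import subprocess
from typing import Dict, List


def build_delete_command(
    environment: str, space: str, external_identifier: str, version: str
) -> List[str]:
    """Build the ss_delete_bag.py invocation for a single bag version."""
    return [
        "python3", "scripts/ss_delete_bag.py",
        "--environment", environment,
        "--space", space,
        "--external-identifier", external_identifier,
        "--version", version,
    ]


def delete_one_at_a_time(
    bags: List[Dict[str, str]], dry_run: bool = True
) -> List[List[str]]:
    """Run the deletions strictly in sequence, never in parallel."""
    commands = [build_delete_command(**bag) for bag in bags]

    for command in commands:
        if dry_run:
            # Only show what would be run.
            print(" ".join(command))
        else:
            # subprocess.run blocks until the script exits, so the next
            # deletion can't start early; check=True stops on failure.
            subprocess.run(command, check=True)

    return commands


if __name__ == "__main__":
    delete_one_at_a_time(
        [
            {
                "environment": "staging",
                "space": "testing",
                "external_identifier": "archivematica-dev/TEST/1",
                "version": "v30",
            }
        ]
    )
```

The dry-run default means the wrapper prints the commands without deleting anything until you explicitly pass dry_run=False.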

5. Ask D&T to downgrade your permissions to the Azure replica

This returns us to the default "safe" state, where there's nobody with write permissions on all three replicas.

If you need to recover the bag from the temporary copy created in step 4, you can create a tgz file from it and ingest it.