📦
Storage service
  • Introduction
  • How-to: basic operations
    • Ingest a bag into the storage service
    • Look up an already-stored bag in the storage service
    • Look up the versions of a bag in the storage service
  • How to: advanced usage
    • Getting notifications of newly stored bags
  • How to: debugging errors
    • Where to find application logs
    • Manually marking ingests as failed
  • Reference/design decisison
    • The semantics of bags, ingests and ingest types
    • How identifiers work in the storage service
    • How files are laid out in the underlying storage
    • Compressed vs uncompressed bags, and the choice of tar.gz
  • Developer information/workflow
    • An API reference for the user-facing storage service APIs
    • Key technologies
    • Inter-app messaging with SQS and SNS
    • How requests are routed from the API to app containers
    • Repository layout
    • How Docker images are published to ECR
  • Wellcome-specific information
    • Our storage configuration
      • Our three replicas: S3, Glacier, and Azure
      • Using multiple storage tiers for cost-efficiency (A/V, TIFFs)
      • Small fluctuations in our storage bill
      • Delete protection on the production storage service
    • Wellcome-specific debugging
      • Why did my callback to Goobi return a 401 Unauthorized?
    • Recovering files from our Azure replica
    • Awkward files and bags
    • Deleting files or bags bags from the storage service
Powered by GitBook
On this page
  • b19974760: Chemist and Druggist
  • SATSY/1821/4: filename that ends with a dot
  • b13135934: missed the initial Azure verification
  • miro/V0047000: a single image was removed
  1. Wellcome-specific information

Awkward files and bags

PreviousRecovering files from our Azure replicaNextDeleting files or bags bags from the storage service

Last updated 1 year ago

This is a list of files and bags that are known to have caused issues, and are recorded here in case they cause problems in future.

b19974760: Chemist and Druggist

This bag contains the digitised volumes of the journal , which amounts to nearly a million files. This is substantially larger than any other bag in the storage service.

Because it's so much larger than any other bag, it's quite difficult to store. When we originally stored this bag, we had to tweak queue timeouts and retries to get it working. Additionally, there's a hard-coded exception in the bag register for returning the associated storage manifest, because it's such a large JSON object to deserialise.

Now it's stored it should be fine, but the sheer number of files may be a problem the next time it's processes.

SATSY/1821/4: filename that ends with a dot

This bag contains a file whose name ends in a dot (.).

The current bag verifier will prevent you storing such a file, because a trailing dot isn't allowed in . When we originally stored this file, we were only storing objects in S3, so we didn't have this restriction.

The object has been replicated to both S3 and Azure, but under slightly different names. The name in S3 ends with a dot; the name in Azure omits the dot.

Now it's stored it should be fine, but it's notable as the one known discrepancy between the S3 and Azure replicas.

b13135934: missed the initial Azure verification

This is a bag which missed its initial Azure verification, and then some of the blobs got cycled to the Archive tier.

It was originally ingested in March 2022, but Azure replication failed for an unknown reason. At least some of the files were replicated, but the bag didn't store successfully. (We didn't notice for months, so the application logs had all rotated before we could debug.)

When we attempted to complete the ingest in January 2023, the replication completed, but now Azure verification failed -- anything replicated in March 2022 had been cycled to Azure's archive tier, and so they were inaccessible to the verifier.

Although I was able to rehydrate the blobs, the verifier still got errors from Azure:

com.azure.storage.blob.models.BlobStorageException: Status code 403, (empty body)

I could download the blobs to an EC2 instance using the Azure CLI, and I verified the bag by hand – so I sent a message "Azure verification complete" to the replica aggregator to complete the ingest and register the bag.

I didn't have time to debug why the verifier can't read blobs which have been rehydrated from the archive tier, but since we can retrieve them through the Azure CLI I don't think it's a major issue.

miro/V0047000: a single image was removed

Bags in the Miro space contain files from Wellcome Images. They were processed using Archivematica. In July 2023, Collections asked us to permanently remove a single image from the Miro data. This image was present in these two bags:

  • miro/library/V Images/V0047000 – as a JPEG2000

  • miro/library/jpg/V0047000_2 – as a JPEG

We removed the image from these bags by:

  1. Deleting the original bag

  2. Uploading a replacement bag with this image removed

This means the bag metadata is correct, but the Archivematica metadata is wrong – in particular the Archivematica METS file refers to an image which is no longer present.

This approach was chosen for simplicity – reprocessing the bag using Archivematica would have been much more complicated and would risk losing metadata.

For details on which image was removed and why, see the CHANGELOG.md file which has been added to both bags.

Chemist + Druggist
Azure blob names