These services will need to provide assets in the BagIt format, compressed and uploaded to an S3 bucket. They should then call an ingest API and provide a callback URL that will be notified when the ingest has succeeded or failed.
When there is a distinction between archival and access assets, these should be submitted as separate bags. This allows storing archival assets and access assets in different kinds of storage.
Storage
Two copies of every bag will be stored in S3, one using the Glacier storage class and the other using the Infrequent Access storage class. A copy of every bag will also be stored in Azure Blob Storage using the Archive storage class.
Bags will be versioned in storage and all previous versions will be kept indefinitely. We will adopt a forward delta versioning model, where files in more recent versions of bags can refer to files in earlier versions.
In conjunction with workflow systems that provide only changed files, this model will enable us to reduce our storage costs and the amount of unnecessary reprocessing of unchanged files.
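To illustrate the forward delta model, the following minimal sketch records, for each version, the files it supplied and the earlier versions it refers to for everything else. The `BagVersion` structure and the `locate_files` and `stored_bytes` helpers are hypothetical names used only for this illustration; they are not part of the storage service.

```python
from dataclasses import dataclass, field


@dataclass
class BagVersion:
    """One stored version of a bag (hypothetical structure for illustration)."""
    version: int
    # Files physically supplied with this version: bag path -> size in bytes.
    supplied: dict[str, int] = field(default_factory=dict)
    # Files not resupplied: bag path -> earlier version that holds them.
    references: dict[str, int] = field(default_factory=dict)


def locate_files(bag: BagVersion) -> dict[str, int]:
    """Map every file in a bag version to the version that physically stores it."""
    locations = {path: bag.version for path in bag.supplied}
    for path, earlier in bag.references.items():
        if earlier >= bag.version:
            raise ValueError(f"{path} must refer to an earlier version, got v{earlier}")
        locations[path] = earlier
    return locations


def stored_bytes(versions: list[BagVersion]) -> int:
    """Total bytes stored across versions; referenced files are not counted again."""
    return sum(sum(v.supplied.values()) for v in versions)
```

In this sketch a second version that resupplies only a changed METS file adds just that file's size to the total, which is where the storage saving comes from.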
Locations
The storage service will use two AWS S3 buckets and one Azure Blob Storage container:
Warm primary storage: AWS S3 IA, Dublin
Cold replica storage, same provider: AWS S3 Glacier, Dublin
Cold replica storage, different provider: Azure Blob Storage Archive, Netherlands
Within each location, assets will be grouped into related spaces of content and identified by source identifier, e.g.:
/digitised/b0000000/{bag contents}
/born-digital/0000-0000-0000-0000/{bag contents}
Assets
Assets will be stored in the above spaces inside the BagIt bags that were transferred for ingest. Unlike during transfer, bags will be stored uncompressed. BagIt is a standard archival file format: https://tools.ietf.org/html/rfc8493
The BagIt specification is organized around the notion of a “bag”. A bag is a named file system directory that minimally contains:
a “data” directory that includes the payload, or data files that comprise the digital content being preserved. Files can also be placed in subdirectories, but empty directories are not supported
at least one manifest file that itemizes the filenames present in the “data” directory, as well as their checksums. The particular checksum algorithm is included as part of the manifest filename. For instance a manifest file with MD5 checksums is named “manifest-md5.txt”
a “bagit.txt” file that identifies the directory as a bag, the version of the BagIt specification that it adheres to, and the character encoding used for tag files
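The minimal layout described above can be checked mechanically. The sketch below is one possible check, not the storage service's own verifier: `check_minimal_bag` is a hypothetical helper that looks for the three required parts and recomputes the checksums listed in each manifest. It assumes a complete bag; files listed in a fetch.txt would be reported as missing.

```python
import hashlib
from pathlib import Path


def check_minimal_bag(bag_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the minimal layout is present."""
    problems = []
    if not (bag_dir / "data").is_dir():
        problems.append("missing data/ directory")
    if not (bag_dir / "bagit.txt").is_file():
        problems.append("missing bagit.txt")
    manifests = sorted(bag_dir.glob("manifest-*.txt"))
    if not manifests:
        problems.append("missing manifest-<algorithm>.txt")
    for manifest in manifests:
        # The algorithm is encoded in the filename, e.g. manifest-sha256.txt.
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            expected, path = line.split(maxsplit=1)
            target = bag_dir / path
            if not target.is_file():
                problems.append(f"{path}: listed in {manifest.name} but not found")
                continue
            actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
            if actual != expected.lower():
                problems.append(f"{path}: {algorithm} checksum mismatch")
    return problems
```

The sketch reads each file fully into memory, which is fine for illustration but would need streaming for large image files. It also does not flag payload files that are absent from the manifest.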
Any additional preservation formats created during the ingest workflow will be treated in the same way as any other asset and stored alongside the original files. Workflow systems are expected to record the link between original and derivative assets in the METS files provided as part of the bag.
Bag description
The bag description created by the storage service provides a pointer to the stored bag and enough other metadata to provide a consumer with a comprehensive view of the contents of the bag. It is defined using types from a new Storage ontology and serialised using JSON-LD. We will use this to provide resources that describe stored bags, as part of the authenticated storage API.
This description does not contain metadata from the METS files within a bag; it is purely a storage-level index. It will contain data from the bag-info.txt file and information about where the assets have been stored. METS files will be separately ingested in the catalogue and reporting pipelines.
Onward processing
The Versioned Hybrid Store, which holds the bag descriptions, provides an event stream of updates.
This event stream can be used to trigger downstream tasks, for example:
Sending a file for processing in our catalogue pipeline
Feeding other indexes (e.g. Elasticsearch) for reporting
The Versioned Hybrid Store also includes the ability to "reindex" the entire data store. This triggers an update event for every item in the data store, allowing you to re-run a downstream pipeline.
API
The storage service will provide an API that can be used to ingest bags and retrieve information about stored bags. This API will be available publicly, but will require authentication using OAuth. Only trusted applications will be granted access to this API.
API base path: https://api.wellcomecollection.org/storage/v1
Authentication
All API endpoints must require authentication using OAuth 2.0. In the first instance, the only supported OAuth grant type will be client credentials.
Clients must first request a time-limited token from the auth service, using a client ID and secret that we will provide:
POST /oauth2/token
Host: auth.wellcomecollection.org

grant_type=client_credentials&client_id=xxxxxxxxxx&client_secret=xxxxxxxxxx
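As a sketch of how a client might issue the token request above using only the Python standard library: the `https` scheme, the form-encoded `Content-Type` header and the `access_token` field in the JSON response are assumptions based on common OAuth 2.0 practice (RFC 6749), not details confirmed by this document. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Host and path come from the request above; the https scheme is assumed.
TOKEN_URL = "https://auth.wellcomecollection.org/oauth2/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        # Assumed: a form-encoded body, as is usual for OAuth 2.0 token endpoints.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_access_token(client_id: str, client_secret: str) -> str:
    """Send the request and return the token (assumes an RFC 6749 style JSON response)."""
    with urllib.request.urlopen(build_token_request(client_id, client_secret)) as response:
        return json.load(response)["access_token"]
```

Separating request construction from sending keeps the first function easy to inspect without network access.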
Check that the supplied version matches the current version
Unpack the supplied bag
Store the supplied bag as a new version
Register the new version of the bag as the current version
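The four steps above can be sketched with a toy in-memory store. This is an illustration under stated assumptions rather than the service's implementation: `InMemoryBagStore` and `VersionConflict` are hypothetical names, versions are modelled as integers starting at 1 (with 0 meaning "not yet stored"), and unpacking is skipped by passing the bag in as an already unpacked mapping of paths to bytes.

```python
class VersionConflict(Exception):
    """Raised when the caller's view of the current version is stale."""


class InMemoryBagStore:
    """Toy stand-in for the storage service; keeps every version of every bag."""

    def __init__(self) -> None:
        # (space, external_id) -> list of versions, each a dict of path -> bytes
        self._versions: dict[tuple[str, str], list[dict[str, bytes]]] = {}

    def current_version(self, space: str, external_id: str) -> int:
        """Number of the current version, or 0 if the bag has never been stored."""
        return len(self._versions.get((space, external_id), []))

    def update(self, space: str, external_id: str,
               supplied_version: int, files: dict[str, bytes]) -> int:
        # 1. Check that the supplied version matches the current version.
        current = self.current_version(space, external_id)
        if supplied_version != current:
            raise VersionConflict(f"supplied v{supplied_version}, current is v{current}")
        # 2. Unpacking is out of scope here: `files` is the already unpacked bag.
        # 3. Store the supplied bag as a new version; earlier versions are kept.
        history = self._versions.setdefault((space, external_id), [])
        history.append(dict(files))
        # 4. The newest entry in the history is now the current version.
        return len(history)
```

Rejecting an update whose supplied version is stale prevents two clients from both believing they created the same new version.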
Partial updates, where files that are not changed are not resupplied, are supported through the use of fetch.txt in the supplied bag. File references must specify the full storage location of the previously supplied file; the path must include the version number of the bag in which the file was last supplied.
Updates with fetch files should be processed as follows:
Check that files in fetch.txt reference files in the correct bag
Check that files in fetch.txt exist most recently at the specified version
Process as for a complete update
An example of a bag that uses fetch.txt for updating digitised content is provided later in this document.
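The two fetch.txt checks could look roughly like the sketch below. It assumes storage locations of the form `/{space}/{externalId}/v{N}/{path}`, as in the fetch.txt example in this document, and that the service can supply a mapping from each bag path to the version in which it was last supplied. The names `parse_fetch`, `check_fetch` and `latest_versions` are hypothetical, and the bucket name is not checked.

```python
import re
from urllib.parse import urlparse

# Assumed location layout: /{space}/{externalId}/v{N}/{path within the bag}
FETCH_PATH = re.compile(
    r"^/(?P<space>[^/]+)/(?P<external_id>[^/]+)/v(?P<version>\d+)/(?P<path>.+)$"
)


def parse_fetch(text: str) -> list[tuple[str, str]]:
    """Parse fetch.txt into (url, bag path) pairs; the length column is ignored."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            url, _length, path = line.split(maxsplit=2)
            entries.append((url, path))
    return entries


def check_fetch(text: str, space: str, external_id: str,
                latest_versions: dict[str, int]) -> list[str]:
    """`latest_versions` maps each bag path to the version in which it was last supplied."""
    problems = []
    for url, path in parse_fetch(text):
        match = FETCH_PATH.match(urlparse(url).path)
        if match is None or match["path"] != path:
            problems.append(f"{path}: unrecognised storage location {url}")
        elif (match["space"], match["external_id"]) != (space, external_id):
            problems.append(f"{path}: refers to a different bag")
        elif latest_versions.get(path) != int(match["version"]):
            problems.append(f"{path}: not most recently supplied in v{match['version']}")
    return problems
```

If the returned list is empty, the update can continue as for a complete update.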
Bags
Request:
GET /bags/{spaceId}/{externalId}[?version={version}]
GET /bags/{spaceId}/{externalId}/versions[?before={version}]
Response:
See examples below
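For clients, the two request forms above translate into URLs under the API base path. The helpers below are a hypothetical sketch; the format of the version value (shown here as `v2`) and the percent-encoding of identifiers are assumptions, since this document does not specify them.

```python
from typing import Optional
from urllib.parse import quote, urlencode

API_BASE = "https://api.wellcomecollection.org/storage/v1"


def _bag_path(space_id: str, external_id: str) -> str:
    # Assumed: identifiers are percent-encoded as single path segments.
    return f"{API_BASE}/bags/{quote(space_id, safe='')}/{quote(external_id, safe='')}"


def bag_url(space_id: str, external_id: str, version: Optional[str] = None) -> str:
    """URL for a bag, optionally pinned to one version."""
    url = _bag_path(space_id, external_id)
    return url if version is None else f"{url}?{urlencode({'version': version})}"


def bag_versions_url(space_id: str, external_id: str, before: Optional[str] = None) -> str:
    """URL for the list of versions of a bag, optionally only those before a version."""
    url = _bag_path(space_id, external_id) + "/versions"
    return url if before is None else f"{url}?{urlencode({'before': before})}"
```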
Examples
Digitised content
Digitised content will be ingested using Goobi, which should provide the bag layout defined below.
Complete bag
b24923333/
|-- data
| |-- b24923333.xml // mets "anchor" file for multiple manifestation
| [|-- b24923333_001.xml // mets file for vol 1]
| [|-- b24923333_002.xml // mets file for vol 2]
| \-- objects
| [\-- b24923333_001_001.jp2 // first image for vol 1]
| [\-- b24923333_001_002.jp2 // second image for vol 1]
| ...
| [\-- b24923333_002_001.jp2 // first image for vol 2]
| [\-- b24923333_002_002.jp2 // second image for vol 2]
| ...
| \-- alto
| [\-- b24923333_001_001.xml // text for image 1 vol 1]
| [\-- b24923333_001_002.xml // text for image 2 vol 1]
| ...
| [\-- b24923333_002_001.xml // text for image 1 vol 2]
| [\-- b24923333_002_002.xml // text for image 2 vol 2]
| ...
|-- manifest-sha256.txt
| a20eee40d609a0abeaf126bc7d50364921cc42ffacee3bf20b8d1c9b9c425d6f data/b24923333.xml
| e68c93a5170837420f63420bd626650b2e665434e520c4a619bf8f630bf56a7e data/objects/b24923333_001.jp2
| 17c0147413b0ba8099b000fc91f8bc4e67ce4f7d69fb5c2be632dfedb84aa502 data/alto/b24923333_001.xml
| ...
|-- tagmanifest-sha256.txt
| 791ea5eb5503f636b842cb1b1ac2bb578618d4e85d7b6716b4b496ded45cd44e manifest-sha256.txt
| 13f83db60db65c72bf5077662bca91ed7f69405b86e5be4824bb94ca439d56e7 bag-info.txt
| a39e0c061a400a5488b57a81d877c3aff36d9edd8d811d66060f45f39bf76d37 bagit.txt
|-- bag-info.txt
| Source-Organization: Intranda GmbH
| Contact-Email: support@intranda.com
| External-Description: A account of a voyage to New South Wales // title
| Bagging-Date: 2016-08-07
| External-Identifier: b24923333 // b number
| Payload-Oxum: 435255.8
| Internal-Sender-Identifier: 170131 // goobi process id
| Internal-Sender-Description: 12324_b_b24923333 // goobi process title
\-- bagit.txt
BagIt-Version: 0.97
Tag-File-Character-Encoding: UTF-8
Partial bag
Note that all files must be present in the manifest, and only files that are not supplied should be present in fetch.txt.
b24923333/
|-- data
| |-- b24923333.xml
|-- fetch.txt
| s3://wellcomecollection-storage-access/digitised/b24923333/v1/data/objects/b24923333_001.jp2 - data/objects/b24923333_001.jp2
| s3://wellcomecollection-storage-access/digitised/b24923333/v2/data/alto/b24923333_001.xml - data/alto/b24923333_001.xml
| ...
|-- manifest-sha256.txt
| a20eee40d609a0abeaf126bc7d50364921cc42ffacee3bf20b8d1c9b9c425d6f data/b24923333.xml
| e68c93a5170837420f63420bd626650b2e665434e520c4a619bf8f630bf56a7e data/objects/b24923333_001.jp2
| 17c0147413b0ba8099b000fc91f8bc4e67ce4f7d69fb5c2be632dfedb84aa502 data/alto/b24923333_001.xml
| ...
|-- tagmanifest-sha256.txt
| 791ea5eb5503f636b842cb1b1ac2bb578618d4e85d7b6716b4b496ded45cd44e manifest-sha256.txt
| 13f83db60db65c72bf5077662bca91ed7f69405b86e5be4824bb94ca439d56e7 bag-info.txt
| bf5077662bca91ed7f69401d877cx3agf318d4e85d7b6716b4b496ded45cd44e fetch.txt
| a39e0c061a400a5488b57a81d877c3aff36d9edd8d811d66060f45f39bf76d37 bagit.txt
|-- bag-info.txt
| Source-Organization: Intranda GmbH
| Contact-Email: support@intranda.com
| External-Description: A account of a voyage to New South Wales // title
| Bagging-Date: 2016-08-07
| External-Identifier: b24923333 // b number
| Payload-Oxum: 435255.8
| Internal-Sender-Identifier: 170131 // goobi process id
| Internal-Sender-Description: 12324_b_b24923333 // goobi process title
\-- bagit.txt
BagIt-Version: 0.97
Tag-File-Character-Encoding: UTF-8
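The rule for partial bags (every file is in the manifest, and only files that are not supplied appear in fetch.txt) can be expressed as a small set comparison. `check_partial_bag` is a hypothetical helper for illustration; it works on lists of bag paths and leaves checksum verification to a separate step.

```python
def check_partial_bag(manifest_paths: list[str], supplied_paths: list[str],
                      fetch_paths: list[str]) -> list[str]:
    """Compare the manifest with the supplied payload and the fetch.txt entries."""
    manifest = set(manifest_paths)
    supplied = set(supplied_paths)
    fetched = set(fetch_paths)
    problems = []
    # Every file, whether supplied or fetched, must be listed in the manifest.
    for path in sorted((supplied | fetched) - manifest):
        problems.append(f"{path}: not listed in the manifest")
    # Every manifest entry must be accounted for one way or the other.
    for path in sorted(manifest - supplied - fetched):
        problems.append(f"{path}: in the manifest but neither supplied nor in fetch.txt")
    # fetch.txt is only for files that were not supplied.
    for path in sorted(supplied & fetched):
        problems.append(f"{path}: supplied, so it should not appear in fetch.txt")
    return problems
```

Applied to the partial bag above, the manifest lists three paths, one of which is supplied and two of which are fetched, so the function returns an empty list.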
METS
The existing METS structure should be changed to reflect the following. The main change is removing data from Preservica and replacing it with PREMIS object metadata.
<?xml version='1.0' encoding='utf-8'?>
<mets:mets xmlns:dv="http://dfg-viewer.de/" xmlns:mets="http://www.loc.gov/METS/" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:premis="http://www.loc.gov/premis/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd http://www.loc.gov/standards/premis/ http://www.loc.gov/standards/premis/v2/premis-v2-0.xsd http://www.loc.gov/standards/mix/ http://www.loc.gov/standards/mix/mix.xsd">
  <mets:metsHdr CREATEDATE="2016-01-06T07:36:48">
    <mets:agent OTHERTYPE="SOFTWARE" ROLE="CREATOR" TYPE="OTHER">
      <mets:name>Goobi - ugh-1.10-ugh-2.0.0-18-g99df876 - 21-May-2015</mets:name>
      <mets:note>Goobi</mets:note>
    </mets:agent>
  </mets:metsHdr>
  <mets:dmdSec ID="DMDLOG_0000">
    <mets:mdWrap MDTYPE="MODS"><!-- no change --></mets:mdWrap>
  </mets:dmdSec>
  <mets:dmdSec ID="DMDPHYS_0000"><!-- no change --></mets:dmdSec>
  <mets:amdSec ID="AMD">
    <!-- remove techMD for deliverable unit, so first file is now AMD_0001 -->
    <mets:techMD ID="AMD_0001">
      <mets:mdWrap MDTYPE="OTHER" MIMETYPE="text/xml">
        <mets:xmlData>
          <!-- replace Preservica data with PREMIS object as below -->
          <premis:object version="3.0" xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd" xsi:type="premis:file">
            <premis:objectIdentifier>
              <premis:objectIdentifierType>local</premis:objectIdentifierType>
              <premis:objectIdentifierValue>b24923333_0001.jp2</premis:objectIdentifierValue>
            </premis:objectIdentifier>
            <premis:significantProperties>
              <premis:significantPropertiesType>ImageHeight</premis:significantPropertiesType>
              <premis:significantPropertiesValue>4378</premis:significantPropertiesValue>
            </premis:significantProperties>
            <premis:significantProperties>
              <premis:significantPropertiesType>ImageWidth</premis:significantPropertiesType>
              <premis:significantPropertiesValue>2816</premis:significantPropertiesValue>
            </premis:significantProperties>
            <premis:objectCharacteristics>
              <premis:compositionLevel />
              <premis:fixity>
                <premis:messageDigestAlgorithm>SHA-256</premis:messageDigestAlgorithm>
                <premis:messageDigest>0adcae8b53ba8af8d6fef0c1517ef822f0d0c3a7</premis:messageDigest>
              </premis:fixity>
              <premis:size>310448</premis:size>
              <premis:format>
                <premis:formatDesignation>
                  <premis:formatName>JP2 (JPEG 2000 part 1)</premis:formatName>
                </premis:formatDesignation>
                <premis:formatRegistry>
                  <premis:formatRegistryName>PRONOM</premis:formatRegistryName>
                  <premis:formatRegistryKey>x-fmt/392</premis:formatRegistryKey>
                </premis:formatRegistry>
              </premis:format>
            </premis:objectCharacteristics>
          </premis:object>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
    <mets:rightsMD ID="RIGHTS"><!-- no change --></mets:rightsMD>
    <mets:digiprovMD ID="DIGIPROV"><!-- no change --></mets:digiprovMD>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp USE="OBJECTS"><!-- change USE from SDB to OBJECTS -->
      <mets:file ID="FILE_0001_OBJECTS" MIMETYPE="image/jp2"><!-- change SDB suffix to OBJECTS -->
        <mets:FLocat LOCTYPE="URL" xlink:href="objects/b22454408_0001.jp2" /><!-- remove CHECKSUM -->
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="LOGICAL"><!-- no change --></mets:structMap>
  <mets:structMap TYPE="PHYSICAL"><!-- no change other than reflecting new IDs --></mets:structMap>
  <mets:structLink><!-- no change --></mets:structLink>
</mets:mets>
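A downstream consumer could read the PREMIS object metadata out of a METS file along the following lines. This is a sketch using the Python standard library and the namespace URIs from the example above; `premis_files` and the keys of the returned dictionaries are hypothetical, and elements that are absent are returned as `None`.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace URIs as declared on the mets:mets element in the example above.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "premis": "http://www.loc.gov/premis/v3",
}


def premis_files(mets_xml: str) -> list[dict[str, Optional[str]]]:
    """Extract identifier, fixity and size for each premis:object in a METS document."""
    root = ET.fromstring(mets_xml)
    files = []
    for obj in root.iterfind(".//premis:object", NS):
        characteristics = "premis:objectCharacteristics/"
        files.append({
            "identifier": obj.findtext(
                "premis:objectIdentifier/premis:objectIdentifierValue", namespaces=NS),
            "algorithm": obj.findtext(
                characteristics + "premis:fixity/premis:messageDigestAlgorithm", namespaces=NS),
            "digest": obj.findtext(
                characteristics + "premis:fixity/premis:messageDigest", namespaces=NS),
            "size": obj.findtext(characteristics + "premis:size", namespaces=NS),
        })
    return files
```

Values are returned as text exactly as they appear in the METS file, so callers that need a numeric size must convert it themselves.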