RFC 078: Identifiers in iiif-builder: beyond the B number
IIIF-Builder (aka DDS) understands various identifier forms (BNumbers, CALM Reference Numbers and Work IDs), and makes processing decisions based on the form of the identifier. For example, if asked to process a b number, it knows the item must have been processed by Goobi, and it must be in the digitised storage service space. These musts will soon no longer be true, and soon there will not even be b numbers.
Last modified: 2025-08-15T17:00+00:00
Context
The original motivation (with added emphasis):
[Collections would like to] ingest born-digital items through Archivematica under bnumbers from Sierra. This would mostly happen for digital art and material that is commissioned or acquired outside of archives. Material outside of archives must be ingested into Sierra as Calm is just for archival material.
Born-digital items under bnumbers should go into the born-digital workflow in Archivematica with some tweaking [...] IIIF Builder would need to pick up the Archivematica METS with bnumbers and handle them in the same way as they do with the Archivematica METS with the Calm ref nos.
Complicating this is the fact that Wellcome should be replacing Calm and Sierra.
Current functionality
IIIF Builder understands what a B Number is, and can validate one using its check digit. It also translates the identifiers that Goobi generates for multiple manifestations during digitisation into IIIF Collection and Manifest IDs. For example:
b24886932 becomes a IIIF Manifest
b33282262 becomes a IIIF Collection, where
b33282262_0008 is a volume of that 13-part collection
b19974760_207 is a volume of Chemist and Druggist, and
b19974760_207_0048 is an issue of Chemist and Druggist (a three-level hierarchical identifier)
To do this it parses incoming identifiers to understand what they are. All of the above start with a b number, but the following born-digital CALM identifiers do not:
MS.9178
SAPHY/Z/3/5/16/16
SAPHY_Z_3_5_16_16
The last of these is the same identifier as the second one, just in a path-safe form that can be used in Dashboard URLs.
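The identifier forms above can be illustrated with a short sketch. It is not the iiif-builder code: the class `IdentifierFormSketch` and all of its method names are hypothetical, the b number shape (the letter b, seven digits, one check character) is inferred from the examples, and the check uses the weighted modulus-11 scheme commonly documented for Sierra record numbers. Whether the real volume and issue parts hold the full identifier or only the suffix is also an assumption.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration only; not the real DdsIdentifier.
public static class IdentifierFormSketch
{
    // Assumed shape: 'b', seven digits, then a check character (0-9 or x).
    public static bool LooksLikeBNumber(string s)
    {
        if (s.Length != 9 || s[0] != 'b') return false;
        for (int i = 1; i <= 7; i++)
        {
            if (s[i] < '0' || s[i] > '9') return false;
        }
        char check = char.ToLowerInvariant(s[8]);
        return (check >= '0' && check <= '9') || check == 'x';
    }

    // Weighted modulus-11 check: weights 8 down to 2 across the seven digits;
    // a remainder of 10 is written as 'x'.
    public static bool HasValidCheckDigit(string bNumber)
    {
        if (!LooksLikeBNumber(bNumber)) return false;
        int sum = 0;
        for (int i = 1; i <= 7; i++)
        {
            sum += (bNumber[i] - '0') * (9 - i);
        }
        int remainder = sum % 11;
        char expected = remainder == 10 ? 'x' : (char)('0' + remainder);
        return char.ToLowerInvariant(bNumber[8]) == expected;
    }

    // Splits e.g. "b19974760_207_0048" into the b number, the volume
    // identifier and the issue identifier (null when a level is absent).
    public static (string BNumber, string? VolumePart, string? IssuePart) SplitParts(string value)
    {
        string[] segments = value.Split('_');
        string bNumber = segments[0];
        string? volume = segments.Length > 1 ? bNumber + "_" + segments[1] : null;
        string? issue = segments.Length > 2 ? volume + "_" + segments[2] : null;
        return (bNumber, volume, issue);
    }

    // CALM references: '/' in the regular form, '_' in the path-safe form.
    // Assumes the reference itself never contains an underscore.
    public static string ToPathSafe(string calmRef) => calmRef.Replace('/', '_');
    public static string FromPathSafe(string pathSafe) => pathSafe.Replace('_', '/');
}
```

Under these assumptions, b24886932, b33282262 and b19974760 all pass `HasValidCheckDigit`, and `FromPathSafe("SAPHY_Z_3_5_16_16")` gives back `SAPHY/Z/3/5/16/16`.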
Although identifiers always enter iiif-builder as strings (e.g., in API URIs, Dashboard URIs, SQS messages or text files to process), they are parsed into a DdsIdentifier object. The current C# code uses implicit operators, which keeps the code easy to read: an identifier can behave as a string or as the richer DdsIdentifier class as required, without explicit conversion between the two. This is viable because parsing a string is cheap, and at the moment we learn everything we need to know by parsing the identifier with procedural code; we don't need to look up third-party sources of information. DdsIdentifier simply distinguishes between identifiers that have a b number and those that do not; those that do not are assumed to be CALM. DdsIdentifier also pulls out volume and issue parts, and translates CALM identifiers between the path-safe form used in the dashboard and the regular form used everywhere else.
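As a reminder of how implicit operators make this work, here is a minimal sketch. `IdentifierLike` is a hypothetical stand-in, not the real DdsIdentifier, and its `HasBNumber` test is a deliberately crude assumption (a leading b followed by a digit).

```csharp
#nullable enable
using System;

// Hypothetical stand-in for DdsIdentifier, for illustration only.
public sealed class IdentifierLike
{
    public string Value { get; }
    public bool HasBNumber { get; }

    public IdentifierLike(string value)
    {
        Value = value ?? throw new ArgumentNullException(nameof(value));
        // Crude assumption: a b number starts with 'b' followed by a digit.
        HasBNumber = value.Length > 1 && value[0] == 'b' && char.IsDigit(value[1]);
    }

    // string -> IdentifierLike: parsing happens silently on assignment.
    public static implicit operator IdentifierLike(string value) => new IdentifierLike(value);

    // IdentifierLike -> string: the object can be passed wherever a string is expected.
    public static implicit operator string(IdentifierLike id) => id.Value;

    public override string ToString() => Value;
}
```

With this in place, `IdentifierLike id = "b33282262_0008";` and `string s = id;` both compile without a cast. The convenience depends on the conversion being cheap and synchronous, which is the property that stops holding once identity has to be looked up elsewhere.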
The current iiif-builder codebase often has conditional logic like this:
if(identifier.HasBNumber)
{
// do something
}
else
{
// do something else
}
That condition is a proxy for what we really want to know:
What system processed it, Goobi or Archivematica? (and therefore what METS profile does it have?)
Which system is the authority for that identifier (Sierra or CALM)?
Where are its files in the storage service (/digitised/ or /born-digital/)?
Upcoming challenges
At the moment we can parse a string using just the logic in DdsIdentifier and know that we can find a Goobi-generated METS file in the digitised storage location, or an Archivematica-generated METS file in the born-digital storage location. We know this just by looking at the string.
But in future:
Some Archivematica-processed born-digital items may have a b number and NOT have a CALM Reference Number
There won't even be B Numbers when Sierra is replaced by some other Library Management System
which will mean we cannot know METS formats, storage locations or anything else just by looking at the string.
Proposal: Introduce an Identity Service
The iiif-builder codebase will be significantly refactored. DdsIdentifier will be replaced by a new class DdsIdentity which:
Is obtained from a service dependency, rather than parsed from a string:
// old (showing implicit conversion):
DdsIdentifier ddsId = "b33282262_0008";
// new
DdsIdentity ddsId = identityService.GetIdentity("b33282262_0008");
Has properties that directly reflect the things we need to know to process objects, so that we no longer make decisions based on the form of the identifier:
if(ddsId.Generator == Generator.Goobi)
{
// expect a Goobi METS
}
if(ddsId.StorageSpace == StorageSpace.BornDigital)
{
// construct the right S3 key...
}
if(ddsId.Source == Source.Calm)
{
// some archive-specific logic
}
Retains the part-level volume and issue information we need for multiple manifestations, which does not exist in the Catalogue API:
if(ddsId.VolumePart != null)
{
// this is too useful to over-abstract away into `partOf` chains
}
This means that we introduce an IIdentityService interface, which becomes a dependency in many parts of the codebase that could previously rely on automatic conversion between string and DdsIdentifier.
public interface IIdentityService
{
DdsIdentity GetIdentity(string s);
// Later:
//Task<DdsIdentity> GetIdentityAsync(string s);
}
[!IMPORTANT] While we don't yet know how later implementations of this interface will obtain their information when they can no longer parse it out of the identifier string, we have removed this concern from the rest of the iiif-builder codebase and need only worry about a new implementation of IIdentityService for future functionality.
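To make that separation concrete, here is a minimal sketch of a form-based implementation sitting behind the interface. It is not the code in the pull request: the shape of `DdsIdentity`, the class `FormBasedIdentityService`, the helper `StorageKeys`, and the enum members `StorageSpace.Digitised` and `Source.Sierra` are all assumptions made for illustration. Only the property and member names already used in this RFC come from the proposal.

```csharp
#nullable enable
using System;

public enum Generator { Goobi, Archivematica }
public enum StorageSpace { Digitised, BornDigital } // Digitised is an assumed name
public enum Source { Sierra, Calm }                 // Sierra is an assumed name

// Hypothetical shape; the real DdsIdentity may carry more (or different) fields.
public sealed class DdsIdentity
{
    public DdsIdentity(string value, Generator generator, StorageSpace storageSpace,
                       Source source, string? volumePart)
    {
        Value = value;
        Generator = generator;
        StorageSpace = storageSpace;
        Source = source;
        VolumePart = volumePart;
    }

    public string Value { get; }
    public Generator Generator { get; }
    public StorageSpace StorageSpace { get; }
    public Source Source { get; }
    public string? VolumePart { get; }
}

public interface IIdentityService
{
    DdsIdentity GetIdentity(string s);
}

// Reproduces today's rule: a b number implies Goobi, digitised storage and Sierra;
// anything else is assumed to be a CALM reference processed by Archivematica.
public sealed class FormBasedIdentityService : IIdentityService
{
    public DdsIdentity GetIdentity(string s)
    {
        if (string.IsNullOrWhiteSpace(s))
        {
            throw new ArgumentException("Identifier is required", nameof(s));
        }

        string[] segments = s.Split('_');
        string first = segments[0];
        bool hasBNumber = first.Length == 9 && first[0] == 'b' && char.IsDigit(first[1]);

        if (hasBNumber)
        {
            string? volumePart = segments.Length > 1 ? first + "_" + segments[1] : null;
            return new DdsIdentity(s, Generator.Goobi, StorageSpace.Digitised,
                                   Source.Sierra, volumePart);
        }

        // Normalise the path-safe form (underscores) to the regular CALM form (slashes).
        return new DdsIdentity(s.Replace('_', '/'), Generator.Archivematica,
                               StorageSpace.BornDigital, Source.Calm, null);
    }
}

// Hypothetical consumer: it asks about storage, never about identifier form.
public static class StorageKeys
{
    public static string SpaceFor(IIdentityService identityService, string id) =>
        identityService.GetIdentity(id).StorageSpace == StorageSpace.BornDigital
            ? "born-digital"
            : "digitised";
}
```

A caller such as `StorageKeys.SpaceFor` would not need to change if a later implementation decided that a b number belongs in born-digital storage; only the `IIdentityService` implementation would.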
Initial Experimental Implementation
This major refactor has already been done and tested in this pull request: https://github.com/wellcomecollection/iiif-builder/pull/282
This wires up the new IIdentityService interface as a service dependency and provides an implementation that essentially has the same parsing functionality as the previous version:
/src/Wellcome.Dds/Wellcome.Dds.Common/ParsingIdentityService.cs
This returns the new DdsIdentity object.
It also caches parsed DdsIdentity objects in memory for efficiency. This won't make much difference now as the string parsing is very quick, but will be significant when the IIdentityService implementation needs to make calls to other sources of information.
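One way such an in-memory cache could look is sketched below. This is an assumption about the general technique, not a description of the pull request: `CachingLookup<T>` is a hypothetical generic wrapper around any lookup function, built on the standard `ConcurrentDictionary`.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical wrapper: remembers the result of an expensive lookup per identifier string.
public sealed class CachingLookup<T>
{
    private readonly Func<string, T> inner;
    private readonly ConcurrentDictionary<string, T> cache =
        new ConcurrentDictionary<string, T>(StringComparer.Ordinal);

    public CachingLookup(Func<string, T> inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    // Returns the cached value if present; otherwise calls the inner lookup and stores the result.
    // Note: under contention GetOrAdd may invoke the inner lookup more than once for the same key,
    // although only one result is kept.
    public T Get(string identifier) => cache.GetOrAdd(identifier, inner);

    public int Count => cache.Count;
}
```

A cache like this has no expiry, which is harmless while identity is derived purely from the string. Once the answer comes from another system that can change, some invalidation or time-to-live policy would probably be needed.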
Next steps
Complete testing of this refactor, deploy to production. Current PR has 96 changed files.
For the "b numbers in Archivematica" scenario, work out how we will know that the Generator property should be Generator.Archivematica and the StorageSpace property should be BornDigital
Implement / update our IIdentityService implementation
Understand what the Sierra-replacement identifiers will look like and what they mean, so that:
Given any identifier string, we can develop an implementation of IIdentityService that populates the fields of DdsIdentity