RFC 083: Stable identifiers following mass record migration
This RFC discusses what will happen to public catalogue identifiers following the mass migration of records from CALM/Sierra to Axiell Collections/Folio, and how the catalogue pipeline can be updated to accommodate this change.
Last modified: 2026-02-10T12:00:00+00:00
Related issues:
Table of contents
Context
The catalogue pipeline currently generates public catalogue identifiers in the ID Minter step, with a one-to-one relationship between the source identifiers retrieved by adapters and extracted by transformers (e.g. Sierra B numbers, CALM reference numbers) and public catalogue identifiers.
```mermaid
flowchart LR
    subgraph Adapter
        A[Adapter]
        ADB[("Stores:<br/><b>adapter IDs</b>")]
    end
    subgraph Transformer
        T[Transformer]
        TDB[("Stores:<br/><b>source IDs</b>")]
    end
    subgraph ID Minter
        M[ID Minter]
        MDB[("Stores:<br/><b>source ID → catalogue ID</b>")]
    end
    A -->|"source IDs"| T
    T -->|"source IDs<br/>+ merge candidates"| M
    M -->|"catalogue IDs"| D[Downstream]
```

Identifier types
We distinguish between three types of identifiers in the catalogue pipeline:
Adapter identifiers
Adapter identifiers are used to identify records within the adapter service and datastore. These identifiers may or may not correspond to source identifiers, but must uniquely identify records in the source system, in order to track which records have been updated.
Use of adapter identifiers will not be affected by the migration of records between source systems.
Source identifiers
Source identifiers are identifiers assigned by source systems and must be unique within those systems. Examples of source identifiers include:
Sierra B numbers (e.g. "sierra-system-number/b1161044x")
CALM Record IDs (e.g. "calm-record-id/77b9dab1-c5b0-4207-9876-da2b49b0306d")
These are extracted by the relevant transformer step and passed downstream in the pipeline.
The migration of records from CALM/Sierra to Axiell Collections/Folio will result in new source identifiers being assigned to records. This means that records which were previously identified by Sierra B numbers or CALM Record IDs will have different source identifiers in the new systems. Records in the new systems will contain metadata referring to the old identifiers, but these will not be the primary source identifiers used in the catalogue pipeline.
Public catalogue identifiers
Public catalogue identifiers follow a set of rules to ensure they are stable, human readable, and unique. This issue discusses the requirements of the ID Minter at the inception of the service: https://github.com/wellcomecollection/platform/issues/31.
At time of writing the ID scheme is implemented in Scala, see Identifiable.scala.
The rules for minting public catalogue identifiers are:
Must be 8 characters long
Uses characters a-z and 1-9
Excludes confusing characters: 'o', 'i', 'l', '1'
First character must be a letter (XML identifiers cannot start with numbers)
This results in an ID space of approximately 0.6 trillion unique identifiers (23 permitted first characters × 31^7 combinations for the remaining seven characters ≈ 6.3 × 10^11); see this notebook for details.
Public catalogue identifiers are generated in the ID Minter step of the catalogue pipeline, by looking up existing identifiers in the ID Registry (an Aurora database) or minting new ones as required by generating a random identifier and checking for collisions in the ID Registry.
Public catalogue identifiers are currently minted in a one-to-one relationship with source identifiers: at present, each source identifier corresponds to a unique public catalogue identifier.
If we continue to use source identifiers as the basis for minting public catalogue identifiers, the migration of records from CALM/Sierra to Axiell Collections/Folio will result in records receiving new public catalogue identifiers. This is because the source identifiers will change, and the ID Minter will therefore treat them as new records requiring new public catalogue identifiers.
Identifier usage in catalogue pipeline steps
This section describes how identifiers are used and transformed at each step of the catalogue pipeline.
Adapter
The adapter uses adapter identifiers internally to track records in its datastore. These are often the same as source identifiers, but may differ depending on the source system. For example, the Sierra adapter uses the record's internal database ID as the adapter identifier. The adapter outputs records tagged with a source identifier (e.g. sierra-system-number/b1161044x).
Transformer
The transformer receives records identified by their source identifier. It extracts structured data from the source record and outputs a transformed work document. Crucially, the transformer also identifies merge candidates — other records that should be merged with this one. Merge candidates are expressed as source identifiers (e.g. miro-image-number/V0012345).
The transformer outputs:
The work's source identifier (the primary identity of this record)
A list of merge candidate source identifiers (records to be merged with this work)
ID Minter
The ID Minter receives transformed works with source identifiers and merge candidates. For each source identifier (both the work's own identifier and all merge candidates), it either:
Looks up an existing public catalogue identifier from the ID Registry, or
Mints a new catalogue identifier if one doesn't exist
The minter maintains a one-to-one mapping between source identifiers and catalogue identifiers in the ID Registry (Aurora database). After processing, each source identifier on the document is enriched with its corresponding catalogue identifier.
The identifiers table in the ID Registry stores the following columns:

| Column | Description |
| --- | --- |
| CanonicalId | The public catalogue identifier (e.g. a2345bcd) |
| OntologyType | The type of entity (e.g. Work, Image) |
| SourceSystem | The source system type (e.g. sierra-system-number, calm-record-id) |
| SourceId | The identifier value within the source system (e.g. b1161044x) |

It enforces the following constraints:

- Primary key on CanonicalId: each catalogue identifier is globally unique and can only appear once in the registry.
- Unique key on (OntologyType, SourceSystem, SourceId): each source identifier (for a given ontology type) maps to exactly one catalogue identifier. This prevents the same source record from being assigned multiple catalogue identifiers.
The ID Minter outputs:
The work with its catalogue identifier (mapped from its source identifier)
Merge candidates enriched with their catalogue identifiers
Matcher
The matcher receives works with catalogue identifiers and uses the merge candidates to build a graph of connected works. It maintains an adjacency list (in DynamoDB) keyed by catalogue identifier, which tracks which catalogue identifiers should be merged together.
When a work arrives, the matcher:
Looks up existing connections for its catalogue identifier
Adds connections for any new merge candidate catalogue identifiers
Produces a subgraph of all transitively connected catalogue identifiers
The matcher operates entirely on catalogue identifiers — source identifiers are not used at this step.
Merger
The merger receives a set of connected catalogue identifiers from the matcher and retrieves the full work documents for each. It then applies merge rules to combine these works into a single merged work.
Importantly, the merger uses source identifier types (not the identifiers themselves) to determine precedence rules. For example, a Sierra record might take precedence over a METS record for certain fields. The merge rules are expressed in terms of source system types (e.g. "sierra-system-number" vs "mets-file").
The merger outputs:
A single merged work with one canonical catalogue identifier
The merged work contains the combined data from source records, based on field-level precedence rules determined by source types
A set of redirected catalogue identifiers for all merged records not chosen as the target
Summary of identifier usage
| Step | Input | Output | Notes |
| --- | --- | --- | --- |
| Adapter | Adapter IDs (internal) | Source IDs | Converts internal IDs to source system IDs |
| Transformer | Source IDs | Source IDs + merge candidate source IDs | Identifies related records for merging |
| ID Minter | Source IDs | Catalogue IDs | Maps source → catalogue IDs (1:1) |
| Matcher | Catalogue IDs | Graph of catalogue IDs | Builds merge graph using catalogue IDs only |
| Merger | Catalogue IDs + source types | Single catalogue ID | Uses source types for precedence rules |
Impact of migration on public catalogue identifiers
If we make no changes to the ID Minter during the migration, the following will occur:
All migrated records will receive new public catalogue identifiers. When records are migrated from CALM/Sierra to Axiell Collections/Folio, they will be assigned new source identifiers by the new systems. Since the ID Minter maps source identifiers to catalogue identifiers on a one-to-one basis, these new source identifiers will result in new public catalogue identifiers being minted for every migrated record.
Some source systems will continue to reference old identifiers. Systems such as METS will continue to refer to old Sierra/CALM identifiers indefinitely in existing records. However, newer records in these systems will receive new identifiers from the new source systems.
Pipeline changes are still required. Even without modifying the ID Minter, the following changes would be necessary:
Axiell Collections & Folio transformers would need to emit merge candidates targeting the appropriate source identifiers from Sierra & CALM. This would allow the matcher to connect the new records to their corresponding old records in the merge graph.
Merger precedence rules would need updating to allow the new source systems (Axiell Collections, Folio) to take precedence over the old systems (Sierra, CALM) when determining the canonical record for merged works.
This approach would preserve the connection between old and new records through the merge graph, but users would see different public catalogue identifiers for what is conceptually the same record. Any existing bookmarks, citations, or external links to catalogue URLs would break unless redirects are maintained from old catalogue identifiers to new ones.
Mass redirection on the public website. Because the new source system records would take precedence in the merger, the old catalogue identifiers (from Sierra/CALM records) would become redirect targets rather than canonical identifiers. This would result in the majority of works on the public collection website being redirected to new URLs.
This is undesirable for several reasons:
SEO impact. Search engines would need to reindex the entire collection under new URLs. Even with proper 301 redirects in place, there would be a period of reduced search visibility and potential loss of page authority accumulated over time.
Broken external references. Academic citations, Wikipedia links, bookmarks, and partner integrations would all redirect to different URLs, creating user confusion and potentially failing in systems that don't follow redirects correctly.
Analytics discontinuity. Usage analytics and reporting would be fragmented, with historical data associated with old identifiers and new data with new identifiers.
Redirect maintenance burden. The redirect mappings would need to be maintained indefinitely.
Proposed solution
Support multiple source identifiers per catalogue ID.
Overview
Modify the ID Minter to allow multiple source identifiers to map to the same public catalogue identifier. When a new Axiell Collections or Folio record arrives that references an old Sierra/CALM identifier (via a "predecessor" relationship), it should inherit the existing catalogue identifier rather than receiving a new one.
This is achieved through:
A new database schema that separates canonical ID management from source ID mappings
Support for pre-generating canonical IDs to improve minting performance
Transformer changes to emit predecessor source identifiers for migrated records
Rewriting the ID Minter as a Python Lambda following the catalogue pipeline modernisation pattern
Current service architecture
The existing ID Minter is implemented as a Scala Lambda, which can be invoked via SQS or Step Functions depending on context. It:
Receives document IDs to process (via SQS messages or Step Functions invocation)
Retrieves corresponding work documents from the transformer's Elasticsearch index
Extracts source identifiers and merge candidates from the document
Mints or looks up canonical IDs for each identifier in the Aurora database
Annotates the work document with canonical IDs
Writes the annotated document to its own Elasticsearch index for downstream consumers
This Scala Lambda will be rewritten in Python as part of the broader catalogue pipeline modernisation.
Database schema
Current schema
The current schema enforces a 1:1 relationship between canonical IDs and source identifiers via the primary key on CanonicalId.
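As a reference point, a minimal sketch of the current table, assuming a MySQL-compatible Aurora engine; column types are illustrative rather than the actual DDL:

```sql
-- Sketch of the current single-table schema.
CREATE TABLE identifiers (
    CanonicalId  VARCHAR(8)   NOT NULL,
    OntologyType VARCHAR(255) NOT NULL,
    SourceSystem VARCHAR(255) NOT NULL,
    SourceId     VARCHAR(255) NOT NULL,
    PRIMARY KEY (CanonicalId),
    UNIQUE KEY uq_source_identifier (OntologyType, SourceSystem, SourceId)
);
```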
Proposed schema
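The proposed schema separates canonical ID management from source ID mappings. The sketch below follows the table and column names used elsewhere in this RFC; the Status column and its 'free'/'assigned' values reflect the pre-generation design described later, and the exact types are illustrative:

```sql
-- Registry of canonical IDs, including pre-generated IDs not yet in use.
CREATE TABLE canonical_ids (
    CanonicalId VARCHAR(8) NOT NULL,
    Status      ENUM('free', 'assigned') NOT NULL DEFAULT 'free',
    PRIMARY KEY (CanonicalId)
);

-- Mappings from source identifiers to canonical IDs.
-- Several rows may share a CanonicalId (predecessor inheritance).
CREATE TABLE identifiers (
    OntologyType VARCHAR(255) NOT NULL,
    SourceSystem VARCHAR(255) NOT NULL,
    SourceId     VARCHAR(255) NOT NULL,
    CanonicalId  VARCHAR(8)   NOT NULL,
    CreatedAt    TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (OntologyType, SourceSystem, SourceId),
    FOREIGN KEY (CanonicalId) REFERENCES canonical_ids (CanonicalId)
);
```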
Discovering aliases
When multiple source identifiers share a canonical ID, the original record can be distinguished from aliases by CreatedAt timestamp — the earliest row is the original, later rows are aliases:
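A sketch of such a lookup, using the schema above (the canonical ID value is illustrative):

```sql
-- All source identifiers sharing a canonical ID, oldest (the original) first.
SELECT OntologyType, SourceSystem, SourceId, CreatedAt
FROM identifiers
WHERE CanonicalId = 'abc123'
ORDER BY CreatedAt ASC;
```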
Future migrations
This schema supports additional source system migrations without modification. If a future migration introduces another successor system (e.g. migrating from Folio to a new LMS), new records simply declare predecessors pointing to Folio source identifiers. Multiple aliases can chain: a canonical ID might have mappings from Sierra → Axiell → FutureSystem, all discoverable via the shared CanonicalId and ordered by CreatedAt.
Schema design rationale
| Table | Purpose |
| --- | --- |
| canonical_ids | Registry of all canonical IDs in use; enforces global uniqueness; enables pre-generation of IDs |
| identifiers | Maps source identifiers to canonical IDs; allows multiple source IDs to share a canonical ID |
Constraints
- Canonical ID uniqueness: The primary key on canonical_ids.CanonicalId ensures each catalogue ID is globally unique
- Source ID uniqueness: The primary key on identifiers ensures each source identifier maps to exactly one catalogue ID
- Referential integrity: The foreign key ensures all canonical IDs in identifiers exist in canonical_ids
- Intentional sharing only: Multiple source identifiers can only share a canonical ID through explicit predecessor relationships — accidental collisions are impossible
ID pre-generation
The canonical_ids table supports pre-generating IDs to eliminate collision checking during minting.
Pre-generation process
A background job maintains a pool of free IDs:
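A sketch of the top-up insert, assuming the job generates candidate IDs conforming to the rules above and silently skips any that already exist (values are illustrative):

```sql
-- Top up the pool with freshly generated candidate IDs.
-- INSERT IGNORE skips any candidate that collides with an existing row.
INSERT IGNORE INTO canonical_ids (CanonicalId, Status)
VALUES ('a2b3c4d5', 'free'),
       ('b7x8y9z2', 'free'),
       ('c4m5n6p7', 'free');
```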
Claiming a free ID
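A sketch of the claim query, assuming a MySQL-compatible engine (SKIP LOCKED is available from MySQL 8.0):

```sql
-- Claim up to N unclaimed IDs; rows locked by other transactions are skipped.
SELECT CanonicalId
FROM canonical_ids
WHERE Status = 'free'
LIMIT 10
FOR UPDATE SKIP LOCKED;
```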
The SKIP LOCKED clause ensures concurrent processes each claim different free IDs without blocking. If a process crashes before commit, the transaction rolls back and the ID returns to free automatically.
Benefits of pre-generation
No collision retry loop: Current minting generates a random ID, checks for collision, and retries if needed. With pre-generation, IDs are guaranteed unique at claim time.
Predictable performance: Minting time is constant regardless of how full the ID space becomes.
Batch efficiency: IDs can be generated in bulk during off-peak hours.
ID Minter logic
Lookup and minting flow
A source identifier is a tuple of (OntologyType, SourceSystem, SourceId) — for example, (Work, sierra-system-number, b1161044x).
Combined lookup query
The lookup query retrieves both the new source ID and its predecessor (if specified) in a single database round-trip. Each row is tagged to indicate whether it matches the new source ID or the predecessor. Since predecessors may have different ontology types, the query matches on the full tuple:
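A sketch of the combined lookup, using the Axiell Collections/Sierra example from the predecessor section below; the MatchType tagging and UNION ALL structure are illustrative:

```sql
-- Look up the new source ID and its declared predecessor in one round-trip,
-- tagging each row so the caller can tell which tuple matched.
SELECT CanonicalId, 'new' AS MatchType
FROM identifiers
WHERE OntologyType = 'Work'
  AND SourceSystem = 'axiell-collections-id'
  AND SourceId = '12345'
UNION ALL
SELECT CanonicalId, 'predecessor' AS MatchType
FROM identifiers
WHERE OntologyType = 'Work'
  AND SourceSystem = 'sierra-system-number'
  AND SourceId = 'b1161044x';
```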
The query returns 0, 1, or 2 rows:
- 0 rows: Neither found — raise exception if predecessor was specified, otherwise mint new ID
- 1 row with MatchType = 'new': New source ID already exists — return its CanonicalId
- 1 row with MatchType = 'predecessor': Only predecessor found — insert new source ID with predecessor's CanonicalId
- 2 rows: Both found — return the CanonicalId from the new source ID row
When no predecessor is specified, the query simplifies to a direct lookup:
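A sketch of the direct lookup (illustrative values):

```sql
SELECT CanonicalId
FROM identifiers
WHERE OntologyType = 'Work'
  AND SourceSystem = 'sierra-system-number'
  AND SourceId = 'b1161044x';
```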
Idempotent writes
All inserts to the identifiers table use idempotent writes to handle concurrent processes:
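A sketch of such a write, using the ON DUPLICATE KEY UPDATE no-op idiom referenced in the batch minting flow below (values are illustrative):

```sql
-- Insert the mapping; if another process has already inserted the same
-- source identifier, this becomes a no-op rather than an error.
INSERT INTO identifiers (OntologyType, SourceSystem, SourceId, CanonicalId)
VALUES ('Work', 'axiell-collections-id', '12345', 'abc123')
ON DUPLICATE KEY UPDATE CanonicalId = CanonicalId;
```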
Returning unused IDs to the pool
When a claimed ID is not used (because another process won the race), return it to the pool within the same transaction:
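A sketch of the bookkeeping this implies, assuming the claim is the plain SELECT ... FOR UPDATE SKIP LOCKED shown earlier: only the canonical IDs confirmed as used are marked assigned, so a claimed-but-unused ID keeps its 'free' status and its row lock is released when the transaction commits:

```sql
-- Mark as assigned only the IDs that the verification query confirmed were used.
-- A claimed-but-unused ID receives no update, so it stays 'free' in the pool.
UPDATE canonical_ids
SET Status = 'assigned'
WHERE CanonicalId IN ('abc123');
```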
Batch operations
The ID Minter processes work documents that typically contain multiple source identifiers (the work's own identifier plus merge candidates). Batch operations reduce database round-trips for the common case where most identifiers already exist.
For a detailed demonstration of batch operations, predecessor inheritance, and race condition handling, see id_minter_demo.ipynb.
Batch lookup
The majority of records processed by the ID Minter already have canonical IDs. A batch lookup fetches all existing mappings in a single query, supporting mixed ontology types:
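A sketch of the batch lookup using row-value constructors, with illustrative identifiers mixing Work and Image ontology types:

```sql
SELECT OntologyType, SourceSystem, SourceId, CanonicalId
FROM identifiers
WHERE (OntologyType, SourceSystem, SourceId) IN (
    ('Work',  'sierra-system-number',  'b1161044x'),
    ('Work',  'axiell-collections-id', '12345'),
    ('Image', 'miro-image-number',     'V0012345')
);
```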
This returns only the IDs that exist — missing IDs require minting.
Batch minting flow
The mint_ids() method processes source IDs in a single transaction with ~6 queries regardless of batch size:
The flow:
1. Batch lookup source IDs + predecessor IDs together (single query)
2. Fail fast if any predecessors are missing
3. Batch INSERT for predecessor inheritance cases (inherit existing canonical ID)
4. Claim free IDs from pool (SELECT ... FOR UPDATE SKIP LOCKED with LIMIT N)
5. Batch INSERT for new records with claimed IDs (ON DUPLICATE KEY UPDATE for idempotency)
6. Verify assignments — query back which canonical IDs were actually used (race detection)
7. Mark assigned only IDs that were used, commit transaction
Race detection and ID pool management
Between the initial lookup and INSERT, another process may have committed a mapping for a source ID we thought was new. The verification query detects this:
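A sketch of the verification query, re-reading the mappings for the source identifiers that were just written (illustrative values):

```sql
-- Which canonical ID actually ended up attached to each source identifier?
SELECT OntologyType, SourceSystem, SourceId, CanonicalId
FROM identifiers
WHERE (OntologyType, SourceSystem, SourceId) IN (
    ('Work', 'sierra-system-number', 'b1234')
);
```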
If our claimed ID appears in the result → we won the race, mark as 'assigned'
If a different ID appears → another process won, our claimed ID stays 'free'
This ensures the free ID pool is not depleted by lost races. The FOR UPDATE SKIP LOCKED ensures concurrent transactions claim different free IDs without blocking.
Atomicity
All operations occur within a single transaction:
If predecessor validation fails, nothing is committed
If the free ID pool is exhausted, nothing is committed
Successful batches commit all mappings together
Concurrency scenarios
Scenario 1: Same source identifier, two processes
| Step | Process A | Process B |
| --- | --- | --- |
| 1 | Query identifiers for sierra/b1234 → not found | Query identifiers for sierra/b1234 → not found |
| 2 | Claim free ID abc123 (SKIP LOCKED) | Claim free ID xyz789 (SKIP LOCKED) |
| 3 | INSERT (idempotent) → succeeds | INSERT (idempotent) → no-op (row exists) |
| 4 | Verify → abc123 was used, mark assigned, COMMIT | Verify → abc123 was assigned (not xyz789) |
| 5 | Return abc123 | xyz789 stays free (not marked assigned), COMMIT |
| 6 | | Return abc123 |
Outcome: Both processes return the same canonical ID. SKIP LOCKED ensures no blocking. The verification query detects that Process B's claimed ID was not used, so it remains in the free pool.
Scenario 2: Different source IDs, same predecessor
| Step | Process A (axiell/123, predecessor sierra/b1234) | Process B (axiell/456, predecessor sierra/b1234) |
| --- | --- | --- |
| 1 | Query axiell/123 OR sierra/b1234 → sierra/b1234 has abc123 | Query axiell/456 OR sierra/b1234 → sierra/b1234 has abc123 |
| 2 | INSERT (axiell/123, abc123) → succeeds | INSERT (axiell/456, abc123) → succeeds |
| 3 | Return abc123 | Return abc123 |
Outcome: Both source identifiers correctly share the same canonical ID. No conflict.
Scenario 3: Predecessor doesn't exist
| Step | Process B (successor axiell/123, predecessor sierra/b1234) |
| --- | --- |
| 1 | Query axiell/123 OR sierra/b1234 → neither found |
| 2 | Raise exception |
Outcome: Process B fails with an error indicating the predecessor record was not found.
Rationale: If a record declares a predecessor, that predecessor should already exist in the database. The predecessor record would have been processed through the pipeline from the old source system (e.g., Sierra) before or during the migration to the new system (e.g., Axiell Collections). A missing predecessor indicates either:
Data integrity issue: The migrated record references a predecessor that doesn't exist in the source system
Pipeline configuration error: The old system's pipeline was disconnected before all records were processed
Incorrect predecessor reference: The transformer is emitting an invalid predecessor identifier
All of these are error conditions that should surface immediately rather than being masked by retry logic. Silent retries could hide data quality issues or configuration problems that need investigation.
Scenario 4: Re-processing without predecessor
| Step | First processing (axiell/123, predecessor sierra/b1234) | Re-processing (axiell/123, no predecessor) |
| --- | --- | --- |
| 1 | Query axiell/123 OR sierra/b1234 → sierra/b1234 has abc123 | Query axiell/123 → found with abc123 |
| 2 | INSERT (axiell/123, abc123) → succeeds | Return abc123 |
| 3 | Return abc123 | |
Outcome: The canonical ID is stable regardless of whether a predecessor is provided on subsequent calls.
Key property: The predecessor relationship is only meaningful at first-mint time. Once a source ID is assigned a canonical ID, that mapping is fixed in the database. Re-processing the same source ID — whether with a different predecessor, no predecessor, or the same predecessor — always returns the existing canonical ID.
This makes the ID Minter idempotent and resilient to transformer changes. If a transformer's predecessor logic is updated or removed after initial minting, existing canonical IDs remain unchanged.
Predecessor source identifiers
Transformers for Axiell Collections and Folio will emit a predecessor source identifier when the record contains a reference to an old Sierra/CALM identifier. This is distinct from merge candidates:
Merge candidates: "This record should be merged with these other records"
Predecessor: "This record should inherit the catalogue ID of this old record"
Example: An Axiell Collections record migrated from Sierra might emit:
- Source identifier: axiell-collections-id/12345
- Predecessor: sierra-system-number/b1161044x
The ID Minter will look up the predecessor's catalogue ID and assign it to the new source identifier.
Data migration
A runnable demonstration of the complete migration process is available in schema_migration.ipynb. The notebook walks through each step below using sample data from the current ID Minter database.
Migration steps
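The authoritative, step-by-step walkthrough lives in the notebook above. As a rough sketch of the core data copy only, using a hypothetical name (identifiers_legacy) for the existing 1:1 table:

```sql
-- 1. Register every canonical ID already in use, marked as assigned.
INSERT INTO canonical_ids (CanonicalId, Status)
SELECT DISTINCT CanonicalId, 'assigned'
FROM identifiers_legacy;

-- 2. Copy the existing 1:1 mappings into the new identifiers table.
INSERT INTO identifiers (OntologyType, SourceSystem, SourceId, CanonicalId)
SELECT OntologyType, SourceSystem, SourceId, CanonicalId
FROM identifiers_legacy;
```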
Service rewrite
The ID Minter will be rewritten as a Python Lambda function, following the pattern established by the modernisation of other catalogue pipeline services. The new service will be invoked as part of an AWS Step Functions state machine.
Architecture
Input/Output
The Lambda receives a batch of document IDs as input, matching the existing Scala interface:
The Lambda then:
Retrieves the work documents from the transformer Elasticsearch index using these document IDs
Extracts all source identifiers requiring minting (primary source ID, merge candidates, predecessor if present)
Mints or looks up canonical IDs for each identifier
Annotates the document with canonical IDs
Writes the annotated document to the minted works index
Output confirms which documents were successfully processed:
Responsibilities
The new Lambda retains all responsibilities of the current service:
| Responsibility | Description |
| --- | --- |
| Document retrieval | Fetch work documents from the transformer's Elasticsearch index using document IDs |
| ID extraction | Parse source identifiers, merge candidates, and predecessors from documents |
| ID minting | Mint new canonical IDs or look up existing ones from the Aurora database |
| Predecessor inheritance | Inherit canonical IDs from predecessor source identifiers during migration |
| ID annotation | Annotate source identifiers and merge candidates with their canonical IDs |
| Document storage | Write annotated documents to the minted works Elasticsearch index |
Hybrid operation with legacy transformers
During the transition period, some transformers (TEI, METS, etc.) will continue to operate outside Step Functions while Sierra & CALM are replaced by Axiell Collections and Folio transformers. The ID Minter must support both invocation patterns:
Strategy options:

| Option | Description | Pros | Cons |
| --- | --- | --- | --- |
| EventBridge Pipes | Serverless point-to-point integration connecting SQS to Lambda | No code, built-in batching/filtering/transformation, fully managed | Additional AWS service to configure |
| Dual triggers | Configure Lambda with both Step Functions and SQS triggers | Single service, simple | Lambda concurrency management more complex; no input transformation |
| Migrate all transformers first | Complete transformer modernisation before ID Minter rewrite | Clean cutover | Delays ID Minter work; may not be feasible |
Recommended approach: EventBridge Pipes — Pipes provide a serverless, zero-code bridge from SQS to Lambda with several advantages:
Input transformation: Can transform legacy SQS message format to match Step Functions input format, keeping Lambda code unified
Batching controls: Advanced batching configuration (batch size, batching window) without Lambda configuration changes
Filtering: Can filter messages before invoking Lambda, reducing unnecessary invocations
Enrichment: Optionally enrich events by calling another Lambda or API Gateway before the target
Error handling: Built-in dead-letter queue support and retry configuration
This approach keeps the ID Minter Lambda identical regardless of invocation source — the Pipe handles any format differences between legacy SQS messages and Step Functions input.
Pipeline changes required
Transformer changes
Axiell Collections transformer: Emit predecessor source identifier when record contains Sierra B number reference
Folio transformer: Emit predecessor source identifier when record contains CALM reference number
Merger changes
Update precedence rules to allow new source systems to take precedence:
Axiell Collections > Sierra
Folio > CALM
ID Minter changes
Update database access layer for new schema
Add predecessor lookup logic
Implement free ID claiming from pool with SKIP LOCKED
Fail fast on empty pool (rely on retry mechanisms)
Benefits
| Aspect | Without change | With proposed change |
| --- | --- | --- |
| Catalogue ID stability | New source ID = new catalogue ID | Migrated records inherit existing catalogue ID |
| Mass redirection | All migrated records redirect | No redirection for migrated records |
| External references | All break or redirect | Preserved |
| SEO | Reindexing required | No impact |
| Analytics continuity | Fragmented | Preserved |
| Minting performance | Collision retry loop | Pre-claimed IDs, no retries |
Risks and mitigations
| Risk | Mitigation |
| --- | --- |
| Free ID pool exhaustion | Fail fast; rely on SQS/Step Functions retry; alerts trigger pre-generation job |
| Migration data integrity | Run migration in stages with verification; keep old table until confirmed |
| Incorrect predecessor extraction by transformer | Validate predecessor format; log all predecessor lookups for audit; reconciliation process to detect mismatches |
| Orphaned canonical IDs (marked assigned but not in identifiers) | Periodic reconciliation job to detect and clean up |
Note: Predecessor records (Sierra/CALM) will all be present in the database before any successor records (Axiell Collections/Folio) are processed. This is guaranteed by the source system switchover process — no new predecessor records will be created after switchover.
Alternative solutions considered
This section describes alternative approaches that were considered for maintaining stable public catalogue identifiers during the CALM/Sierra to Axiell Collections/Folio migration.
For background on the problem, see:
Summary of approaches
| Approach | Description | Outcome |
| --- | --- | --- |
| Transformer ID swap | Swap old/new IDs in transformer | Rejected — adds complexity, misleading source labels |
| Database swap | Replace old IDs with new IDs directly in database | Rejected — creates orphaned IDs, doesn't preserve history |
| Runtime ID matching | Minter identifies original source at runtime | Rejected — combines downsides of other approaches |
Transformer ID swap
Approach: Handle identifier continuity in the transformer rather than the ID Minter. When a record in the new system (Axiell/Folio) has been migrated from the old system (Sierra/CALM), the transformer would:
Detect the old identifier stored as an alternative identifier in the new record
Swap the old and new identifiers, so the record continues to use its Sierra/CALM source identifier downstream
Pass the new Axiell/Folio identifier as an alternative identifier instead
Example:
- Adapter output: axiell-collections-id/12345 (new identifier)
- Transformer output: sierra-system-number/b1161044x (old identifier, swapped to primary position)
- Alternative identifier: axiell-collections-id/12345
Concerns identified:
- Misleading source system labels: Records coming from Folio would have Sierra B numbers and be labelled as sierra-system-number in their source identifier, which is confusing and dishonest about the record's actual source.
- Read-before-update required: Handling deletions and updates would require reading the existing record to find the ID to propagate. This breaks the pure function nature of the transformer, impacting performance.
- Risk of drift: The old identifier field in the new system is not a controlled identifier. It could be edited or removed at any point, potentially orphaning records downstream.
- Multiple potential old identifiers: A former Sierra record might also contain a CALM ID. The transformer would need logic to identify which old ID is the "correct" one to use.
- Multiplies transformer concerns: Keeping permanent information about ID history in the transformer is undesirable as it adds complexity to what should be a straightforward transformation.
Outcome: Rejected. While this approach would work, it adds significant complexity to the transformer and results in records that misrepresent their source system. The ID Minter is a more appropriate place to handle identifier mapping.
Database swap
Approach: Make no application changes. Instead, directly modify the ID Minter database before the migration:
Generate a list of mappings between old (Sierra/CALM) and new (Axiell/Folio) source identifiers
Update the database to replace all old source identifiers with new source identifiers
The ID Minter continues operating with its 1:1 constraint, but now uses new identifiers
Concerns identified:
Orphaned canonical IDs: If we later want to retain the old identifiers (e.g. as alternative identifiers on works), we'd create pointless new canonical IDs for them. This essentially moves the problem rather than solving it.
Loss of history: The database would no longer record that a canonical ID was originally minted for a Sierra/CALM record.
All-or-nothing migration: Requires complete identifier mapping before switchover. Any records missed would get new canonical IDs.
References to old identifiers break: Other systems (like METS) that reference Sierra B numbers would no longer find matches unless we also update those references.
Outcome: Rejected. This approach "works" in the narrow sense of preserving canonical IDs, but creates problems for maintaining alternative identifiers and loses valuable history about record provenance. It was considered as a fallback option if the many-to-one approach proved too complex.
Runtime ID matching in Minter
Approach: Allow a 1:many relationship between canonical IDs and source IDs, with the ID Minter determining the correct mapping at runtime:
When a new record arrives, the minter examines alternative identifiers in the record
If an alternative identifier matches an existing source ID in the database, use that canonical ID
Otherwise mint a new canonical ID
Concerns identified:
Minter needs source system knowledge: The minter would need to understand which alternative identifier to look up based on the source system, making it tightly coupled to transformer output format.
Duplicate handling: What happens if a record has multiple potential matches? Or if the alternative identifier value changes?
Combines downsides: Has the complexity of multiple approaches without their benefits — still requires alternative identifier handling in transformers, plus schema changes in the minter.
Outcome: Rejected. This approach was identified as having "all the downsides of both ideas" (transformer approach and many-to-one approach) without clear advantages. The cleaner solution is to handle predecessor relationships explicitly.
Why many-to-one was selected
The key insight from the discussion was that whatever approach we take, we probably want to remove the 1:1 relationship in the ID Minter. This is because:
We want to retain old identifiers (Sierra/CALM) as alternative identifiers on works, at least in the short term, since there will be references elsewhere (including physical pieces of paper).
If we retain the old identifiers as alternatives but don't update the ID Minter, we'd create new canonical IDs for them — essentially recreating the problem.
The ID Minter is the correct place to manage identifier mappings, not the transformer.
The many-to-one approach with predecessor relationships cleanly solves these issues while maintaining clear data provenance and supporting future migrations.
Implementation plan
This section outlines how the work can be split to enable parallel development and early release of changes before the source system migration.
Work streams
The implementation can be split into independent work streams that proceed in parallel:
Stream A: Step Functions migration
Move the existing Scala minter from SQS-based invocation to Step Functions for all transformer → ID Minter pairs.
Establishes the Step Functions infrastructure that the new Python minter will use
Reduces scope of the Python rewrite (no need to support SQS directly)
Provides a clean rollback path — can switch back to Scala minter if issues arise with Python version
Stream B: Database schema migration
Migrate the Aurora database to the new two-table schema while maintaining compatibility with the existing Scala minter. Set up a regular dump → load process to populate a test database with production data, enabling integration testing before cutover.
Deliverable: Database ready for new minter; ID pre-generation job running; test database with production data sync.
Stream C: ID generation logic
Implement and test the core minting logic in Python, independent of ES integration. Includes predecessor lookup, free ID claiming, fallback generation, and idempotent writes.
Deliverable: Tested Python module for ID minting operations.
Stream D: ES document parsing and annotation
Implement and test the Elasticsearch integration — reading from transformer index, extracting source identifiers, annotating documents with canonical IDs, writing to minted index.
Deliverable: Tested Python module for ES document handling.
Integration and cutover
Once streams A-D are complete:
Integrate Streams C and D into Lambda function
End-to-end testing with Step Functions
Shadow mode: run Python minter alongside Scala, compare outputs
Configure EventBridge Pipes for any remaining legacy transformers (TEI, METS)
Gradual rollout and Scala decommission
Migration support
Enable predecessor inheritance for Axiell Collections and Folio records. This work can be prepared in advance but should not be deployed until source system migration timing is confirmed.
Axiell Collections transformer: emit predecessor field for Sierra B numbers
Folio transformer: emit predecessor field for CALM reference numbers
Update merger precedence rules (Axiell > Sierra, Folio > CALM)
TL;DR
Problem: When we migrate records from Sierra/CALM to Axiell Collections/Folio, every record will get a new source identifier. The current ID Minter maps source identifiers to public catalogue identifiers 1:1, so every migrated record would receive a new public URL. This would break all existing links, citations, and bookmarks, require mass redirects, harm SEO, and fragment analytics.
Solution: Modify the ID Minter to allow multiple source identifiers to share the same public catalogue identifier. When transformers emit a "predecessor" field linking new Axiell/Folio records to their old Sierra/CALM identifiers, the minter will look up the predecessor's catalogue ID and assign it to the new record. This preserves URLs, requires no redirects for migrated records, and supports future source system migrations without further schema changes.
Implementation: The work splits into four parallel streams: (A) moving the existing Scala minter to Step Functions, (B) migrating the database schema to support many-to-one mappings with ID pre-generation, (C) implementing the new minting logic in Python, and (D) implementing Elasticsearch document parsing. Once integrated, the new Python Lambda replaces the Scala service. Transformer changes to emit predecessor fields are prepared separately and deployed when source system migration begins.