Creating canonical identifiers

Our source catalogues all use their own identifier schemes, which may have differently formatted or overlapping values.

For example, the identifier b14561980 could refer to:

a bib record for a book in Sierra, or
the METS file for a digitised copy of that book

To help us to refer to entities consistently and unambiguously, the catalogue pipeline creates its own "canonical" identifiers. These are distinct from identifiers in the source catalogue, and used whenever we need to refer to specific records, e.g. in URLs.

The format of canonical IDs

Canonical IDs are chosen for the following properties:

They should be short, ideally something that fits on a single post-it note
They should be unambiguous, so they don't use characters which can look ambiguous (e.g. letter O and numeral 0)
They should be URL safe

How we create canonical IDs from source identifiers

There's a one-to-one relationship between canonical IDs and source identifiers.

A source identifier is made of three parts:

The source system of the identifier, e.g. miro-image-number or sierra-system-number. This helps us distinguish between overlapping identifiers from different systems, like the b14561980 example above.
The value of the identifier, e.g. b14561980 or V0000287.
The ontology type of the identifier. This helps us create different canonical IDs for different contexts. e.g. Miro identifiers are used for both a Work and an Image, and we want the Work and the Image to have different canonical IDs. The Miro source identifiers will have ontology types Work and Image, respectively, so they get distinct canonical IDs.

PreviousOur single model: the Work NextWhy do we combine records?

Last updated 2 years ago