Catalogue pipeline
  • Introduction
  • Fetching records from source catalogues
    • What is an adapter?
    • CALM: Our archive catalogue
    • MIRO: Our image collections
  • Transforming records into a single, common model
    • Our single model: the Work
    • Creating canonical identifiers
  • Combining records from multiple sources
    • Why do we combine records?
    • How we choose which records to combine
  • Other topics
    • Catalogue
    • Search
      • wellcomecollection.org query development index
      • Hypotheses
        • Concepts, subjects
        • Contributors
        • Titles
        • Genres
        • Reference numbers
        • Synonymous names and subjects
        • Mood
        • Phrases
        • Concepts, subjects with other field
        • Contributor with other field
        • Title with other field
        • Genre with other field
        • Reference number with other field
        • Behaviours
        • Further research and design considerations
      • Analysis
        • Less than 3-word searches
        • Searches with 3 words or more
        • Subsequent searches
      • Query design
      • Relevance tests
        • Test 1 - Explicit feedback
        • Test 2 - Implicit feedback
        • Test 3 - Adding notes
        • Test 4 - AND or OR
        • Test 5 - Scoring Tiers
        • Test 6 - English tokeniser and Contributors
        • Test 7 - BoolBoosted vs ConstScore
        • Test 8 - BoolBoosted vs PhaserBeam
      • Collecting data
      • Reporting and metrics
      • Work IDs crib sheet
    • Adapters
      • Adapter lifecycle
      • Fetching records from Sierra
    • Sierra
      • Sierra IDs
    • Pipeline
      • Merging
    • APM
Powered by GitBook
On this page
  • The format of canonical IDs
  • How we create canonical IDs from source identifiers
  1. Transforming records into a single, common model

Creating canonical identifiers

Our source catalogues all use their own identifier schemes, which may have differently formatted or overlapping values.

For example, the identifier b14561980 could refer to:

  • a bib record for a book in Sierra, or

  • the METS file for a digitised copy of that book

To help us to refer to entities consistently and unambiguously, the catalogue pipeline creates its own "canonical" identifiers. These are distinct from identifiers in the source catalogue, and used whenever we need to refer to specific records, e.g. in URLs.

The format of canonical IDs

Canonical IDs are chosen for the following properties:

  • They should be short, ideally something that fits on a single post-it note

  • They should be unambiguous, so they don't use characters which can look ambiguous (e.g. letter O and numeral 0)

  • They should be URL safe

How we create canonical IDs from source identifiers

There's a one-to-one relationship between canonical IDs and source identifiers.

A source identifier is made of three parts:

  • The source system of the identifier, e.g. miro-image-number or sierra-system-number. This helps us distinguish between overlapping identifiers from different systems, like the b14561980 example above.

  • The value of the identifier, e.g. b14561980 or V0000287.

  • The ontology type of the identifier. This helps us create different canonical IDs for different contexts. e.g. Miro identifiers are used for both a Work and an Image, and we want the Work and the Image to have different canonical IDs. The Miro source identifiers will have ontology types Work and Image, respectively, so they get distinct canonical IDs.

PreviousOur single model: the WorkNextWhy do we combine records?

Last updated 2 years ago