
RFC 054: Authoritative ids with multiple Canonical ids.



The problem

Some authoritative ids (including label-derived ones) correspond to more than one Concept in the Works catalogue. It is therefore not possible to link reliably by id between a work and a concept, because the canonical id used for a given concept in one work may not be the same as the canonical id used for the same concept in another work.

The root cause of this is that the ontologyType of a concept forms part of the key used to mint a canonical id, and the ontologyType of a concept is determined from the MARC field it comes from in a Sierra document.

Further, the Concepts pipeline currently assumes a 1:1 relationship between a Concept and an authoritative id. This causes some expected Concepts to be absent from the concepts API.

Finally, this blocks the implementation of Genres as Concepts, because we cannot reconcile genre-as-a-subject with genre-of-a-work whilst also marking genres with a distinct ontologyType.

How it is

The Concepts Aggregator extracts Concepts from Works, and uses the bulk API to store them, keyed on the authoritative id, in the catalogue-concepts index. The Recorder then reconciles these records with the corresponding official records and stores the combined record in the concepts-store index, keyed on the canonicalId.
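Roughly, the two indexes hold records shaped like the following (field names here are illustrative assumptions, not the actual index mappings):

```python
# catalogue-concepts: one document per authoritative id, written by the Aggregator
aggregated = {
    "_id": "lc-subjects/sh85101552",  # authoritative id (illustrative)
    "label": "Medicine",
    "ontologyType": "Concept",
}

# concepts-store: the reconciled record written by the Recorder,
# keyed on the canonical id minted by the catalogue pipeline
recorded = {
    "_id": "abc123",  # canonical id (illustrative)
    "label": "Medicine",
    "identifiers": [{"authority": "lc-subjects", "id": "sh85101552"}],
}
```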

Proposal

The Aggregator will extract Concepts in the same way, but the bulk command will now include an ingest pipeline with append processors to collect ids and types.
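As a sketch of what the bulk request might look like (index, pipeline, and field names are illustrative assumptions, not the real mappings), each Concept would be indexed through the pipeline via the `pipeline` query parameter on the `_bulk` endpoint:

```python
import json

def bulk_body(concepts):
    """Build the NDJSON body for a _bulk request. The request itself would
    target e.g. POST /catalogue-concepts/_bulk?pipeline=collect-canonical-ids
    so every document passes through the ingest pipeline on the way in."""
    lines = []
    for concept in concepts:
        # action line: upsert keyed on the authoritative id
        lines.append(json.dumps({"index": {"_id": concept["authoritativeId"]}}))
        # document line: ids and types are sent as single-element lists so
        # the pipeline's append processors can accumulate them across Works
        lines.append(json.dumps({
            "label": concept["label"],
            "canonicalIds": [concept["canonicalId"]],
            "ontologyTypes": [concept["ontologyType"]],
        }))
    return "\n".join(lines) + "\n"
```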

The Recorder will then create a record for each canonicalId in the list, choosing the "best" ontologyType and applying it to all of the output records.

The "best" ontologyType is the most specific one: Concept < Agent < everything else. This hierarchy is already used to choose the best type when the same authoritative id occurs in multiple places within a single document. There may be conflicts where multiple types of the same specificity are present on the same catalogue concept, but this is unlikely to occur, and if it does, it is likely to be an error in the source data.

This proposal also allows us to start considering same-as relationships in the Concepts API. The entries in the concept store can contain a list of all the canonical ids of other concepts with the same authoritative id (possibly also including its own).
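A minimal sketch of this merge-and-split step (function and field names are assumptions for illustration): choose the most specific type from those collected, then emit one record per canonical id, each carrying its siblings as sameAs:

```python
# Specificity ranking: lower rank = more generic; anything not listed
# (Person, Organisation, Place, ...) counts as most specific.
SPECIFICITY = {"Concept": 0, "Agent": 1}

def best_type(types):
    """Choose the most specific ontologyType seen for one authoritative id."""
    return max(types, key=lambda t: SPECIFICITY.get(t, 2))

def split_records(authoritative_id, canonical_ids, ontology_types):
    """One merged document (keyed on the authoritative id, with the collected
    lists of ids and types) becomes one concepts-store record per canonical id,
    all sharing the chosen type."""
    chosen = best_type(ontology_types)
    return [
        {
            "canonicalId": canonical_id,
            "authoritativeId": authoritative_id,
            "ontologyType": chosen,
            "sameAs": [other for other in canonical_ids if other != canonical_id],
        }
        for canonical_id in canonical_ids
    ]
```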

Why merge and split?

The alternative to merging and splitting would be to have a separate record for each authoritative/canonical id pair in catalogue concepts.

For each authoritative id, the Recorder currently fetches one Concept Record by id. Changing it to search for multiple records would be a significant change, whereas the change to the output is much less extreme.

Why use an ingest pipeline?

The alternative is for the Aggregator to first fetch any records it will overwrite, then populate the id and type members accordingly, deduplicating members of the lists and so on, all of which adds significant complexity.

An append processor will do this declaratively and efficiently inside the database.
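Such a pipeline definition might look like the following (the pipeline name and field names are illustrative); each `append` processor adds the incoming values to a list field, with `allow_duplicates: false` taking care of deduplication:

```python
# Sketch of PUT _ingest/pipeline/collect-canonical-ids (not the real definition)
pipeline = {
    "description": "Collect canonical ids and ontology types per authoritative id",
    "processors": [
        {"append": {
            "field": "canonicalIds",
            "value": ["{{{canonicalId}}}"],
            "allow_duplicates": False,
        }},
        {"append": {
            "field": "ontologyTypes",
            "value": ["{{{ontologyType}}}"],
            "allow_duplicates": False,
        }},
    ],
}
```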

Even better/eventually

Ideally, the root cause of this should be fixed by removing ontologyType from the id minter. However, that is a very complex change, and we would still need to take an approach like the one proposed here to collect the ontologyTypes and choose the most appropriate one.

Out of scope

How to find works

This proposal does not consider exactly how the API, search, or Concepts pages will make use of the new sameAs data.

Essentially, this behaviour will be expected:

GIVEN two synonymous concepts, 'abc123' and 'def789'
WHEN works containing 'abc123' are requested
THEN works containing either 'abc123' or 'def789' are returned
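One way this could work (a sketch only; the field path and lookup shape are assumptions, and the proposal explicitly leaves the mechanism open) is to expand the requested id with its sameAs set before filtering works:

```python
def expand_concept_ids(requested_id, same_as_lookup):
    """Expand a requested concept id to include its synonyms, so a works
    filter matches any of them. same_as_lookup maps id -> sameAs list."""
    return sorted({requested_id, *same_as_lookup.get(requested_id, [])})

def works_filter(requested_id, same_as_lookup):
    # an Elasticsearch-style terms filter over the expanded id set
    # ("query.subjects.id" is an illustrative field path)
    return {"terms": {"query.subjects.id": expand_concept_ids(requested_id, same_as_lookup)}}
```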

Similarly, the concept pages for the two identifiers will be identical, whether by redirection or by virtue of containing the same data.

Choosing a "preferred" Concept

Eventually, this will be required for the Relation Embedder, which will replace synonymous Concepts in Works with a single preferred Concept.

This may also be required in order to create redirects for the Concepts API and pages. There are no real-world criteria with which to select a blessed Concept as they are all the same.

Eventually, we may need a persistent store to record the chosen preferred concepts.
