RFC 054: Authoritative ids with multiple Canonical ids.
Last updated
Some authoritative ids (including label-derived ones) correspond to more than one Concept in the Works catalogue. It is therefore not possible to link reliably by id between a work and a concept, because the canonical id used for a given concept in one work may differ from the canonical id used for the same concept in another work.
The root cause of this is that the ontologyType of a concept forms part of the key used to mint a canonical id, and the ontologyType of a concept is determined from the MARC field it comes from in a Sierra document.
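The effect can be illustrated with a minimal sketch. It assumes, purely for illustration, that the minting key is the triple of source system, ontologyType, and source id, and it uses a hash as a stand-in for "one distinct canonical id per distinct key". The function name, the key layout, and the example identifier are hypothetical; the real id minter is not described in this RFC.

```python
import hashlib


def mint_canonical_id(source_system: str, ontology_type: str, source_id: str) -> str:
    """Hypothetical stand-in for the id minter.

    Assumption: the key includes ontologyType, so the same authoritative id
    seen with two different types produces two different canonical ids.
    """
    key = f"{source_system}/{ontology_type}/{source_id}"
    # A truncated hash is used only to get a stable, distinct value per key.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]


# The same (made-up) authoritative id, found in two different MARC fields,
# is typed differently and therefore receives two canonical ids.
as_person = mint_canonical_id("lc-names", "Person", "n0000001")
as_agent = mint_canonical_id("lc-names", "Agent", "n0000001")
```

A work that refers to the concept as a Person and a work that refers to it as an Agent would then carry different canonical ids for what is the same authoritative record.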
Further, the Concepts pipeline currently assumes a 1:1 relationship between a Concept and an authoritative id. This causes some expected Concepts to be absent from the concepts API.
Finally, this blocks the implementation of Genres as Concepts, because we cannot reconcile genre-as-a-subject with genre-of-a-work whilst also marking genres with a distinct ontologyType.
Currently, the Concepts Aggregator extracts Concepts from Works and uses the bulk API to store them in the catalogue-concepts index, keyed on the authoritative id. The Recorder then reconciles these records with the corresponding official records and stores the combined record in the concepts-store index, keyed on the canonicalId.
The Aggregator will extract Concepts in the same way, but the bulk command will now include an ingest pipeline with append processors to collect ids and types.
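The intended result of collecting ids and types can be sketched in plain Python. This is a simulation of the outcome only, not Elasticsearch code, and the field names `canonicalIds` and `ontologyTypes`, the function name, and the record shape are assumptions made for the example.

```python
def aggregate(extracted_concepts):
    """Simulate the collected catalogue-concepts records.

    Each record is keyed on the authoritative id and accumulates every
    canonical id and ontologyType seen for it, without duplicates.
    """
    store = {}
    for concept in extracted_concepts:
        record = store.setdefault(
            concept["authoritativeId"],
            {"canonicalIds": [], "ontologyTypes": []},
        )
        collected = (
            ("canonicalIds", concept["canonicalId"]),
            ("ontologyTypes", concept["ontologyType"]),
        )
        for field, value in collected:
            # Append only values not already present, keeping first-seen order.
            if value not in record[field]:
                record[field].append(value)
    return store
```

With this shape, one catalogue concept record carries everything the Recorder needs to emit one output record per canonical id.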
The Recorder will then create a record for each canonicalId in the list, choosing the "best" ontologyType and applying it to all of the output records.
The "best" ontologyType is the most specific one, ranked as Concept < Agent < everything else. This hierarchy is already in use to choose the best type when the same authoritative id occurs in multiple places in a single document. There may be conflicts where multiple types of the same specificity are present on the same catalogue concept, but this is unlikely to occur, and if it does, it is likely to indicate an error in the source data.
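A minimal sketch of this ranking follows. The function names are hypothetical, and the tie-break (first type encountered among the most specific) is an arbitrary assumption, since the RFC treats such ties as probable data errors rather than defining a rule for them.

```python
def specificity(ontology_type: str) -> int:
    """Rank a type: Concept < Agent < everything else."""
    if ontology_type == "Concept":
        return 0
    if ontology_type == "Agent":
        return 1
    return 2


def best_ontology_type(ontology_types):
    """Return the most specific type from a non-empty list.

    Assumption: when several types share the highest specificity,
    the first one in the input order wins.
    """
    if not ontology_types:
        raise ValueError("at least one ontologyType is required")
    # max() returns the first maximal element, which gives the tie-break above.
    return max(ontology_types, key=specificity)
```

For example, a catalogue concept that has collected Concept, Agent, and Person would be recorded as a Person.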
This proposal also allows us to start considering same-as relationships in the Concepts API. Each entry in the concepts store can contain a list of the canonicalIds of all other concepts with the same authoritative id (possibly also including its own).
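The Recorder's output for one catalogue concept could then look like the sketch below. The record shape, the `sameAs` field name, and the `include_self` switch are assumptions for illustration; the RFC leaves open whether a record lists its own canonicalId.

```python
def fan_out(authoritative_id, canonical_ids, ontology_type, include_self=True):
    """Build one output record per canonical id.

    Every record gets the same chosen ontologyType and a sameAs list of the
    canonical ids that share the authoritative id.
    """
    records = []
    for canonical_id in canonical_ids:
        same_as = [
            other
            for other in canonical_ids
            if include_self or other != canonical_id
        ]
        records.append(
            {
                "canonicalId": canonical_id,
                "authoritativeId": authoritative_id,
                "type": ontology_type,
                "sameAs": same_as,
            }
        )
    return records
```

Because each output record is still keyed on a single canonicalId, existing links from Works continue to resolve, whichever canonical id a given work happens to use.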
The alternative to merging and splitting would be to have a separate record for each authoritative/canonical id pair in catalogue concepts.
For each authoritative id, the Recorder currently fetches one Concept Record by id. Changing it to search for multiple records would be a significant change, whereas the change to the output is much less extreme.
The alternative is for the Aggregator to first fetch any records it will overwrite, then populate the id and type lists accordingly, deduplicating their members as it goes. All of this adds significant complexity.
An append processor will do this declaratively and efficiently inside the database.
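As a rough sketch, an ingest pipeline definition using append processors might be expressed as the following request body. The `append` processor and its `field`, `value`, and `allow_duplicates` options are part of Elasticsearch; the field names, the pipeline description, and the use of template snippets to read values from the incoming document are assumptions, and how the pipeline interacts with documents already in the index is not shown here.

```python
import json

# Hypothetical pipeline body: append the incoming canonical id and type
# to list fields, skipping values that are already present.
concepts_pipeline = {
    "description": "Collect canonical ids and ontology types (illustrative)",
    "processors": [
        {
            "append": {
                "field": "canonicalIds",
                "value": ["{{{canonicalId}}}"],
                "allow_duplicates": False,
            }
        },
        {
            "append": {
                "field": "ontologyTypes",
                "value": ["{{{ontologyType}}}"],
                "allow_duplicates": False,
            }
        },
    ],
}

# Serialised form, as it would be sent when registering the pipeline.
pipeline_json = json.dumps(concepts_pipeline, indent=2)
```

The deduplication that the Aggregator would otherwise implement by hand is expressed here as a single option on each processor.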
Ideally, the root cause of this should be fixed, by removing ontologyType from the id minter. However, that is a very complex change and we would still need to take an approach like the one proposed here to collect the ontologyTypes and choose the most appropriate one.
This proposal does not consider exactly how the API, search, or Concepts pages will make use of the new sameAs data.
Essentially, this behaviour will be expected:
Similarly, the concept pages for the two identifiers will be identical, whether by redirection or by virtue of containing the same data.
Eventually, this will be required for the Relation Embedder, which will replace synonymous Concepts in Works with a single preferred Concept.
This may also be required in order to create redirects for the Concepts API and pages. There are no real-world criteria with which to select a blessed Concept as they are all the same.
Eventually, we may need a persistent store to record the chosen preferred concepts.