RFC 052: The Concepts Pipeline - phase one


This RFC describes the first phase of the Concepts Pipeline, which will be used to ingest and aggregate concepts.

Last modified: 2022-07-07T12:03:29+01:00

Context

Source Data

Authorities publish Source Documents containing External Identifiers, which we will fetch and process.

Works in the Collection contain references to these identifiers, and also contain Unidentified Concepts that only have a name.

In the first phase, only LCSH will be fetched. Later, other sources will be included, and merged using Wikidata as a source of sameAs relationships between the terms.

End Goal

The result of running data through this pipeline is an index that serves the concepts API with concept data harvested from external authorities (initially the Library of Congress, then MeSH and Wikidata, with the possibility of adding other sources as we see fit).

Overview

In order to populate a new index with relevant concept data harvested from external sources, we will create three new concepts pipelines, and modify the Sierra transformer to ensure that Wellcome canonical ids are minted for all concepts in Works. We will also add a stage to the Catalogue Pipeline that uses the resulting data from this pipeline to update Works.

The Concepts Ingest Pipeline

This pipeline will be triggered:

  • periodically, based on the update schedule of the source and how up-to-date we want each source to be

  • manually, if the update schedule is irregular, and we want to bring our data up-to-date immediately

It will:

  • fetch data from the source

  • transform it to a common format

  • store that common form in a database

The result of this is a store (ES Index) containing all the concepts from all chosen external sources, in a common format. This will contain Concepts not in use at Wellcome.

This will not contain any Concepts that are exclusive to Wellcome (those identified by name only in Works).

Records will only contain the data provided by the external source, and will not have been embellished with Wellcome data in any way.

Records will contain:

  • the identifier

  • the authority

  • the primary name

  • a list of alternative names
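
As a concrete illustration, a record in the common format might look something like the sketch below; the field names and example values are illustrative assumptions, not a final schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IngestRecord:
    """One concept as harvested from an external authority, in the
    common format. A sketch only: field names are not final."""
    identifier: str   # the authority's own identifier, e.g. an LCSH id
    authority: str    # which external source the record came from
    label: str        # the primary name
    alternative_labels: List[str] = field(default_factory=list)

# Example instantiation (values invented for illustration):
record = IngestRecord(
    identifier="sh00000000",
    authority="lc-subjects",
    label="Example subject",
    alternative_labels=["Subject, example"],
)
```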

The Concept Aggregation Pipeline

This pipeline will be triggered:

  • by the ingest of new Works by the Catalogue Pipeline

  • manually to ingest the data from all Works

It will:

  • extract any Concept Identifiers present on those Works

  • store them in a database

The result of this is a store (ES Index) containing all the concepts in use in Wellcome Collection Works.

This will contain a mixture of Concepts from external sources and those exclusive to The Collection.

The store will only contain records for those identifiers explicitly used in Works. Each record will contain:

  • the identifier

  • the authority

  • the name used for the concept on the Work

  • the corresponding Wellcome canonical identifier
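
A matching sketch of an aggregation record, under the same caveat that the names and shapes here are assumptions rather than a final schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AggregationRecord:
    """One concept as used on a Work in the catalogue. A sketch only."""
    identifier: Optional[str]  # the external identifier, where the concept has one
    authority: Optional[str]   # the authority that issued it, if any
    label: str                 # the name used for the concept on the Work
    canonical_id: str          # the corresponding Wellcome canonical identifier
```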

The Concept Augmentation Pipeline

This pipeline will be triggered when either of the stores at the end of the Aggregation or Ingest pipelines change.

It will:

  • combine corresponding records from the Aggregation and Ingest pipelines

  • from that, build the knowledge graph in the concepts store

    • at this point there are no relations in the graph; adding them is future work

  • from that, populate the concepts index to be used by the Concepts API
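
In outline, the combination step might look like the following sketch, assuming record fields like those sketched in the previous two sections (here as plain dicts); the merge rule shown (prefer the authority's current label, fall back to the label used on the Work) is an assumption to be confirmed in development.

```python
from typing import Optional

def combine(ingested: Optional[dict], aggregated: dict) -> dict:
    """Build one document for the concepts index from the matching
    Ingest and Aggregation records. A sketch, not a final schema."""
    return {
        "canonicalId": aggregated["canonical_id"],
        "identifier": aggregated["identifier"],
        "authority": aggregated["authority"],
        # a Wellcome-only concept has no ingested record to draw on
        "label": ingested["label"] if ingested else aggregated["label"],
        "alternativeLabels": ingested["alternative_labels"] if ingested else [],
    }
```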

Update to the Catalogue Pipeline

A new stage will be added at a point after the Merger. This new stage will:

  • find Concepts mentioned in the Work

  • request the corresponding Concept records from the Concepts API

  • replace the data in the Concept objects in the Work with the data from the Concepts API
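
A rough sketch of that stage, with an assumed Work shape and a hypothetical Concepts API client (the real API shape is out of scope here):

```python
def find_concepts(work: dict):
    """Yield the Concept objects on a Work; the traversal here is a
    simplified assumption about the Work's shape."""
    for subject in work.get("subjects", []):
        yield from subject.get("concepts", [])

def augment_work(work: dict, concepts_api) -> dict:
    """Replace catalogued concept data with the authority's current data.
    `concepts_api.lookup` is a hypothetical client method."""
    for concept in find_concepts(work):
        current = concepts_api.lookup(concept["canonicalId"])
        if current is not None:
            concept["label"] = current["label"]
    return work
```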

Skeleton Delivery Plan

  • The Ingest and Aggregation pipelines can be worked on independently in parallel at first.

  • To be complete, the Aggregation pipeline also requires changes to the Sierra Transformer (catalogue pipeline) so that identifiers are minted for unidentified concepts.

  • In the first phase, with only one source of external identifiers, the Augmentation pipeline should skip the knowledge graph, rather than creating a kind of graph-less graph.

  • Finally, the new catalogue pipeline stage can be added.

    • It is also possible to work on adding this using the current dummy implementation of the API.

Other Considerations

Workflow Orchestration

These pipelines will be implemented using AWS Step Functions.
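
For example, the Ingest pipeline's fetch, transform, and store steps could map onto a state machine like the following sketch, expressed as an Amazon States Language definition in a Python dict; the state names and Lambda ARNs are placeholders, not decisions.

```python
# A minimal Step Functions state machine for fetch -> transform -> store.
ingest_state_machine = {
    "StartAt": "FetchSource",
    "States": {
        "FetchSource": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:000000000000:function:concepts-fetch",
            "Next": "TransformToCommonFormat",
        },
        "TransformToCommonFormat": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:000000000000:function:concepts-transform",
            "Next": "StoreRecords",
        },
        "StoreRecords": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:000000000000:function:concepts-store",
            "End": True,
        },
    },
}
```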

ID Minter

We do not need to provide our own identifiers for external concepts or identifiers that we do not use in works. This means that these pipelines do not need to interact with the ID Minter or its database. All Wellcome Canonical IDs used in the Augmentation pipeline will come from records in the Concept Aggregation pipeline.

Batch vs single-message

Concepts Aggregation will run in two distinct modes: a batch mode that sees the whole set of Works and rebuilds its list from scratch, and a single-message mode that adds any new Concepts found on a single updated Work.
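
The two modes could share the same core, as in this sketch; `store.upsert` stands in for a hypothetical wrapper around the aggregation index, and the concepts are assumed to have been extracted from Works by the same kind of traversal sketched above.

```python
from typing import Iterable

def aggregate_work(concepts: Iterable[dict], store) -> None:
    """Single-message mode: upsert the Concepts found on one updated Work."""
    for concept in concepts:
        store.upsert(concept)

def aggregate_all(all_works_concepts: Iterable[Iterable[dict]], store) -> None:
    """Batch mode: run the same upsert over the whole set of Works."""
    for concepts in all_works_concepts:
        aggregate_work(concepts, store)
```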

Periodic or event-based?

Originally, we discussed updating the Concept Aggregation Pipeline periodically based on the Works Snapshot.

If we drive it from Works ingest, then we do not need this periodic update.

With this approach, some concepts that have been deleted from all Works may be left in the final index. However, deleting every reference to a given concept is unlikely to be a common occurrence, and the most appropriate resolution could be an occasional manual refresh in the same manner as reindexing the catalogue pipeline, or a periodic full refresh on a slow schedule.

Keeping up to date with deletions is less pressing than with new data.

The alternative is to also store the identifiers of all the Works that refer to a given Concept on the Concept record and remove the current Work's id if it no longer mentions that Concept. This is likely to be an inefficient solution for a rare occurrence.

Why not just use Wikidata?

As we intend to use Wikidata as the authoritative source for sameAs relationships, it may seem appropriate to use it as the source for all external identifiers as well.

We cannot use Wikidata alone, because:

  • It is not necessarily complete: there may be LCSH or MeSH (or other) terms that we use but that have not yet been added to Wikidata.

  • We also require other data (lists of alternative names, descriptions), which are exclusively published by the source authority and not duplicated elsewhere.

  • We want to use the label from LCSH as the preferred label for the concept. This is not always present on a Wikidata record.

  • The "never wrong for long" approach of a Wiki is OK for relationships, but relying on it for the preferred name of a concept may be problematic

When does a Work know?

One of the goals of this product is to replace out-of-date concept names on Works that were catalogued a long time ago and where the official name of the concept has changed.

To do this, the current name corresponding to an identifier, as given by the source Authority, should be stored on the Work and returned by the Works API in place of the name from the original catalogue data.

We need to denormalise the concepts back on to the works in order to do this, once the data from the Authority has been processed.

Ideally, a relevant update in the Concepts pipeline would trigger concept changes in affected Works. However, once the application is in a steady state, such name changes are likely to be infrequent.

Because this is likely to be infrequent, the reprocessing can happen on the next catalogue pipeline reindex. We will have to reindex once after the concepts index is first populated; after that, updates can wait for the next scheduled reindex. This does mean there could be a significant delay between a name change at an authority and the corresponding update in Works, but if a particular change is deemed urgent, a manual reindex can be triggered.

We should log significant differences for reporting to cataloguing staff, in order to help keep the source records up to date with the latest names. The exact definition of significance can be explored in development, but is likely to involve ignoring terminal punctuation.
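
As a starting point, "significance" could be implemented as any difference beyond terminal punctuation, as in this sketch; the exact rule is to be explored in development.

```python
import string

def _strip_terminal(label: str) -> str:
    """Remove trailing punctuation and whitespace from a label."""
    return label.rstrip(string.punctuation + " ")

def is_significant_change(old_label: str, new_label: str) -> bool:
    """True if two labels differ by more than terminal punctuation."""
    return _strip_terminal(old_label) != _strip_terminal(new_label)

assert not is_significant_change("Epidemiology.", "Epidemiology")
assert is_significant_change("Epidemiolgy", "Epidemiology")
```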

Data Sources

The Library of Congress publishes monthly lists of changes, implying that it would be sufficient to run the Ingest Pipeline for that source on a similar frequency.

Further Work

The following aspects are not considered here.

  • Knowledge Graph construction

  • Synonyms and redirects

  • Composite Identifiers

Local Glossary

Authority

An external organisation, such as the Library of Congress, that publishes a controlled vocabulary of concepts.

External Identifier

An identifier assigned to a concept by an Authority, e.g. an LCSH identifier.

Source Document

A document published by an Authority containing its concepts and their External Identifiers, which the Ingest Pipeline fetches and processes.

Unidentified Concept

A concept on a Work that has only a name, with no External Identifier.
