RFC 051: Ingesting Library of Congress concepts

This RFC outlines the design for the first phase of the concepts pipeline, specifically focusing on ingesting concepts from the Library of Congress (LoC) and preparing them for use in the Wellcome Collection catalogue.

Last modified: 2022-07-08T10:08:48+01:00

Context

As per the high-level design of RFC 052, one of the key areas of the concepts pipeline topology is an ingest section. This RFC outlines how this will be implemented for the first work phase (attaching only LoC concepts to works).

High level outline of the architecture in this RFC

Retrieving concepts from bulk exports

LoC provide bulk exports of all subject headings (LCSH) and names (LCNAF). These come in a variety of flavours: in both MADS and SKOS vocabularies; and in N-Triples, JSON-LD, Turtle, and XML formats. All of the bulk exports are gzipped.

We can re-run the ingest process on a schedule to capture the (infrequent) changes to LoC concepts, which are almost always additions rather than changes/deletions.

A more detailed view of how a concepts ingestor might work

In the architecture diagram above, the "adapter" and "transformer" are outlined as separate services. In practice, we can perform this mini-ETL operation more efficiently and simply as a streaming operation within one service when we use the gzipped JSON-LD files; these express "one concept per line", so we needn't worry about traversing any trees, or indeed caring much about the underlying RDF.
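As a rough sketch of that streaming operation (the file name and the assumption of a local copy are illustrative, not part of this RFC), a single service could read the gzipped JSON-LD export line by line and hand each concept straight to the transform step:

import gzip
import json

# Hypothetical local copy of an LoC bulk export in gzipped JSON-LD,
# laid out as one self-contained concept per line.
DUMP_PATH = "lcsh.skos.ndjson.gz"

def stream_concepts(path):
    """Yield one parsed concept per line; no tree traversal is needed."""
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)

for concept in stream_concepts(DUMP_PATH):
    # the "adapter" and "transformer" collapse into a single streaming pass here
    print(concept.get("@id"))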

The "transform" step here is (a) the most ill-defined and (b) the only step that needs to differ for LCSH and LCNAF ingestors.

Transforming external concepts

RFC 050 outlined the rough design of a concepts API: obviously, we need to store the external concepts in a way that is sufficient to construct these API documents. We also don't want to store extraneous data, as the majority of the concepts won't be used in the catalogue.

Based on this, we might transform the LCSH documents into something like:

{
  "_id": "lc-subjects_sh12345678",
  "identifier": {
    "value": "sh12345678",
    "identifierType": "lc-subjects",
    "type": "Identifier"
  },
  "label": "<value of skos:prefLabel>",
  "alternativeLabels": [
    "<values of skos:altLabel>"
  ]
}

where the document _id is a QName-type identifier (but using an underscore instead of a slash so as not to cause headaches in ES).

We might also want to add some metadata about the source data provenance (eg the date that the dump is from).
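A minimal sketch of what that transform might produce, assuming the skos:prefLabel and skos:altLabel values have already been extracted from the concept's JSON-LD; the output matches the format proposed above, and the provenance field name is purely illustrative:

def lcsh_identifier(uri):
    """Extract the local identifier (eg "sh12345678") from an id.loc.gov URI."""
    return uri.rstrip("/").split("/")[-1]

def to_concept_document(uri, pref_label, alt_labels, dump_date=None):
    value = lcsh_identifier(uri)
    document = {
        # QName-style id, joined with an underscore rather than a slash
        # so it is safe to use as an Elasticsearch document _id
        "_id": f"lc-subjects_{value}",
        "identifier": {
            "value": value,
            "identifierType": "lc-subjects",
            "type": "Identifier",
        },
        "label": pref_label,
        "alternativeLabels": list(alt_labels),
    }
    if dump_date:
        # optional provenance metadata, eg the date the bulk export is from
        document["sourceDataDate"] = dump_date
    return document

doc = to_concept_document(
    "http://id.loc.gov/authorities/subjects/sh12345678",
    "<value of skos:prefLabel>",
    ["<values of skos:altLabel>"],
    dump_date="2022-07-08",
)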

Ingesting changes

This section is a bit speculative - we don't necessarily need the answers right now.

We can start by just writing all of the source into an empty index - this is inefficient and we won't know what (if anything) changed, but it will never be wrong.

We want to know what's changed in the source data because we want to trigger downstream activity based on updates. To know what's changed, we have to start comparing to what's there already.

This is of course very inefficient, but we can pass that inefficiency on to the ES cluster by using the doc_as_upsert feature of the update API alongside the noop feature: with a single API call we'll be able to know whether anything was added.
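A sketch of that single call using the Python Elasticsearch client (8.x signature; the index name and connection details are placeholders): doc_as_upsert creates the document if it's missing, and the update API's noop detection reports when an identical document is already there.

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # placeholder connection details
INDEX = "concepts-store"  # hypothetical index name

def upsert_concept(document):
    """Send one transformed concept and report whether anything changed."""
    doc = dict(document)
    doc_id = doc.pop("_id")
    response = es.update(index=INDEX, id=doc_id, doc=doc, doc_as_upsert=True)
    return response["result"]  # "created", "updated" or "noop"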

This leaves us with detecting deletions: the least likely kind of change (in fact, we're not sure if it ever happens?). I don't think this needs to be addressed right now - if it's something we become aware of we can just write into an empty index again.

Questions

  • The format proposed above doesn't follow the existing schema for identifierType - rather than just the id, we have previously used an object for identifierType, like

    {
      "id": "lc-subjects",
      "label": "Library of Congress Subject Headings (LCSH)",
      "type": "IdentifierType"
    }

    Is this OK? Yes, we prefer this

  • Is the non-default _id on the documents actually necessary? It's useful to be able to construct queries for these concepts, and the identifier proposed is scalable to various source schemas (see the sketch below).
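For example, a deterministic _id lets callers fetch concepts directly rather than search for them (a sketch; the index name and ids are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")  # placeholder connection details

# fetch concepts directly by their QName-style document ids
response = es.mget(index="concepts-store", ids=["lc-subjects_sh12345678"])
found = [hit["_source"] for hit in response["docs"] if hit["found"]]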
