RFC 051: Ingesting Library of Congress concepts

As per the high-level design of RFC 052, one of the key areas of the concepts pipeline topology is an ingest section. This RFC outlines how this will be implemented for the first work phase (attaching only LoC concepts to works).

High level outline of the architecture in this RFC

Retrieving concepts from bulk exports

LoC provide bulk exports of all subject headings (LCSH) and names (LCNAF). These come in a variety of flavours: in both MADS and SKOS vocabularies; and in N-Triple, JSON-LD, Turtle, and XML formats. All of the bulk exports are gzipped.

We can re-run the ingest process on a schedule to capture the (infrequent) updates to LoC concepts, which are almost always additions rather than changes or deletions.

In the architecture diagram above, the "adapter" and "transformer" are outlined as separate services. In practice, we can perform this mini-ETL operation more efficiently and simply as a streaming operation within a single service if we use the gzipped JSON-LD files: these express "one concept per line", so we don't need to traverse any trees or care much about the underlying RDF.

A more detailed view of how a concepts ingestor might work

The "transform" step here is (a) the most ill-defined and (b) the only step that needs to differ for LCSH and LCNAF ingestors.
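
As an illustration of that single-service approach, here is a minimal sketch (in Python, using the official Elasticsearch client) that streams a gzipped bulk file line by line and bulk-indexes the transformed documents. The file name, the concepts-store index name, and the transform_lcsh function are placeholders rather than the pipeline's real names; a possible shape for transform_lcsh is sketched in the next section.

import gzip
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk


def ingest(path, transform, index="concepts-store"):
    """Stream a gzipped 'one concept per line' export straight into Elasticsearch."""
    es = Elasticsearch("http://localhost:9200")  # placeholder connection details

    def actions():
        with gzip.open(path, "rt", encoding="utf-8") as bulk_file:
            for line in bulk_file:
                document = transform(json.loads(line))
                doc_id = document.pop("_id")  # the QName-style id becomes the ES _id
                yield {"_index": index, "_id": doc_id, "_source": document}

    for ok, item in streaming_bulk(es, actions()):
        if not ok:
            print("failed to index:", item)


# The transform argument is the only part that differs between the two sources,
# e.g. ingest("lcsh.skos.jsonld.gz", transform_lcsh) for subject headings.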

Transforming external concepts

RFC 050 outlined the rough design of a concepts API: obviously, we need to store the external concepts in a way that is sufficient to construct these API documents. We also don't want to store extraneous data, as the majority of the concepts won't be used in the catalogue.

Based on this, we might transform the LCSH documents to something like:

{
  "_id": "lc-subjects_sh12345678",
  "identifier": {
    "value": "sh12345678",
    "identifierType": "lc-subjects",
    "type": "Identifier"
  },
  "label": "<value of skos:prefLabel>",
  "alternativeLabels": [
    "<values of skos:altLabel>"
  ]
}

where the document _id is a QName-type identifier (but using an underscore instead of a slash so as not to cause headaches in ES).

We might also want to add some metadata about the source data provenance (e.g. the date that the dump is from).
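
As an illustration, a transform producing that document might look something like the sketch below. It is only a sketch: it assumes each line of the export carries an @graph array whose main node (the one whose @id matches the concept URI) holds skos:prefLabel and skos:altLabel, and the exact field layout of the LoC files would need checking.

def transform_lcsh(jsonld):
    """Hypothetical LCSH transform; the source field names are assumptions."""
    concept_uri = jsonld["@id"]                        # e.g. ".../authorities/subjects/sh12345678"
    identifier_value = concept_uri.rsplit("/", 1)[-1]  # "sh12345678"

    # The node in the graph that describes the concept itself
    node = next(n for n in jsonld["@graph"] if n.get("@id") == concept_uri)

    def labels(value):
        # skos fields may hold a single label, a label object, or a list of either
        items = value if isinstance(value, list) else [value]
        return [item["@value"] if isinstance(item, dict) else item for item in items]

    return {
        "_id": f"lc-subjects_{identifier_value}",      # underscore, not slash
        "identifier": {
            "value": identifier_value,
            "identifierType": "lc-subjects",
            "type": "Identifier",
        },
        "label": labels(node["skos:prefLabel"])[0],
        "alternativeLabels": labels(node.get("skos:altLabel", [])),
    }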

Ingesting changes

This section is a bit speculative - we don't necessarily need the answers right now.

We can start by just writing all of the source into an empty index - this is inefficient and we won't know what (if anything) changed, but it will never be wrong.

We want to know what's changed in the source data because we want to trigger downstream activity based on updates. To know what's changed, we have to start comparing to what's there already.

This is of course very inefficient, but we can pass that inefficiency onto the ES cluster by using the doc_as_upsert feature of the update API alongside noop detection: with a single API call we'll be able to know whether anything was added.
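
That single call might look something like the sketch below, again assuming the Python Elasticsearch client and the concepts-store index name; transformed_document stands in for the output of the transform step, and the result field of the response distinguishes "created", "updated" and "noop".

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

# Stand-in for a document produced by the transform step
transformed_document = {
    "identifier": {
        "value": "sh12345678",
        "identifierType": "lc-subjects",
        "type": "Identifier",
    },
    "label": "<value of skos:prefLabel>",
    "alternativeLabels": [],
}

response = es.update(
    index="concepts-store",          # assumed index name
    id="lc-subjects_sh12345678",
    doc=transformed_document,
    doc_as_upsert=True,              # create the document if it isn't there yet
    detect_noop=True,                # report "noop" when nothing actually changed
)

if response["result"] != "noop":
    pass  # e.g. trigger downstream activity for this concept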

This leaves us with detecting deletions: the least likely kind of change (in fact, we're not sure if it ever happens?). I don't think this needs to be addressed right now - if it's something we become aware of we can just write into an empty index again.

Questions

  • The format proposed above doesn't follow the existing schema for identifierType - rather than just the id, we have previously used an object for identifierType, like

    {
      "id": "lc-subjects",
      "label": "Library of Congress Subject Headings (LCSH)",
      "type": "IdentifierType"
    }

    Is this OK? Yes, we prefer this

  • Is the non-default _id on the documents actually necessary? It's useful to be able to construct queries for these concepts, and the identifier proposed is scalable to various source schemas
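
    For example, with the _id scheme above a known concept can be fetched directly rather than searched for (a sketch, again assuming the Python client and the concepts-store index name):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder connection details

    # Look up a specific LCSH concept by its constructed id, no query DSL needed
    lcsh_concept = es.get(index="concepts-store", id="lc-subjects_sh12345678")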
