RFC 051: Ingesting Library of Congress concepts
Last updated
As per the high-level design of RFC 052, one of the key areas of the concepts pipeline topology is an ingest section. This RFC outlines how this will be implemented for the first work phase (attaching only LoC concepts to works).
LoC provide bulk exports of all subject headings (LCSH) and names (LCNAF). These come in a variety of flavours: in both MADS and SKOS vocabularies; and in N-Triple, JSON-LD, Turtle, and XML formats. All of the bulk exports are gzipped.
We can re-run the ingest process on a schedule to capture the (infrequent) changes to LoC concepts, which are almost always additions rather than changes/deletions.
In the architecture diagram above, the "adapter" and "transformer" are outlined as separate services. In practice, we can perform this mini-ETL operation more efficiently and simply as a streaming operation within one service when we use the gzipped JSON-LD files; these express "one concept per line" and so we needn't worry about traversing any trees or indeed caring much about the underlying RDF.
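As a rough illustration of the single-service streaming approach, the sketch below reads a gzipped, line-delimited JSON file and passes each line through a transform without ever holding the whole export in memory. The function names (stream_concepts, ingest) and the transform callable are hypothetical and not part of this RFC; the only assumption about the source is the "one concept per line" property described above.

```python
import gzip
import io
import json
from typing import IO, Callable, Iterator


def stream_concepts(raw: IO[bytes]) -> Iterator[dict]:
    """Yield one parsed JSON document per non-empty line of a gzipped stream."""
    # gzip.open accepts an already-open binary file object, so this works for
    # a local file or for a streamed HTTP response body alike.
    with gzip.open(raw, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # tolerate blank lines in the export
                yield json.loads(line)


def ingest(raw: IO[bytes], transform: Callable[[dict], dict]) -> Iterator[dict]:
    """Adapter and transformer as one lazy pipeline (hypothetical shape)."""
    for source_doc in stream_concepts(raw):
        yield transform(source_doc)


if __name__ == "__main__":
    # In-memory stand-in for a bulk export; real exports are far larger.
    sample = gzip.compress(b'{"@id": "a"}\n{"@id": "b"}\n')
    for doc in ingest(io.BytesIO(sample), transform=lambda d: d):
        print(doc)
```

Because both functions are generators, only one line of the export is decompressed and parsed at a time, which is what makes a separate adapter service unnecessary here.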
The "transform" step here is (a) the most ill-defined and (b) the only step that needs to differ for LCSH and LCNAF ingestors.
RFC 050 outlined the rough design of a concepts API: obviously, we need to store the external concepts in a way that is sufficient to construct these API documents. We also don't want to store extraneous data, as the majority of the concepts won't be used in the catalogue.
Based on this, we might transform the LCSH documents to something like:
where the document _id is a QName-type identifier (but using an underscore instead of a slash so as not to cause headaches in ES).
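One possible way to derive such an identifier is sketched below. It assumes the concept's URI ends in its local identifier (as id.loc.gov URIs do) and that each ingestor supplies its own namespace prefix. The helper name make_document_id, the "lc-subjects" prefix, and the example identifier are all illustrative assumptions, not values taken from this RFC.

```python
from urllib.parse import urlparse


def make_document_id(concept_uri: str, prefix: str) -> str:
    """Build a 'prefix_localname' document ID from a concept URI.

    The prefix is a hypothetical per-source namespace; the underscore
    separator follows the convention described above.
    """
    # Take the final path segment of the URI as the local identifier.
    local_name = urlparse(concept_uri).path.rstrip("/").rsplit("/", 1)[-1]
    if not local_name:
        raise ValueError(f"No local identifier found in {concept_uri!r}")
    return f"{prefix}_{local_name}"


if __name__ == "__main__":
    # Illustrative URI and prefix only.
    print(make_document_id(
        "http://id.loc.gov/authorities/subjects/sh85000001", "lc-subjects"
    ))
```

Keeping the prefix as a parameter means the same helper could serve both the LCSH and LCNAF ingestors, and any later source schemas.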
We might also want to add some metadata about the source data provenance (eg the date that the dump is from).
This section is a bit speculative - we don't necessarily need the answers right now.
We can start by just writing all of the source into an empty index - this is inefficient and we won't know what (if anything) changed, but it will never be wrong.
We want to know what's changed in the source data because we want to trigger downstream activity based on updates. To know what's changed, we have to start comparing to what's there already.
This is of course very inefficient, but we can pass that inefficiency onto the ES cluster by using the doc_as_upsert feature of the update API alongside the noop feature: with a single API call we'll be able to know whether anything was added.
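To make the doc_as_upsert idea concrete, here is a minimal sketch that builds a bulk request body of update actions and then tallies the per-document result values ("created", "updated" or "noop") that Elasticsearch reports back. It deliberately avoids any client library: sending the body to the _bulk endpoint is left out, and the function names and the "concepts" index name are hypothetical. It relies on noop detection being enabled, which is the default for the update API.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def build_bulk_body(index: str, docs: Iterable[Tuple[str, dict]]) -> str:
    """Serialise (doc_id, doc) pairs as NDJSON update actions for _bulk."""
    lines = []
    for doc_id, doc in docs:
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Payload line: insert the doc if missing, otherwise merge it in.
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


def summarise_bulk_response(response: dict) -> Counter:
    """Count result values across the items of a bulk response."""
    # Failed items carry an "error" object and no "result" key.
    return Counter(
        item["update"].get("result", "error") for item in response["items"]
    )


if __name__ == "__main__":
    print(build_bulk_body("concepts", [("example_1", {"label": "Example"})]))
```

Any document counted as "created" or "updated" is a candidate for triggering downstream activity, while "noop" documents can be ignored.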
This leaves us with detecting deletions: the least likely kind of change (in fact, we're not sure if it ever happens?). I don't think this needs to be addressed right now - if it's something we become aware of we can just write into an empty index again.
The format proposed above doesn't follow the existing schema for identifierType - rather than just the id we have previously used an object for identifierType, like
Is this OK? Yes, we prefer this
Is the non-default _id on the documents actually necessary? It's useful to be able to construct queries for these concepts, and the identifier proposed is scalable to various source schemas.