
RFC 056: Prismic to Elasticsearch ETL pipeline


This RFC proposes a mechanism for extracting data from Prismic, transforming it, and loading it into Elasticsearch to make our editorial content more discoverable via an API.

Last modified: 2023-03-02T11:39:12+00:00

Context

In order to make our editorial content - including stories, comics, exhibitions and events - more discoverable, we want to be able to search it via an API as described in RFC 055.

While Prismic does provide some search functionality of its own (which we have been using for initial versions of unified search), we want more control and fewer limitations. To achieve this we want to use Elasticsearch, as we do for our other search services. As such, we need to get data from Prismic into Elasticsearch: this RFC will propose the mechanism by which we achieve that.

Desiderata

  • Changes (including additions and edits, but possibly not deletions - see the questions below) in Prismic are reflected promptly in the Elasticsearch index

  • Full reindexes are easy, quick and cheap to perform

  • Changes to the data mapping (and the index mapping) can be made easily by any developer

Prior art

Both the catalogue pipeline and the concepts pipeline extract data from external sources, transform it, and load it into Elasticsearch. They have similar architectures:

[Diagram: existing pipeline architecture]

The adapter/ingestor stages are kept separate from the transformer/aggregator stages in order to isolate the (often complex!) concern of getting data out of external APIs from that of selecting the parts of the data we want and transforming them.

Most of the adapters receive updates by polling the source APIs at regular intervals for documents updated in the intervening time period. Where deletions are indicated only by the absence of data from subsequent requests to the source, we run additional "deletion checker" services, which iterate over all currently stored records and delete any that are missing from the source.
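
As an illustration only, here is a minimal sketch of that polling and deletion-checking pattern; the Source and Store interfaces and every function name below are hypothetical stand-ins, not our actual services:

// Hypothetical interfaces standing in for a source API and our own datastore
interface Source {
  fetchUpdatedSince(since: Date): Promise<{ id: string; payload: unknown }[]>;
  exists(id: string): Promise<boolean>;
}

interface Store {
  upsert(record: { id: string; payload: unknown }): Promise<void>;
  listIds(): Promise<string[]>;
  delete(id: string): Promise<void>;
}

// Regular polling: only records changed within the window are re-fetched
export async function pollWindow(source: Source, store: Store, since: Date) {
  for (const record of await source.fetchUpdatedSince(since)) {
    await store.upsert(record);
  }
}

// Deletion checking: deletions are only visible as absences in the source,
// so walk every stored record and remove any the source no longer returns
export async function checkDeletions(source: Source, store: Store) {
  for (const id of await store.listIds()) {
    if (!(await source.exists(id))) {
      await store.delete(id);
    }
  }
}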

Proposal

The Prismic corpus is smaller than our other corpora, and this ETL pipeline is very linear and self-contained. However, Prismic data is fairly heavily normalised, so we need to build a solution that can (a) denormalise data from linked documents onto our "primary" documents and (b) reflect changes to these linked documents ("secondary" documents) on all of the primary documents on which they appear.

A basic example of this: the article type has a contributors field, which links to the role and person types:

{
  ...
  "contributors": {
    "role": {
      "id": "<foreign key>",
      "type": "editorial-contributor-roles",
      ...
    },
    "person": {
      "id": "<foreign key>",
      "type": "people",
      ...
    }
  }
}

Fortunately, Prismic provides an API, graphQuery, which can do this denormalisation for us using a GraphQL-like syntax. This straightforwardly solves the first problem (denormalising linked data), but not the second (reflecting changes to it).

The proposed solution to the second problem is to store the IDs of the secondary documents alongside the other information we index in Elasticsearch for primary documents:

{
  "display": {
    // Opaque JSON to be displayed by the API
  },
  "query": {
    // Fields for querying, filtering etc
    ...
    // List of identifiers (foreign keys) of linked documents
    "ids": [
      "1",
      "2",
      "3"
    ]
  }
}
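
To illustrate what these stored identifiers are for, here is a hedged sketch of the lookup the pipeline below relies on: finding the already-indexed primary documents affected by a change to a secondary document. The index name prismic-content and the node URL are placeholders for illustration, not decisions made in this RFC.

import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "https://localhost:9200" });

// Find primary documents whose stored foreign keys ("query.ids") include
// any of the secondary documents that have just changed in Prismic
async function affectedPrimaryDocuments(secondaryIds: string[]) {
  const response = await es.search({
    index: "prismic-content",
    _source: false,
    query: { terms: { "query.ids": secondaryIds } },
  });
  return response.hits.hits.map((hit) => hit._id);
}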

Then we can build a pipeline that works as follows:

  1. A 'window generator' Lambda, triggered on a schedule, which generates a payload representing a time period/window that is sent to (2).

  2. A 'Prismic ETL' Lambda which consumes time periods (potentially half-bounded or unbounded, for complete reindexes) and then:

    i. Queries Prismic for all documents (including denormalised data on primary documents) updated within the time window.

    ii. For all secondary documents, queries the ES index for already-indexed primary documents that contain them.

    iii. Queries Prismic for all the documents (including denormalised data) from (ii) that are not part of the list returned by (i).

    iv. Transforms the resultant primary documents into JSON objects as described above.

    v. Indexes these into an Elasticsearch cluster using the Elasticsearch JS client's bulk helpers.

For complete reindexes, it would be straightforward to trigger the Prismic ETL Lambda from a local script with a payload that covers all documents. In this case, it would also be an easy optimisation to disable steps (ii) and (iii), as all documents would be fetched regardless.
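
To make the shape of the Prismic ETL Lambda more concrete, here is a hedged sketch of its handler, assuming the @prismicio/client (v7) and @elastic/elasticsearch libraries. The repository name, index name, graphQuery and transform are illustrative placeholders rather than decisions made in this RFC, and steps (ii) and (iii) are elided.

import * as prismic from "@prismicio/client";
import { Client as Elasticsearch } from "@elastic/elasticsearch";

// Payload produced by the window generator; either bound may be omitted
// for half-bounded or unbounded windows (i.e. complete reindexes)
type Window = { since?: string; until?: string };

const prismicClient = prismic.createClient("wellcomecollection"); // repository name assumed
const es = new Elasticsearch({ node: "https://localhost:9200" });

// Placeholder graphQuery: selects fields from linked role/person documents
const graphQuery = `{ articles { title contributors { role person } } }`;

// Placeholder transform into the { display, query } shape described above
const transform = (doc: prismic.PrismicDocument) => ({
  id: doc.id,
  display: doc.data,
  query: { ids: [] as string[] },
});

export const handler = async (window: Window) => {
  // (i) Fetch documents updated in the window, with linked documents
  // denormalised for us via Prismic's graphQuery syntax
  const filters = [
    ...(window.since
      ? [prismic.filter.dateAfter("document.last_publication_date", window.since)]
      : []),
    ...(window.until
      ? [prismic.filter.dateBefore("document.last_publication_date", window.until)]
      : []),
  ];
  const updated = await prismicClient.dangerouslyGetAll({ filters, graphQuery });

  // (ii) and (iii) would go here: look up already-indexed primary documents
  // containing any updated secondary documents (see the query sketch above)
  // and re-fetch the ones not already included in `updated`

  // (iv) Transform the resultant primary documents
  const documents = updated.map(transform);

  // (v) Bulk-index using the Elasticsearch JS client's bulk helper
  await es.helpers.bulk({
    datasource: documents,
    onDocument: (doc) => ({ index: { _index: "prismic-content", _id: doc.id } }),
  });
};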

Technical implementation points

The intention is that the Prismic ETL Lambda will be written in TypeScript for maintainability. One disadvantage of this is that we lose some of the patterns and tools that Scala gives us for reactive streaming data pipelines, so I suggest we try RxJS for this purpose.
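
For instance (a sketch only, with fetchDocumentIds, fetchAndTransform and bulkIndex as hypothetical placeholders), RxJS operators could give us back some of that streaming vocabulary, bounding the concurrency of fetches and batching documents for bulk indexing:

import { bufferCount, from, lastValueFrom, mergeMap } from "rxjs";

// Hypothetical placeholders for the ETL steps described in the proposal
declare function fetchDocumentIds(): Promise<string[]>;
declare function fetchAndTransform(id: string): Promise<object>;
declare function bulkIndex(batch: object[]): Promise<void>;

export async function runEtl(): Promise<void> {
  const pipeline = from(await fetchDocumentIds()).pipe(
    mergeMap((id) => fetchAndTransform(id), 10), // at most 10 fetches in flight
    bufferCount(500), // group transformed documents into bulk-sized batches
    mergeMap((batch) => bulkIndex(batch), 1) // index one batch at a time
  );
  await lastValueFrom(pipeline, { defaultValue: undefined });
}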

Questions and potential issues

  • What do we do about deletions? Do we know if they happen? We could write something similar to the CALM deletion checker if necessary.
    Initial answer: having checked with the editorial team, deletions (and/or archivals) very rarely happen with our content. I suggest we hold off on solving this for our initial efforts, especially given our intention to make full reindexes easy.

  • What about using the Prismic webhook rather than polling to detect updates?
    Initial answer: while this has some clear advantages (immediacy of updates being one), I decided against it for the following reasons:
    • Reliability issues: if a single update is missed because of bugs in our services or problems on Prismic's side, we have no way of knowing that we missed it.
    • Not useful for complete reindexes: with a similar implementation effort (due to the necessity of storing relationships between primary and secondary documents), the webhook solution does nothing to solve the problem of complete reindexes. We would have to build a service or script to scroll over every document and pass every identifier to the webhook service, which would be both inefficient and time-consuming to build.
