
RFC 047: Changing the structure of the Catalogue API index



We use the Elasticsearch index for the catalogue API for several purposes:

  • To serialise the public API responses

  • To index and search documents

  • To help debug the API and the pipeline

Currently the documents in this index are a serialisation of the "internal model", which has to support all of these uses. This causes problems: when the API returns a response, it has to convert the internal model into a display model, creating a strong coupling between the internal model and the API that has been a long-running source of complexity.

This RFC proposes a new structure for the Catalogue API index which should remove this coupling.

Proposal

We restructure documents in the API index to have three top-level fields:

  • The display field contains the complete display document as a block of JSON. The API will return the contents of this field in public responses.

    This field is mapped in the Elasticsearch index as an object field with enabled: false, meaning Elasticsearch will ignore it for indexing (see the sketch after this list).

  • The query field contains the values that we're indexing, e.g. work title. This will contain a subset of the work/image data that is indexed and analysed by Elasticsearch.

    This field must be consistently defined between the pipeline and the API, or values won't be in the right place for queries.

  • The debug field contains the values that we use for debugging the pipeline, e.g. the date a document was indexed.

    This field should only contain information that the API can ignore.
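
A minimal sketch of what this structure could look like, written as TypeScript objects. The specific fields shown (title, contributors, indexedAt) and their types are illustrative assumptions, not a final schema; the fixed points from this proposal are the three top-level fields and the enabled: false mapping on display.

```typescript
// Shape of a document in the new API index (illustrative fields only).
interface IndexedWork {
  // Returned verbatim in public API responses; never queried.
  display: Record<string, unknown>;
  // The subset of fields that Elasticsearch analyses and queries.
  query: {
    title?: string;
    contributors?: string[];
  };
  // Pipeline debugging information, ignored by the API.
  debug: {
    indexedAt?: string;
  };
}

// Corresponding index mapping: `display` is stored but not indexed,
// while the `query` fields are analysed as usual.
const mappings = {
  properties: {
    display: { type: "object", enabled: false },
    query: {
      properties: {
        title: { type: "text" },
        contributors: { type: "keyword" },
      },
    },
    debug: {
      properties: {
        indexedAt: { type: "date" },
      },
    },
  },
};

// Queries target the analysed copies under `query.*`, which is why the
// pipeline and the API have to agree on where those values live.
const exampleQuery = {
  match: { "query.title": "wellcome" },
};
```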

Implementation

We can add these fields progressively, rather than in one massive update. This is a rough approach, which we could do for all three top-level fields separately:

  1. We copy the display models into the pipeline repo, and modify the ingestor to store these new fields. This is a strictly additive step.

  2. We reindex the pipeline to add this field to all documents (this could be a new pipeline, or we could do it in-place in an existing pipeline).

  3. We update the catalogue API to read the new fields directly for public API responses, rather than constructing its own display models (sketched after this list).

  4. We remove fields from the indexed Work model that aren't used for indexing/debugging.
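
A rough sketch of what step 3 could look like from the API side, assuming the v8-style Elasticsearch JavaScript client; the index name, query shape and handler signature are hypothetical. The point is that the response body comes straight out of the display field, with no internal-model conversion.

```typescript
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "https://localhost:9200" });

// Hypothetical search handler: query against the analysed `query.*`
// fields, but only fetch the pre-built `display` documents.
async function searchWorks(searchTerm: string) {
  const result = await client.search<{ display: unknown }>({
    index: "works-indexed", // illustrative index name
    _source: ["display"],   // we never need `query` or `debug` back
    query: {
      match: { "query.title": searchTerm },
    },
  });

  // The public response is just the stored display documents; the API no
  // longer converts an internal model into a display model.
  return {
    results: result.hits.hits.map((hit) => hit._source?.display),
  };
}
```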

Future work

  • Rewrite the API in TypeScript. This is explicitly out of scope here – let's not try to change too much at once.

    Although this change opens the door to a TypeScript-based API, let's stabilise the index structure before we start changing the API.

Open questions

  • Are these the best names for these fields?

  • Currently the API will check for internal model compatibility before it starts. Do we still want equivalent behaviour with index mappings?

    Because this change is meant to decouple the internal model and the API, I think we could get away with scrapping it for now – and bringing it back if and only if we see issues, rather than converting it to use index mappings pre-emptively.

  • Can we expunge the internal model library from the API repo entirely?

    We won't be using it in the application code, but it does have Work generators and index mappings that we use extensively in API tests. I think we have to keep it, but if we could get rid of it we'd get to simplify some build processes.
