
RFC 062: Content API: All search and indexing of addressable content types



Searching for content on wellcomecollection.org is currently split into separate, statically-ordered grids for Stories, Works, Images and Events. This RFC proposes a new "All" search endpoint that will return all Addressable content types in a single, ordered list, improving efficiency and relevance.

Last modified: 2024-11-18T10:11:11+00:00

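Table of contents

  • Background information
  • Elasticsearch "All" index
    • Addressable content types
    • Query objects alignment
  • Indexing
    • ETL diagram
    • Mapping
  • Content API response
  • Catalogue search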

Background information

The current "All" search (wellcomecollection.org/search) displays separate, statically-ordered grids for Stories, Works, Images and Events. In doing so, we are unintentionally creating a hierarchy of importance between those content types which does not match their actual relevance. Each grid also requires its own query call, which is inefficient.

As a next step, we are looking at making the "All" search expose all Prismic content types whose documents are available to our users through a UID-based URL ("Addressable content types"), with the results ordered by each individual document's relevance score. Works and Images ("Catalogue search") will still be available on the page, although their relevance scores will not be weighed against those of Addressable content types, as shown in the design prototype.

There will be no filtering or sorting features on this page. We therefore aim to build something minimal, which lets us keep query performance at the forefront of our concerns.

We will do so by creating a new endpoint: https://api.wellcomecollection.org/content/v0/all

Existing endpoints and indexes

TL;DR: This won't affect them in any way.

We wondered whether this new endpoint would remove the need for our existing, specialist ones (https://api.wellcomecollection.org/content/v0/articles, for example). Could we use only the new one and apply a filter when needed? We determined that the answer is no, as they serve a different purpose.

As the new endpoint and index are to be as minimalistic as possible, these "specialist" ones will still be the ones used in Content type-specific listing pages (wellcomecollection.org/stories) or search (wellcomecollection.org/search/articles), as they allow us to provide much more complex information, such as filters and aggregations.

Elasticsearch "All" index

We will be creating a single index in Elasticsearch containing all Addressable content types in their most minimalistic form.

Addressable content types

Here is a list of which Prismic content types we consider to be Addressable, in that their documents are all accessible to our users under a UID-based URL.

Exhibition highlight tour
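An Exhibition highlight tour is one Prismic document that is indexed as two documents, "Audio with transcripts" and "British sign language with subtitles". The following is a minimal sketch of that split, not the actual transformer: the input shape, the function name and the id suffixes are all assumptions made for illustration.

```typescript
// Hypothetical, simplified shape of the Prismic document (field names are assumptions).
interface HighlightTourDocument {
  id: string;
  uid: string;
  title: string;
  description: string;
}

type HighlightTourType =
  | "Audio with transcripts"
  | "British sign language with subtitles";

interface IndexedHighlightTour {
  id: string; // assumed to be made unique per variant
  uid: string;
  type: HighlightTourType;
  title: string;
  description: string;
}

// One Prismic document in, two index documents out (one per page on the website).
function transformHighlightTour(
  doc: HighlightTourDocument
): IndexedHighlightTour[] {
  // The suffixes are illustrative only; the real id scheme is not specified here.
  const variants: { suffix: string; type: HighlightTourType }[] = [
    { suffix: "audio", type: "Audio with transcripts" },
    { suffix: "bsl", type: "British sign language with subtitles" },
  ];
  return variants.map(({ suffix, type }) => ({
    id: `${doc.id}.${suffix}`,
    uid: doc.uid,
    type,
    title: doc.title,
    description: doc.description,
  }));
}
```

Giving each variant its own id is one way to keep the two index documents from overwriting each other; any scheme that produces stable, distinct ids would serve the same purpose.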

Query objects alignment

Search performance will benefit from having as few fields to look through as possible, with names that match across content types. We suggest:

query: {
  type: string,
  title: string,
  description: string,
  body?: string,
  contributors?: string[]
}

Should we want any other field to be queryable (such as "Format" for Projects), we will append it to one of the above, based on how we want that field to score. The only other one worth discussing is the description field:

Description, captions, standfirsts and intro texts

An audit of all such fields will be done as a separate ticket: https://github.com/wellcomecollection/wellcomecollection.org/issues/11401, so all references to such fields in the transformed objects should be taken with this in mind.
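The query object alignment can be sketched as a small transform. This is an illustration under stated assumptions rather than the pipeline's code: the `SourceDocument` shape, its field names and the precedence used to pick a description are hypothetical, and the audit ticket above is what will decide which field each content type really uses.

```typescript
// The aligned query object suggested in this RFC.
interface QueryObject {
  type: string;
  title: string;
  description: string;
  body?: string;
  contributors?: string[];
}

// Hypothetical pre-transform shape; every field name here is an assumption.
interface SourceDocument {
  type: string;
  title: string;
  promoCaption?: string;
  introText?: string;
  body?: string;
  contributors?: string[];
  format?: string; // e.g. the "Format" of a Project
}

function toQueryObject(doc: SourceDocument): QueryObject {
  // Assumed precedence, for illustration only.
  const description = doc.promoCaption ?? doc.introText ?? "";

  const query: QueryObject = {
    type: doc.type,
    title: doc.title,
    // An extra queryable field is appended to an existing field
    // instead of being added as a new one.
    description: [description, doc.format].filter(Boolean).join(" "),
  };

  // Optional fields are only set when there is something to search.
  if (doc.body) query.body = doc.body;
  if (doc.contributors && doc.contributors.length > 0) {
    query.contributors = doc.contributors;
  }
  return query;
}
```

Appending "Format" to `description` here is only an example; the RFC leaves the target field open, to be chosen by how that value should score.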

Indexing

Diagram of ETL (extract, transform, load) pipeline

Mapping

Content API response

We have decided not to worry about a default order for a queryless call to the endpoint, as it might have nothing to do with the Content API and be rendered through Prismic content or static code. We are therefore taking that out of the scope of this RFC.

{
  type: "ResultList",
  results: [
    {
      type: "Event",
      id: "WwQHTSAAANBfDYXU",
      uid: "lorem-ipsum",
      title: "Lorem ipsum dolor sit amet",
      description: "Aliquam erat volutpat."
    },
    {
      type: "Visual story",
      id: "ZdTCPREAACEA3zK4", 
      uid: "jason-and-the-adventure-of-254-visual-story",
      title: "Jason and the adventure of 254 visual story",
      description: "Aliquam erat volutpat"
    },
    ...
  ],
  pageSize: 10,
  totalPages: 49,
  totalResults: 482,
  nextPage: "https://api.wellcomecollection.org/content/v0/all?page=2",
}
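The response shape above can be written down as types, with the pagination fields derived from the result count. This is a sketch only: the `buildResultList` helper and the `query` parameter name are assumptions, and only `page` appears in the example.

```typescript
interface AddressableResult {
  type: string;
  id: string;
  uid: string;
  title: string;
  description: string;
}

interface ResultList {
  type: "ResultList";
  results: AddressableResult[];
  pageSize: number;
  totalPages: number;
  totalResults: number;
  nextPage?: string; // assumed to be absent on the last page
}

const ALL_ENDPOINT = "https://api.wellcomecollection.org/content/v0/all";

// Hypothetical helper: wraps one page of results in the ResultList envelope.
function buildResultList(
  results: AddressableResult[],
  totalResults: number,
  page: number,
  pageSize = 10,
  query?: string
): ResultList {
  const totalPages = Math.ceil(totalResults / pageSize);
  const list: ResultList = {
    type: "ResultList",
    results,
    pageSize,
    totalPages,
    totalResults,
  };
  if (page < totalPages) {
    const params = new URLSearchParams();
    if (query) params.set("query", query); // parameter name is an assumption
    params.set("page", String(page + 1));
    list.nextPage = `${ALL_ENDPOINT}?${params.toString()}`;
  }
  return list;
}
```

With the figures from the example, 482 results at 10 per page round up to 49 pages, and the first page links to `?page=2`.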

Catalogue search

Works

Works will be represented by a list of their workType values (formats) under a "Catalogue results" heading. To render the UI, we will only need:

  • label

  • count

  • id (for linking to a pre-filtered works search)

  • totalResults count

The required fields can be taken from the workType buckets in the Catalogue API response's aggregations: https://api.wellcomecollection.org/catalogue/v2/works?aggregations=workType&include=languages&pageSize=1 (adding a query keyword to the params should one be entered).

As the workType buckets and the totalResults count are the only things we really need from the response, I tried to keep the query as simple as possible (e.g. adding an include param limits the results objects). Suggestions welcome.
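A sketch of how the client might build that request and reduce the response to what the UI needs follows. The bucket shape (`data.id`, `data.label`, `count`) is a trimmed-down assumption about the Catalogue API response, and the pre-filtered search URL and function names are hypothetical.

```typescript
// Assumed, trimmed-down shape of the Catalogue API works response.
interface WorkTypeBucket {
  data: { id: string; label: string };
  count: number;
}

interface WorksResponse {
  totalResults: number;
  aggregations?: { workType?: { buckets: WorkTypeBucket[] } };
}

interface FormatLink {
  id: string;
  label: string;
  count: number;
  url: string; // link to a pre-filtered works search
}

const WORKS_API = "https://api.wellcomecollection.org/catalogue/v2/works";
// Assumed location of the works search page on the website.
const WORKS_SEARCH = "https://wellcomecollection.org/search/works";

function worksRequestUrl(query?: string): string {
  const params = new URLSearchParams({
    aggregations: "workType",
    include: "languages",
    pageSize: "1",
  });
  if (query) params.set("query", query);
  return `${WORKS_API}?${params.toString()}`;
}

function toCatalogueResults(
  response: WorksResponse,
  query?: string
): { totalResults: number; formats: FormatLink[] } {
  // Fall back to an empty list if the aggregation is missing.
  const buckets = response.aggregations?.workType?.buckets ?? [];
  const formats = buckets.map((bucket) => {
    const params = new URLSearchParams({ workType: bucket.data.id });
    if (query) params.set("query", query);
    return {
      id: bucket.data.id,
      label: bucket.data.label,
      count: bucket.count,
      url: `${WORKS_SEARCH}?${params.toString()}`,
    };
  });
  return { totalResults: response.totalResults, formats };
}
```

Called without a keyword, `worksRequestUrl()` reproduces the URL quoted above; with one, it appends `query=...` to the same parameters.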

Images

For the Images results, we need the first 5 results and the totalResults count. We will use the Catalogue API's images endpoint: https://api.wellcomecollection.org/catalogue/v2/images?pageSize=5, adding a query param should one be entered.
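The Images request can be sketched in the same way. The function names are hypothetical, and the response is typed generically because only `results` and `totalResults` matter here.

```typescript
const IMAGES_API = "https://api.wellcomecollection.org/catalogue/v2/images";

// Only the two fields the page needs; the image objects themselves stay opaque.
interface ImagesResponse<T> {
  totalResults: number;
  results: T[];
}

function imagesRequestUrl(query?: string): string {
  const params = new URLSearchParams({ pageSize: "5" });
  if (query) params.set("query", query);
  return `${IMAGES_API}?${params.toString()}`;
}

function toImagesSummary<T>(
  response: ImagesResponse<T>
): { totalResults: number; images: T[] } {
  return {
    totalResults: response.totalResults,
    // pageSize=5 should already cap this; slicing is a defensive measure.
    images: response.results.slice(0, 5),
  };
}
```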

Consult the design prototype here.

The new endpoint's response will return an ordered list of Addressable content types, as well as everything we need for pagination.

Each item in the list of Addressable content types also links to a file which describes what its documents are to look like in the Elasticsearch index. You may consult the complete list here instead.

  • Events: Transformed indexed Event example

  • Exhibitions: Transformed indexed Exhibition example

  • Stories: Transformed indexed Story example

  • Pages: Transformed indexed Page example

  • Visual stories: Transformed indexed Visual story example

  • Exhibition text: Transformed indexed Exhibition Text example

  • Exhibition highlight tour: Transformed indexed Exhibition Highlight examples. This document gets transformed into two different ones; consult the Exhibition highlight tour section.

  • Books: Transformed indexed Book example

  • Projects: Transformed indexed Project example

  • Seasons: Transformed indexed Season example

The Exhibition highlight tour document is a special case, in that it is one Prismic document that needs to be indexed as two documents: "Audio with transcripts" and "British sign language with subtitles", as they are two different pages on the website.

We have built our content types to use an array of fields to serve the same purpose; what could be called a "description" of the document gets called "Promo caption", "standfirst" (which is a slice, so part of the body), or "Intro text". There is a ticket which aims to address the case of the Standfirst slices, but in the meantime, we suggest we use only one name for these in the index: description. We will need to determine which content type should use which field as a description, but once that gets indexed, it becomes much easier to reference it by one name, at least in the "display" object.

See our suggestion for the mapping here. I've gone with what we have on our other indices; although I'm sure they could have different parameters, they have served us well so far. Suggestions for improvement are welcome.

As per this conversation, we will be fetching the Catalogue information from the Catalogue API, asynchronously, from the client. This will allow for a separation of concerns should one of the services be unhealthy.