RFC 049: Changing how aggregations are retrieved by the Catalogue API

Last updated 10 months ago

For RFC 047, we changed the catalogue API to serialise public API responses from an opaque display field in the Elasticsearch documents.

Previously we were storing the pipeline's internal model in Elasticsearch. The API would retrieve the internal model, parse it, and convert it into the display model -- all as part of serving the request. Now, the pipeline creates the display model and stores it in a new display field in Elasticsearch. The API retrieves the display field, and treats it as opaque JSON -- without any knowledge of its structure.

We'd hoped this would allow us to remove the internal/display models from the API repos, but the current implementation of aggregations makes this tricky.

This RFC proposes a change to how aggregations are handled that should remove this obstacle.

Context: the desired behaviour

Clients of the catalogue API can request aggregations of certain values (e.g. languages, licenses, work types).

These are presented as a list of AggregationBucket in the API responses, i.e.:

AggregationBucket[T] {
  data: T
  count: Int
  type: String
}

where T is the same as the value appearing in the Work model.

For example, languages on a work appear as follows:

{
  "languages": [
    {
      "id": "eng",
      "label": "English",
      "type": "Language"
    },
    {
      "id": "fre",
      "label": "French",
      "type": "Language"
    }
  ],
  ...
}

and we see that Language type in the aggregation buckets:

"aggregations": {
  "languages": {
    "buckets": [
      {
        "data": {
          "id": "eng",
          "label": "English",
          "type": "Language"
        },
        "count": 691840,
        "type": "AggregationBucket"
      },
      {
        "data": {
          "id": "fre",
          "label": "French",
          "type": "Language"
        },
        "count": 67187,
        "type": "AggregationBucket"
      },
...

This presents a clean, consistent interface to clients – a value looks the same whether it's in a work or in an aggregation.

We don't want to change this behaviour.
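
As a minimal sketch of the behaviour described above, the snippet below wraps a language value in the AggregationBucket shape and checks that it is identical to the value on the work. The helper name `to_bucket` is hypothetical and is not taken from the API codebase; the count comes from the example response.

```python
from typing import Any, Dict


def to_bucket(data: Dict[str, Any], count: int) -> Dict[str, Any]:
    """Hypothetical helper: wrap a display value in the AggregationBucket shape."""
    return {"data": data, "count": count, "type": "AggregationBucket"}


# A language exactly as it appears on a work.
english = {"id": "eng", "label": "English", "type": "Language"}

work = {"languages": [english]}
aggregations = {"languages": {"buckets": [to_bucket(english, 691840)]}}

# The value inside each bucket has the same shape as the value on the work.
bucket_values = [b["data"] for b in aggregations["languages"]["buckets"]]
same_shape = all(lang in bucket_values for lang in work["languages"])
```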

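The current implementation, and the problems it poses

The API uses Elasticsearch aggregations to aggregate over a field in the internal model. Elasticsearch will return single string values, which the API then serialises into the display model.

For example, our language aggregation starts as an ES terms aggregation over the data.languages.id field: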

GET /works-indexed-2022-04-28/_search
{
  "aggs": {
    "languages": {
      "terms": { "field": "data.languages.id" }
    }
  }
}

The Elasticsearch API response can only use a single string in its buckets, for example:

{
  "aggregations" : {
    "languages" : {
      "buckets" : [
        {
          "key" : "eng",
          "doc_count" : 691840
        },
        {
          "key" : "fre",
          "doc_count" : 67187
        },
        ...

The API has to contain enough internal/display model logic to interpret these values -- to know that, say, eng means English and it's serialised as id/label/type. This is precisely the sort of model coupling we're trying to get away from.

For more complex types, we've had to jump through hoops to shoehorn aggregations into this approach -- e.g. contributor values are stored like person:Henry Wellcome because we need both the type and the label to return them correctly.

It would be nice if we could remove this coupling and simplify how aggregations work in the API.
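
The coupling described in this section can be sketched roughly as follows. This is an illustration, not the API's actual code: the lookup table, the function names, and the "Person" display type are assumptions. It shows why the API needs model knowledge to turn bare keys such as eng or person:Henry Wellcome back into display values.

```python
from typing import Dict

# Hypothetical lookup table: the API has to know how an ID maps to a label.
LANGUAGE_LABELS: Dict[str, str] = {"eng": "English", "fre": "French"}


def language_bucket(es_bucket: dict) -> dict:
    """Rebuild the display form of a language from a bare Elasticsearch key."""
    language_id = es_bucket["key"]
    return {
        "data": {
            "id": language_id,
            "label": LANGUAGE_LABELS[language_id],
            "type": "Language",
        },
        "count": es_bucket["doc_count"],
        "type": "AggregationBucket",
    }


def contributor_bucket(es_bucket: dict) -> dict:
    """Unpack a 'type:label' key such as 'person:Henry Wellcome'."""
    # Split on the first colon only, in case the label itself contains one.
    agent_type, label = es_bucket["key"].split(":", 1)
    return {
        # Assumption: the display type is the capitalised prefix.
        "data": {"label": label, "type": agent_type.capitalize()},
        "count": es_bucket["doc_count"],
        "type": "AggregationBucket",
    }
```

Every aggregatable type needs its own function like these, each one encoding knowledge of the display model inside the API.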

Proposed solution

We add a new field query.aggregatableValues to the documents we store in Elasticsearch.

type Query {
  aggregatableValues: Map[String, List[String]]
  ...
}

where the keys are the aggregation types (languages, work types, licenses) and the values are lists of display JSON stored as strings.

This is easiest to understand with an example:

{
  "id": "example-work",
  "query": {
    "aggregatableValues": {
      "languages": [
        " { \"id\" : \"eng\", \"label\": \"English\", \"type\": \"Language\" } ",
        " { \"id\" : \"fre\", \"label\": \"French\", \"type\": \"Language\" } "
      ],
      "items.locations.license": [
        " { \"id\": \"pdm\", \"label\": \"Public Domain Mark\", \"url\": \"https://creativecommons.org/share-your-work/public-domain/pdm/\", \"type\": \"License\" } "
      ],
      ...
    }
  },
  ...
}

The ingestors would populate these aggregatableValues fields when they index a work. Each of these fields would be mapped as a keyword field in Elasticsearch.

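# ingestor (pseudocode)

aggregatableValues = {
    "languages": [
        lang.to_display_json().as_string()
        for lang in work.languages
    ],
    # licenses live on the locations of each item
    "items.locations.license": [
        location.license.to_display_json().as_string()
        for item in work.items
        for location in item.locations
    ],
    ...
}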
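
A runnable sketch of the ingestor step, under stated assumptions: the function names are hypothetical, and the input is taken to be a work already in display form. A terms aggregation on a keyword field groups by exact string value, so this sketch serialises with sorted keys and fixed separators to make equal values produce identical strings. The RFC does not specify a serialisation; that choice is an assumption of this example.

```python
import json
from typing import Any, Dict, List


def as_aggregatable(value: Dict[str, Any]) -> str:
    """Serialise a display value to a canonical JSON string."""
    # sort_keys and fixed separators keep the output stable, so equal
    # values always produce the same terms-aggregation key.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def build_aggregatable_values(display_work: Dict[str, Any]) -> Dict[str, List[str]]:
    """Collect the aggregatable display values for one work."""
    licenses = [
        location["license"]
        for item in display_work.get("items", [])
        for location in item.get("locations", [])
        if "license" in location  # assumption: not every location has a license
    ]
    return {
        "languages": [
            as_aggregatable(lang) for lang in display_work.get("languages", [])
        ],
        "items.locations.license": [as_aggregatable(lic) for lic in licenses],
    }
```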

The API would aggregate over these fields specifically. The Elasticsearch terms aggregation would return something like:

{
  "aggregations" : {
    "languages" : {
      "buckets" : [
        {
          "key" : " { \"id\" : \"eng\", \"label\": \"English\", \"type\": \"Language\" } ",
          "doc_count" : 691840
        },
        {
          "key" : " { \"id\" : \"fre\", \"label\": \"French\", \"type\": \"Language\" } ",
          "doc_count" : 67187
        },
        ...

and the API would unpack the keys as opaque JSON objects, and pass the value into its response:

# api

displayBuckets = [
    {
        "data": parse_json(bucket.key),
        # Elasticsearch reports the bucket size as doc_count
        "count": bucket.doc_count,
        "type": "AggregationBucket"
    }
    for bucket in es_aggregations_response
]

This would allow us to reduce the amount of model logic in the API, and would ensure a consistent rendering of values in aggregations and works.
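
The API side of the proposal could look roughly like the sketch below. The function names are hypothetical and the sketch works on plain dictionaries rather than a real Elasticsearch client. The field path follows the query.aggregatableValues field proposed above.

```python
import json
from typing import Any, Dict, List


def aggregation_request(names: List[str]) -> Dict[str, Any]:
    """Build a search body with one terms aggregation per requested name."""
    return {
        "aggs": {
            name: {"terms": {"field": f"query.aggregatableValues.{name}"}}
            for name in names
        }
    }


def display_aggregations(es_response: Dict[str, Any]) -> Dict[str, Any]:
    """Convert Elasticsearch terms buckets into AggregationBucket values."""
    return {
        name: {
            "buckets": [
                {
                    # The key is parsed but never inspected: opaque JSON.
                    "data": json.loads(bucket["key"]),
                    "count": bucket["doc_count"],
                    "type": "AggregationBucket",
                }
                for bucket in aggregation["buckets"]
            ]
        }
        for name, aggregation in es_response["aggregations"].items()
    }
```

Nothing in this code refers to languages, licenses, or any other model type, so adding a new aggregation would only require a change in the pipeline.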
