API faceting principles & expectations

Status: Draft

Last updated: 16/03/2021

Background

The same questions keep coming up in conversations about how we expect the API to behave, particularly in terms of:

  • Filter naming

  • Aggregation naming

  • Combined effects of filters/aggregations

  • Aggregation response types

  • Empty aggregation buckets

These are the components required to build a useful faceted search interface that covers numerous dimensions. To do this effectively, we want these expectations to be both explicit and adhered to. This RFC is an attempt to document, question, and codify those expectations.

Principles

1. Filters are named by the JSON path of the identified object that they filter or, if applied to an attribute other than the identifier, by the path of that attribute

For example, given a display document (i.e., one of the JSON entities returned by the API) that looks like:

{
  "a": {
    "b": [
      {
        "id": "id1",
        "label": "Thing 1"
      },
      {
        "id": "id2",
        "label": "Thing 2"
      }
    ]
  }
}

Then a document filter on the identifiers of the objects in b would use a query parameter named a.b, for example:

http://host.name/path/docs?a.b=id1

If the filter applied to the label attribute rather than the identifier, it would look like:

http://host.name/path/docs?a.b.label=Thing%202
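As a minimal sketch of how a filter name could map onto a display document, the following Python resolves a dotted name against the example above. The function names `resolve_path` and `matches_filter` are hypothetical, and comparing by `id` when a path ends on an object is an assumption drawn from this principle, not a description of how the API is implemented.

```python
from typing import Any, List


def resolve_path(doc: Any, path: str) -> List[Any]:
    """Collect every value reachable at a dotted path, stepping through lists."""
    nodes = [doc]
    for key in path.split("."):
        next_nodes = []
        for node in nodes:
            # Lists are transparent: look inside each element.
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_nodes.append(item[key])
        nodes = next_nodes
    # Flatten a trailing list so callers always see individual values.
    flat = []
    for node in nodes:
        flat.extend(node if isinstance(node, list) else [node])
    return flat


def matches_filter(doc: dict, name: str, value: str) -> bool:
    """True if the filter `name=value` selects this document."""
    for found in resolve_path(doc, name):
        # Assumption: a path ending on an identified object compares by id.
        candidate = found.get("id") if isinstance(found, dict) else found
        if candidate == value:
            return True
    return False


doc = {"a": {"b": [{"id": "id1", "label": "Thing 1"},
                   {"id": "id2", "label": "Thing 2"}]}}
print(matches_filter(doc, "a.b", "id1"))            # True
print(matches_filter(doc, "a.b.label", "Thing 2"))  # True
print(matches_filter(doc, "a.b", "Thing 2"))        # False
```

Note that the query-string value `Thing%202` is assumed to have been URL-decoded to `Thing 2` before it reaches this function.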

2. Aggregations are always paired with identically named filters

Not every filter needs an aggregation, but every aggregation must be accompanied by an identically named filter for the property being aggregated upon. This is because, in a faceted search interface, the primary purpose of aggregations is to allow for further filtering. This document refers to these as "paired" filters and aggregations. For the above example, an aggregation on the identified objects in b would be:

http://host.name/path/docs?aggregations=a.b

and an aggregation on the labels would be:

http://host.name/path/docs?aggregations=a.b.label
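One way to keep this rule enforced is a simple check over the names an endpoint supports. The sketch below is illustrative only: `unpaired_aggregations` and the name `a.c` are hypothetical and do not come from the API.

```python
from typing import Iterable, Set


def unpaired_aggregations(filters: Iterable[str],
                          aggregations: Iterable[str]) -> Set[str]:
    """Return aggregation names that lack an identically named filter."""
    return set(aggregations) - set(filters)


# Hypothetical names an endpoint might support.
supported_filters = {"a.b", "a.b.label", "a.c"}
supported_aggregations = {"a.b", "a.b.label"}

# A filter without an aggregation ("a.c") is allowed; the reverse is not.
print(unpaired_aggregations(supported_filters, supported_aggregations))  # set()
print(unpaired_aggregations({"a.b"}, {"a.b", "a.b.label"}))  # {'a.b.label'}
```

A check like this could run as a unit test so that a new aggregation cannot be added without its paired filter.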

3. Aggregations are returned in an aggregations field, with the same name by which they were requested

This means JSON paths are still represented as strings, rather than being expanded into nested objects. For example, the response to the previous example would include, at the top level:

{
  ...,
  "aggregations": {
    "a.b": {
       "buckets": [
         ...
       ]
    }
  }
}

4. Aggregation buckets contain a data field of the same type as the aggregated object

That is to say, when we aggregate on a string field (for example, a label), each bucket returns the full entity that contains the field rather than the bare string. For the example above, if we aggregate on the labels like this:

http://host.name/path/docs?aggregations=a.b.label

Then our response buckets will look something like this:

{
  "data": {
    "id": "id1",
    "label": "Thing 1"
  },
  "count": 1234
}
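The bucket shape above, together with the flat aggregation key from principle 3, can be sketched in Python as follows. This is an illustration rather than the API's implementation: `aggregate_by_label` is a hypothetical name, and counting each label once per document is an assumption about what `count` means.

```python
from collections import Counter
from typing import Dict, List


def aggregate_by_label(docs: List[dict]) -> dict:
    """Bucket the entities in a.b by label, returning whole entities as data."""
    counts: Counter = Counter()
    entities: Dict[str, dict] = {}
    for doc in docs:
        # Assumption: a label is counted at most once per document.
        seen = {e["label"]: e for e in doc.get("a", {}).get("b", [])}
        for label, entity in seen.items():
            counts[label] += 1
            entities.setdefault(label, entity)
    buckets = [{"data": entities[label], "count": count}
               for label, count in counts.most_common()]
    # The key is the requested name as a flat string, not a nested object.
    return {"aggregations": {"a.b.label": {"buckets": buckets}}}


thing1 = {"id": "id1", "label": "Thing 1"}
thing2 = {"id": "id2", "label": "Thing 2"}
docs = [{"a": {"b": [thing1, thing2]}}, {"a": {"b": [thing1]}}]

response = aggregate_by_label(docs)
print(response["aggregations"]["a.b.label"]["buckets"][0])
# {'data': {'id': 'id1', 'label': 'Thing 1'}, 'count': 2}
```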

5. When a filter and its paired aggregation are both applied, that aggregation's buckets are not filtered

Conversely, filters do apply to the buckets of all aggregations other than the paired aggregation. This initially confusing requirement is necessary because, for mutually exclusive values, applying the filter to the aggregation buckets would remove all but the selected bucket, leaving the interface unable to show other options for that filter. Non-mutually exclusive values are not affected in this way.

For the example above, filtering and aggregating on a.b like this:

http://host.name/path/docs?a.b=id1&aggregations=a.b

Would still return all of the buckets, even though the results only contain the id1 documents:

{
  "buckets": [
    {
      "data": {
        "id": "id1",
        "label": "Thing 1"
      },
      "count": 2345
    },
    {
      "data": {
        "id": "id2",
        "label": "Thing 2"
      },
      "count": 1234
    }
  ]
}

But if a separate (non-paired) filter were applied that happened to exclude all of the id2 documents, then the id2 bucket would not be present.
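The behaviour in this principle can be sketched over an in-memory list of documents. Everything here is a simplification for illustration: the flattened documents, the second filter name `a.c`, and the functions `apply_filters` and `aggregate` are all hypothetical, and buckets are reduced to a mapping from value to count.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical flattened documents: each filter name maps to a single value.
DOCS = [
    {"a.b": "id1", "a.c": "x"},
    {"a.b": "id1", "a.c": "y"},
    {"a.b": "id2", "a.c": "x"},
]


def apply_filters(docs: List[dict], filters: Dict[str, str]) -> List[dict]:
    """Keep the documents that match every filter."""
    return [d for d in docs
            if all(d.get(name) == value for name, value in filters.items())]


def aggregate(docs: List[dict], name: str,
              filters: Dict[str, str]) -> Dict[str, int]:
    """Count values of `name`, ignoring its paired filter but honouring the rest."""
    other_filters = {k: v for k, v in filters.items() if k != name}
    matching = apply_filters(docs, other_filters)
    return dict(Counter(d[name] for d in matching if name in d))


filters = {"a.b": "id1"}
print(len(apply_filters(DOCS, filters)))        # 2
print(aggregate(DOCS, "a.b", filters))          # {'id1': 2, 'id2': 1}
print(aggregate(DOCS, "a.c", filters))          # {'x': 1, 'y': 1}
print(aggregate(DOCS, "a.b", {"a.c": "y"}))     # {'id1': 1}
```

The last call shows the non-paired case: filtering on `a.c=y` leaves no id2 documents, so the id2 bucket disappears from the `a.b` aggregation.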

6. When a filter and its paired aggregation are both applied, the bucket corresponding to the filtered value is always present

Explicitly: even if other filters or queries cause the bucket for a currently filtered value to be empty (i.e., it has a count of 0), that bucket still appears in the aggregation. This is necessary so that the interface for the filter can still be rendered.
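A minimal sketch of this rule, assuming buckets are simplified to a mapping from value to count and using a hypothetical helper named `ensure_selected_bucket`:

```python
from typing import Dict, Optional


def ensure_selected_bucket(buckets: Dict[str, int],
                           selected: Optional[str]) -> Dict[str, int]:
    """Add a zero-count bucket for the filtered value if it is missing."""
    result = dict(buckets)
    if selected is not None:
        result.setdefault(selected, 0)
    return result


# Another filter or query has emptied the id1 bucket, but a.b=id1 is applied.
print(ensure_selected_bucket({"id2": 7}, "id1"))  # {'id2': 7, 'id1': 0}
# With no paired filter applied, empty buckets stay absent.
print(ensure_selected_bucket({"id2": 7}, None))   # {'id2': 7}
```

In a real response the zero-count bucket would also need its `data` entity, which this sketch leaves out.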

7. Aggregations on fields contained in sum types return buckets of the type's components

In other words, when objects carry a discriminator, aggregations on identical string properties of those objects return a separate bucket for each type. For example, given the following display document:

{
  "a": {
    "b": [
      {
        "label": "A thing",
        "type": "TypeOne"
      },
      {
        "label": "A thing",
        "type": "TypeTwo"
      }
    ]
  }
}

Then an aggregation a.b.label would return separate buckets for each of the objects in b, even though their labels (the property being aggregated) are identical, because of the presence of the discriminator field type.
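One way to picture this is to key each bucket on the pair of discriminator and label rather than on the label alone. The sketch below is an illustration under that assumption; `aggregate_labels_by_type` is a hypothetical name, and counting once per document is likewise assumed.

```python
from collections import Counter
from typing import List


def aggregate_labels_by_type(docs: List[dict]) -> List[dict]:
    """Bucket the entities in a.b by (type, label) so types stay separate."""
    counts: Counter = Counter()
    for doc in docs:
        # Assumption: each (type, label) pair is counted once per document.
        keys = {(e["type"], e["label"]) for e in doc["a"]["b"]}
        counts.update(keys)
    return [{"data": {"label": label, "type": type_}, "count": n}
            for (type_, label), n in sorted(counts.items())]


doc = {"a": {"b": [{"label": "A thing", "type": "TypeOne"},
                   {"label": "A thing", "type": "TypeTwo"}]}}

for bucket in aggregate_labels_by_type([doc]):
    print(bucket)
# {'data': {'label': 'A thing', 'type': 'TypeOne'}, 'count': 1}
# {'data': {'label': 'A thing', 'type': 'TypeTwo'}, 'count': 1}
```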

Open questions

  • Are these rules sufficient to tell us what we expect regarding empty buckets? Namely, that they should only be present when necessary to satisfy principle (6). Yes, they are.

  • Are there cases when aggregations/filters should be named differently to rule (1)? For example, if we want to aggregate on an id property is it sufficient to use the name of the identified entity (eg, a.b rather than a.b.id)? Yes, clarified above.

