RFC 058: Relevance testing

This RFC describes how and why we might write a new version of rank, our relevance testing tool.

Last modified: 2023-06-20T14:04:56+01:00

Context

We develop and test the relevance of our search results using a tool called rank. By making sure that our queries return the expected results for a set of known, indicative search terms, we can be confident that search is performing as intended.

Rank began as a browser-based UI for displaying a few simple tests which were run against the elasticsearch ranking evaluation API. It was written in next.js and typescript, deployed on vercel, and stored a lot of its tests and configuration as json.

These days, rank no longer has any kind of browser interface. Instead, rank is a CLI tool which allows developers to run a much broader range of search relevance tests, with utilities for managing the testing data and infrastructure. Ad-hoc experiments can be run locally by developers, and the tests also run automatically in CI on a regular basis, with alerts set up for any regressions in search quality.

Rank is still written in typescript, and uses a lot of the same code as the original browser-based version.

It's taken us a while to figure out what rank is and how it should be used, and in that time, we've built up a lot of technical debt in the tooling.

Now that rank's purpose is more stable and its direction of travel is clearer, we should take the opportunity to rewrite some of the more problematic parts of its codebase.

Components of the rank ecosystem

Environments (CI and local)

Rank can be run locally to measure the effect that experimental mappings or queries will have on search quality.

We also run rank tests in CI to make sure that drift in the underlying dataset won't cause a regression in search quality or invalidate our tests.

This pattern works well. We should keep it!

Elasticsearch cluster

The rank cluster is kept separate so that tests can be run without worrying about expensive or long-running queries affecting production services.

Like the rest of our clusters, the rank cluster is defined and managed in terraform.

Again, this structure is good and we should keep it!

CLI

The CLI is where the majority of rank's technical debt has built up, and where there is the most potential for improvement.

While javascript/typescript is an appropriate language for making straightforward rank eval API requests from a browser, it's not a great choice for writing CLIs or performing any complicated manipulation of the responses.

Users should be able to install the rank CLI with pip from a local pyproject.toml (ie it shouldn't need to be published to registries like pypi). The CLI should run with a top-level rank command.

Test configuration

Tests are currently formatted as json, and live in a data directory alongside mappings, queries, and terms (common search terms from real users, collected from the reporting cluster). At test time, we read these json documents, map them into some rigidly defined test structures, and then run them against the target index. The test results are then written to stdout.

These tests aren't data, and shouldn't exist as static files which are read by some smart test-constructing code. The test logic is over-abstracted, making it difficult to write new tests which are expressive of a test's intention, or of how it is being scored.

We should instead be writing tests as code, more tightly coupled with the test runner. Each test should be expressive of its intent, and of how its pass/fail result is calculated.
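
As a sketch of what that could look like in python (which we propose moving to later in this RFC), a test might read as follows; the rank_search() helper, the search term, and the document ID are all hypothetical:

# A sketch of a test written as code. The rank_search() helper, the search
# term, and the document ID are hypothetical.
from rank.search import rank_search


def test_searching_for_an_exact_title_puts_that_work_first():
    # The intent and the pass condition are visible in the test itself
    result_ids = rank_search("an exact work title")
    assert result_ids[0] == "abc123"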

Rank eval API requests

Our testing needs have developed over time, and we rarely use elasticsearch's rank eval API in the way that it's supposed to be used. In many cases, the requests we use to analyse quality are straightforward search requests.

The outputs of our tests are often more binary than the rank eval API's responses, and we're discarding information about scoring which might be useful. In other cases, we've extended our code to test things which the API doesn't support, eg relative positions of expected results.

These differences are hard to understand from the code, and are not well documented. A new implementation should be clearer about where those divergences are.
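
For example, checking the relative positions of two expected results is more naturally expressed as a plain search request. A minimal sketch with the elasticsearch python client (proposed later in this RFC); the cluster address, index name, query, and document IDs are hypothetical:

# A sketch of a relative-position check as a plain search request, using the
# elasticsearch python client. The connection details, index name, query and
# document IDs are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://rank-cluster.example:9243", api_key="...")

response = es.search(
    index="works-candidate",
    query={"match": {"title": "everest"}},
    size=10,
)
result_ids = [hit["_id"] for hit in response["hits"]["hits"]]

# The rank eval API has no way of expressing "work-a should rank above
# work-b", but as a plain search it's a one-line assertion
assert result_ids.index("work-a") < result_ids.index("work-b")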

Test indices

At the moment, we're able to test against indices copied from the catalogue API, ie works and images. We'd like to be able to test against other types of content like articles, exhibitions, events, functional content, and concepts.

These source indices live in different clusters, so in future we would need to support cross-cluster replication into the rank cluster from multiple source clusters.

Test outputs

We know that optimising search relevance is a game of compromises, and that we're unlikely to be able to satisfy every search intention perfectly. In other words, we expect some of our tests to fail every time, even when search quality is good.

For example, we might run a test for a new candidate mapping/query where document abc is expected to appear as the first result. A new version of the index might cause it to appear as the second result instead. The current version of rank would consider this a failure, and would alert us to the regression.

To keep things moving in cases where we're satisfied with the overall search quality, users can currently set a knownFailure flag on individual tests, allowing the full suite to pass as a whole even when individual tests fail.

knownFailures are a useful but ugly bandage over an interesting problem. They obscure the severity of each failure during experiments, and make it harder to evaluate the development of search relevance over time.

While the example above might represent a minor degradation in search quality for one intention, it's not something we would normally consider a catastrophic failure. If the same change led to an improvement in search quality for a different set of intentions, we might still want to deploy it.

The scoring and passing of tests should be more nuanced, and should be able to account for the ideal and worst-case scenarios for each test. Ideally, we should show a summary of the scores at the end of each test run, even in cases where the tests pass with ideal scores.

We should still be alerted to any degradations in individual or overall scores in CI, and we should be able to set extreme thresholds for each test which would cause rank to fail.

Features

Having established the problems with the current rank CLI, we can start to think about what we'd like to be able to do with a new version.

  • Setup and index management

    • Copy a production index to the rank cluster (without affecting production search)

    • Create a new index in the rank cluster with a given mapping, using data from a copied production index

    • Update candidate index config in the rank cluster

    • Check the progress of a reindex

    • Delete an index in the rank cluster

    • Fetch copies of the index config for a production index

    • Fetch copies of the queries which run in production search

  • Local testing and experimentation

    • Run rank tests, outputting an overall pass/fail result along with a summary of the individual tests

    • Run an individual test, or a subset of tests

    • Run a search with a candidate query against a candidate index, outputting formatted results on the command line

    • Compare the speed of candidate queries against production queries

  • CI testing

    • Run all tests in CI, outputting a pass/fail status with a summary of the individual tests

CLI command tree

The following is a rough tree structure of the CLI commands which we'd like to support in v2.

rank
├── index
│   ├── list
│   ├── create
│   ├── update
│   ├── delete
│   ├── get
│   └── replicate
├── task
│   ├── check
│   └── delete
├── search
│   ├── get-terms
│   └── compare
└── test
    ├── run
    └── list
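
A sketch of how that tree might be laid out with typer follows; the module layout, command names, and options are illustrative rather than final, and the top-level app would be exposed as the rank command via a (hypothetical) [project.scripts] entry in pyproject.toml, eg rank = "rank.cli:app".

# A sketch of the proposed command tree in typer. The module layout, command
# names, and options are illustrative, not final.
from typing import Optional

import typer

app = typer.Typer(help="Relevance testing for search")
index_app = typer.Typer(help="Manage indices in the rank cluster")
task_app = typer.Typer(help="Check or cancel long-running tasks, eg reindexes")
search_app = typer.Typer(help="Run ad-hoc searches and query comparisons")
test_app = typer.Typer(help="Run relevance tests")

app.add_typer(index_app, name="index")
app.add_typer(task_app, name="task")
app.add_typer(search_app, name="search")
app.add_typer(test_app, name="test")


@index_app.command("list")
def list_indices():
    """List the indices in the rank cluster."""
    ...


@test_app.command("run")
def run_tests(test_id: Optional[str] = typer.Option(None, help="Run a single test or a subset of tests")):
    """Run the rank tests against a candidate index."""
    ...


if __name__ == "__main__":
    app()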

Testing indices are held in a dedicated rank cluster, away from our production cluster. The test indices are snapshots of the production data, intermittently copied over from production clusters with CCR (cross-cluster replication).
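
If we move the tooling to python (as proposed below), setting up that replication might look something like this with the client's CCR helpers; the remote cluster alias and the index names are hypothetical, and the exact call shape depends on the client version:

# A sketch of creating a follower index in the rank cluster with cross-cluster
# replication, assuming the elasticsearch python client's CCR helpers. The
# remote cluster alias and the index names are hypothetical.
from elasticsearch import Elasticsearch

rank_es = Elasticsearch("https://rank-cluster.example:9243", api_key="...")

rank_es.ccr.follow(
    index="works-production-copy",  # follower index in the rank cluster
    remote_cluster="catalogue",     # remote cluster alias configured on the rank cluster
    leader_index="works-indexed",   # production index to copy
)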

Data manipulation is cumbersome in typescript, and we've inherited a lot of the original code from the browser-based version of rank (for example, the way requests are currently bundled using search templates makes it harder, not easier, to run batches of requests!).

We should rewrite rank's CLI and testing backbone in python, using the elasticsearch python client, pytest, and typer.

Typer includes a lot of the functionality that we've had to build ourselves in typescript (argument/option parsing, prompting, etc), and goes further in many cases (for example, it can automatically generate --help output and full static docs for the CLI).

Moving to python will also give us access to data science libraries like pandas and numpy, making it easier to develop more complex analyses of the test results.
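
For example, collecting per-test scores into a dataframe makes it easy to summarise a run or compare runs over time; the result structure and scores below are hypothetical:

# A sketch of analysing test results with pandas; the result structure and
# the scores are hypothetical.
import pandas as pd

results = pd.DataFrame(
    [
        {"test": "exact title", "score": 1.0, "ideal": 1.0},
        {"test": "contributor", "score": 0.72, "ideal": 0.9},
        {"test": "misspelled genre", "score": 0.4, "ideal": 0.6},
    ]
)
results["shortfall"] = results["ideal"] - results["score"]
print(results.sort_values("shortfall", ascending=False).to_string(index=False))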

Pytest's parametrised tests and fixtures might help us achieve the plain-language, intention-expressive test-writing style described in the test configuration section above.
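
A sketch of that style, with a fixture providing the client and a parametrised list of terms and expected results; the connection details, index name, query, search terms, and document IDs are all hypothetical:

# A sketch of a parametrised test using a fixture. The connection details,
# index name, query, search terms, and document IDs are hypothetical.
import pytest
from elasticsearch import Elasticsearch


@pytest.fixture(scope="session")
def rank_client():
    return Elasticsearch("https://rank-cluster.example:9243", api_key="...")


@pytest.mark.parametrize(
    "term, expected_id",
    [
        ("wellcome", "work-1"),
        ("florence nightingale", "work-2"),
    ],
)
def test_expected_work_appears_in_the_top_three(rank_client, term, expected_id):
    response = rank_client.search(index="works-candidate", query={"match": {"title": term}}, size=3)
    assert expected_id in [hit["_id"] for hit in response["hits"]["hits"]]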

It's possible to augment pytest's outputs with extra context by writing plugins. By defining a pytest_report_header() or pytest_terminal_summary() hook (among others), we can be much more explicit about what the output of each test (or group of tests) means.
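
For instance, a small conftest.py plugin could add run-level context to the header and a score summary to the end of the output; the wording here is hypothetical:

# A sketch of a conftest.py plugin adding context to pytest's output; the
# header text and the summary wording are hypothetical.
def pytest_report_header(config):
    return "rank: relevance tests running against the candidate index"


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    passed = len(terminalreporter.stats.get("passed", []))
    failed = len(terminalreporter.stats.get("failed", []))
    terminalreporter.section("rank summary")
    terminalreporter.write_line(
        f"{passed} search intentions fully satisfied, {failed} below an acceptable score"
    )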

In those cases where scores are below an ideal threshold but aren't so bad that we want the whole suite to fail, we can use pytest to raise warnings, which fall into a separate section of the output.
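
A sketch of how a single test might combine a hard floor (fail) with an ideal threshold (warn); the scoring helper, the score, and the threshold values are hypothetical:

# A sketch of warn-vs-fail thresholds in a test. The scoring helper, the
# score, and the threshold values are hypothetical.
import warnings


def check_score(score, ideal, floor, name):
    """Fail only below the hard floor; warn when below the ideal score."""
    assert score >= floor, f"{name}: score {score} fell below the floor of {floor}"
    if score < ideal:
        warnings.warn(f"{name}: score {score} is below the ideal of {ideal}")


def test_contributor_searches_rank_their_works_highly():
    score = 0.82  # in practice, calculated from the search results
    check_score(score, ideal=0.95, floor=0.5, name="contributor search")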

NB These changes to scoring would represent a meaningful (but incomplete) step towards a proper NDCG implementation, for which we currently don't have the data.

See Examples of rank CLI usage for illustrations of how these commands might be used.
