RFC 021: Data science in the pipeline

Status: 🏗 Draft 🚧

Last updated: 2020/07/29

Motivation

We want to augment works and images with data inferred from them using data science techniques: for example, feature vectors and colour palettes for images.

Currently, we do this by holding both some form of index (usually a set of points in a vector space) and a model in a separate service - for example, https://labs.wellcomecollection.org/feature-similarity. This has significant drawbacks: the data is duplicated, becomes outdated and patchy, and, perhaps most importantly, must be searched in a wholly separate, all-or-nothing operation outside our ES indices.

To resolve this, we want to bring data science services into the pipeline, and use them to augment works/images with data that can be indexed and searched in ES. This RFC details how that might look.

Proposal

We can write down some desiderata for any proposed solution:

  • Data science (DS) logic can be written in Python

  • DS services do not know about our data model(s)

  • Pipeline services do not know about data science

  • DS services do not know about SQS, message passing, the Actor model, etc

  • DS models are persistent and separate to DS inferrers

  • DS models can be retrained on demand

These suggest three distinct, but loosely coupled, types of service:

  • Inferrer: A Python service that provides a synchronous API (most likely RESTful HTTP), consuming whatever is needed to infer the new data and returning that data. There can be multiple different inferrers; a sketch of what one might look like follows this list.

  • Inference Manager: A Scala service that lives in the existing pipeline and contains "the usual" Wellcome message-passing, Akka, etc logic & libraries. It performs any work that is required by all of the inferrers, synchronously calls all of them and attaches the new data from them to the work/image (by populating a field) before passing it along. There is one inference manager for all the inferrers.

  • Model Trainer: A Python service that can consume records from the catalogue index in bulk in order to train a model, and outputs/stores a persistent representation of this model for an inferrer to use. There is optionally one model trainer for each inferrer.
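
As a concrete illustration of the first of these, a minimal inferrer might look like the sketch below. It assumes FastAPI and a request/response shape (`file_path`, `feature_vector`) chosen purely for illustration, not an agreed contract; the point is that the inferrer knows nothing about works, images, SQS or our data models - it just receives a pointer to some content and returns inferred data.

```python
# inferrer.py - a minimal sketch of an inferrer's synchronous API.
# The endpoint name, payload fields and the use of FastAPI are assumptions
# made for illustration, not part of this proposal.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class InferenceRequest(BaseModel):
    # Local path to content that the inference manager has already
    # downloaded into the shared cache (see implementation details below)
    file_path: str


class InferenceResponse(BaseModel):
    feature_vector: List[float]


def extract_features(file_path: str) -> List[float]:
    # Placeholder for the actual data science logic, eg running an image
    # through a CNN and returning its embedding
    raise NotImplementedError


@app.post("/infer", response_model=InferenceResponse)
def infer(request: InferenceRequest) -> InferenceResponse:
    return InferenceResponse(feature_vector=extract_features(request.file_path))
```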

The usage of these services would look like this:

[Architecture diagram for data science services]
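
The inference manager itself would be a Scala service, but its per-record flow can be sketched roughly as follows (shown in Python for brevity). The localhost URLs, payload shape and `inferredData` field name are assumptions for illustration only.

```python
# A rough sketch of the inference manager's flow for a single image,
# assuming the inferrers run as sibling containers on the same host and
# expose an endpoint like the one sketched above. The real service would
# be written in Scala; everything here is illustrative.
import requests

INFERRER_URLS = [
    "http://localhost:8001/infer",  # eg feature vector inferrer
    "http://localhost:8002/infer",  # eg palette inferrer
]


def download_image(image: dict, cache_dir: str) -> str:
    # Placeholder: fetch the image file once and write it into the shared,
    # EBS-backed cache that the manager and inferrers both mount
    raise NotImplementedError


def augment(image: dict, cache_dir: str = "/data") -> dict:
    # Do the work required by all of the inferrers exactly once
    local_path = download_image(image, cache_dir)

    # Call each inferrer synchronously and collect its output
    inferred = {}
    for url in INFERRER_URLS:
        response = requests.post(url, json={"file_path": local_path})
        response.raise_for_status()
        inferred.update(response.json())

    # Attach the new data to the image before passing it along the pipeline
    image["inferredData"] = inferred
    return image
```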

Implementation details

  • The inferrers and inference manager exist in one task definition (and therefore on one host).

  • Non-trivial results of the shared work that the manager performs (eg, images that it downloads and that are required by all of the inferrers) are stored in EBS volumes attached to each host and mounted in both the manager and inferrer tasks. In this case, the request from the manager to the inferrers would include a local filesystem path to the files. Because the EBS volumes act only as a shared cache, they can be deleted when the host instance terminates.

  • The model trainer is run as a standalone one-off ECS task from a local script.

  • The inferrer loads the model from S3 when it starts. Inferrer instances will be short-lived (as they'll scale to zero when not in use) so triggering a restart or loading a new model by other means is not necessary.
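
The trainer/inferrer handover described in the last two bullets might look something like the sketch below, assuming the model artefact is serialised with joblib and stored in S3 via boto3, and that training data is read from the catalogue index with an Elasticsearch scan. The bucket, key and index names are placeholders.

```python
# model_lifecycle.py - illustrative only: the bucket/key/index names, joblib
# serialisation and the Elasticsearch bulk read are all assumptions.
import boto3
import joblib
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

MODEL_BUCKET = "catalogue-inferrer-models"    # hypothetical
MODEL_KEY = "feature_extractor/model.joblib"  # hypothetical


def fit(documents):
    # Placeholder for the actual training logic
    raise NotImplementedError


def train(es_host: str) -> None:
    """Run as a one-off ECS task: read the catalogue index in bulk,
    fit a model, and persist it to S3 for the inferrers to use."""
    es = Elasticsearch(es_host)
    documents = (hit["_source"] for hit in scan(es, index="images"))
    model = fit(documents)
    joblib.dump(model, "/tmp/model.joblib")
    boto3.client("s3").upload_file("/tmp/model.joblib", MODEL_BUCKET, MODEL_KEY)


def load_model():
    """Called once when an inferrer starts; because inferrer instances are
    short-lived, a retrained model is picked up the next time one starts."""
    boto3.client("s3").download_file(MODEL_BUCKET, MODEL_KEY, "/tmp/model.joblib")
    return joblib.load("/tmp/model.joblib")
```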

Questions & drawbacks

How do we deal with the fact that different inferrers have to live on one host but may have differing compute requirements? A potential answer: if the requirement is for a GPU, use Elastic Inference rather than GPU instance classes.

Should DS APIs be open to the public?

Should model artefacts be open to the public?

Should DS inferrers work only for things we know about (ie feature vectors can be inferred only for images within the WC ecosystem, by passing an image ID), or for any content (ie feature vectors can be inferred for any image, by passing a publicly accessible URI)?

It seems we'll have to re-infer every existing work/image whenever a model is retrained, which, given the considerations above about network requests etc, will add significantly to the expense of retraining. Is this necessarily the case?
