
RFC 032: Calm deletion watcher


This RFC describes a proposal for a Calm deletion watcher, which will allow us to detect deleted Calm records and update the VHS accordingly.

Last modified: 2021-02-09T17:03:48+00:00

Background

We've realised that the pipeline does not pick up deleted Calm records: once they're deleted, there is no trace of them in the Calm API, but they remain in the source VHS tables. This needs to be resolved, especially because a cleanup project at the end of last year resulted in a very large number of deletions.

So far as we can tell, there is no way to find deleted records other than checking for their absence. This means that our solution needs to take some sort of polling-based approach: going through the source table and checking for the continued presence of each record in the Calm API.

Proposal

The deletion watcher will consist of:

  • A worker task in the calm adapter cluster

  • A lambda

  • The existing reindexer

The Lambda will place a message on the reindexer queue, requesting a "reindex" of Calm records to be sent to the worker. It will be triggered by a scheduled CloudWatch event or manually. By default it will request a full source scan, but a manual trigger will be able to specify particular record IDs, which will be useful for expediting known deletions.
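
As a rough sketch of what that Lambda might look like (the environment variable, message shape, and field names here are assumptions, not the real reindexer contract):

```python
import json
import os

import boto3

sqs = boto3.client("sqs")

# Hypothetical configuration: the real reindexer queue and message
# format may differ.
REINDEXER_QUEUE_URL = os.environ["REINDEXER_QUEUE_URL"]


def handler(event, context):
    # A scheduled CloudWatch event requests a complete scan; a manual
    # invocation may supply a list of specific record IDs instead.
    record_ids = event.get("recordIds")
    if record_ids:
        message = {"source": "calm", "type": "SpecificReindex", "recordIds": record_ids}
    else:
        message = {"source": "calm", "type": "CompleteReindex"}

    sqs.send_message(QueueUrl=REINDEXER_QUEUE_URL, MessageBody=json.dumps(message))
```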

The worker will consume records, filter out those that are already flagged as deleted, and check for the existence of the rest in the Calm API. If the record has been deleted (is no longer present) then it will update the VHS entry to flag the deletion. The deletion flag will live in the DynamoDB record so that the flag can be checked without fetching the object from S3. After updating a record, the worker will notify the transformer topic.
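
In outline, the worker's handling of each record might look like the following sketch, where the Calm client, VHS helper, and transformer topic are hypothetical stand-ins:

```python
def process_record(record, calm_api, vhs, transformer_topic):
    # Records already flagged as deleted need no further checking.
    if record.get("isDeleted"):
        return

    # Check whether the record is still present in the Calm API.
    if calm_api.record_exists(record["id"]):
        return

    # The record has disappeared from Calm. Flag the deletion on the
    # DynamoDB row (so the flag can be read without fetching the
    # object from S3), then notify the transformer topic.
    vhs.flag_deleted(record["id"])
    transformer_topic.publish(record["id"])
```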

The calm transformer will check for the presence of the deletion flag in the source data and create Deleted works as appropriate.
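
A toy version of that check (the flag name and the work model here are simplified assumptions):

```python
from dataclasses import dataclass


@dataclass
class Work:
    id: str
    status: str  # "Visible" or "Deleted"; the real model is richer


def transform(source_record: dict) -> Work:
    # A record whose source data carries the deletion flag becomes a
    # Deleted work; everything else is transformed as usual.
    if source_record.get("isDeleted"):
        return Work(id=source_record["id"], status="Deleted")
    return Work(id=source_record["id"], status="Visible")
```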

This architecture is illustrated below (arrows indicate the direction of data flow):

[Architecture diagram: deletion watcher data flow]

Minimising Calm API queries

Checking for the existence of the records can be done naïvely by performing a search (the only relevant action available to us in the current API version) for a given record ID and confirming that the returned SearchResult number is equal to 1. Obviously, this requires 1 request per record.
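
A minimal sketch of the naïve check, assuming a hypothetical `calm_client` wrapper around the API's search action (the query syntax is illustrative):

```python
def record_exists(calm_client, record_id):
    # The record still exists iff a search for its ID returns exactly
    # one hit.
    result = calm_client.search(f"RecordID={record_id}")
    return result.hit_count == 1
```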

We can, however, do better than this. Consider that we have a large population of records with a fairly low prevalence of deletions. In one query, we can search for a set of N records, and know that the difference between N and the number of results is the number of deleted records d in that set. If there is no difference, we can move on immediately.
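
A sketch of the batched version, using the same hypothetical client (the OR syntax is illustrative):

```python
def count_deletions(calm_client, record_ids):
    # One query for the whole batch: the shortfall between the batch
    # size N and the number of hits is d, the number of deleted
    # records in this set. If d == 0, every record in the batch still
    # exists and we can move straight on.
    query = "|".join(f"RecordID={record_id}" for record_id in record_ids)
    result = calm_client.search(query)
    return len(record_ids) - result.hit_count
```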

At this point we could iterate through either the results or the record IDs to find which are missing, or we could find the missing records via binary search. However, the former only reduces the number of queries when no records are missing from the set, and the latter is poorly suited to finding multiple missing records and does not take advantage of our knowledge of d.

Note: I have verified that the Calm API is happy to take requests for batches of IDs up to (and probably beyond) a size of 1000.

Questions & potential issues

  • Are we really 100% sure that polling for deletions is the only way to detect them?

  • Might we want to be continually checking (more like crawling) for deletions rather than doing it all at once?

Perhaps unsurprisingly, this is a problem that has been considered at length in the literature, and indeed is particularly relevant at the moment: for example, in October Mutesa et al (2020) described a strategy for finding positive SARS-CoV-2 tests in individuals by pooled testing across a population. Our testing tool can tell us how many deleted items there are in a given set, so this is an instance of the specific problem of quantitative group testing.

Although optimal or near-optimal strategies for solving this problem do exist, their implementations are quite involved compared to others, so as a compromise we can use the result from Wang et al (2017), which has a very elegant implementation and reasonably competitive performance.