
Removing deleted records from (re)indexes


When a record is deleted from a source system, we don't expunge it from the pipeline -- instead, we create a stub Work that says "this has been deleted". These return a 410 Gone from the API, with a special message on /works:

[Screenshot: a /works page saying "This work has been deleted"]

as opposed to a 404 Not Found for a work ID that doesn't exist:

[Screenshot: a /works page saying "This isn’t the page you’re looking for, but how about these?"]
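
To make the distinction concrete, here's a minimal sketch of the two responses (the work IDs are made up for illustration; the endpoint is the public catalogue API):

```python
# Minimal sketch of the 410 vs 404 behaviour described above.
# The work IDs here are hypothetical.
import requests

WORKS = "https://api.wellcomecollection.org/catalogue/v2/works"

deleted = requests.get(f"{WORKS}/abcd1234")   # a Work that was deleted
print(deleted.status_code)                    # 410 Gone

missing = requests.get(f"{WORKS}/zzzz9999")   # an ID that never existed
print(missing.status_code)                    # 404 Not Found
```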

At the time of writing, we have 1.4M deleted Works (44% of the 3.2M Works in the index). Every time we reindex, we have to reprocess all of these records -- almost certainly a non-trivial share of the cost of a reindex.

We should remove deleted source records from reindexes.

The proposal

Currently we keep track of source records in a VHS (Versioned Hybrid Store). Every source record that has ever existed has an object in S3 and a pointer to that object in DynamoDB.

When we reindex, we read every item from the DynamoDB table and send each one to the pipeline.
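
As an illustration of that read-everything step, a minimal boto3 sketch (the table and topic names are hypothetical, and the real reindexer is a pipeline service, not a script):

```python
# Sketch of "read every item from the DynamoDB table and send it to the
# pipeline". Table/topic names are hypothetical.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

table = dynamodb.Table("vhs-example-adapter")                   # hypothetical
topic_arn = "arn:aws:sns:eu-west-1:123456789012:reindex-jobs"   # hypothetical

kwargs = {}
while True:
    response = table.scan(**kwargs)
    for item in response["Items"]:
        # Each item is a pointer (ID, version, S3 location) -- the record
        # body itself lives in S3.
        sns.publish(TopicArn=topic_arn, Message=json.dumps(item, default=str))
    if "LastEvaluatedKey" not in response:
        break
    kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
```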

The proposed approach is to have two DynamoDB tables with an identical structure:

  • A "live" table where we write new records in the adapter, and which is used by the catalogue pipeline and the reindexer. This is a continuation of the deleted table.

  • A "deleted" table. We move pointers to this table for source records that we don't want in the pipeline. This table is ignored by the catalogue pipeline and the reindexer.

We'd keep the same set of records overall, but now they're split over two tables. Over time, we move records between the two tables as we want to expose or hide them in the catalogue pipeline.

[Diagram: two blue DynamoDB tables (left) pointing to a red S3 bucket (right)]

We'd have per-source scripts with two options (sketched below):

  • Move all records deleted before a given date from "live" to "deleted"

  • Move specific records from "deleted" to "live"

This is intentionally a lossless, reversible process. We're not permanently deleting any data, and we could revert it later if there are unforeseen issues.
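
As a rough illustration, here's a minimal sketch of the first script, assuming hypothetical table names, an `id` partition key, and a `deletedDate` attribute on each pointer (the real schema may differ):

```python
# Sketch of "move every record deleted before a given date" from the live
# table to the deleted table. Table names, the "id" key, and the
# "deletedDate" attribute are all hypothetical.
import boto3
from boto3.dynamodb.conditions import Attr
from boto3.dynamodb.types import TypeSerializer

LIVE_TABLE = "vhs-example-adapter"             # hypothetical
DELETED_TABLE = "vhs-example-adapter-deleted"  # hypothetical

dynamodb = boto3.resource("dynamodb")
client = boto3.client("dynamodb")
serializer = TypeSerializer()


def move_deleted_before(cutoff_date: str) -> None:
    table = dynamodb.Table(LIVE_TABLE)
    kwargs = {"FilterExpression": Attr("deletedDate").lt(cutoff_date)}
    while True:
        response = table.scan(**kwargs)
        for item in response["Items"]:
            low_level = {k: serializer.serialize(v) for k, v in item.items()}
            # Put into "deleted" and remove from "live" in one transaction,
            # so a pointer is never in both tables (or neither).
            client.transact_write_items(
                TransactItems=[
                    {"Put": {"TableName": DELETED_TABLE, "Item": low_level}},
                    {
                        "Delete": {
                            "TableName": LIVE_TABLE,
                            "Key": {"id": low_level["id"]},
                        }
                    },
                ]
            )
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]


move_deleted_before("2017-01-01")  # e.g. everything deleted before 2017
```

The second script is the same transaction with the two table names swapped, applied to an explicit list of IDs rather than a date filter.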

Benefits

  • We'd save money on reindexes. Deleted Works account for 44% of the index, so they're probably a substantial share of the cost of a reindex.

    (I don't have exact numbers, but a reindex costs ~$150, and deleted Works are 44% of the records -- which would be ~$66 if every record cost the same to process. Deleted Works are smaller than visible Works so they're a bit cheaper, but a lot of the costs are the same. I'd guess we're spending ~$50 per reindex to process deleted Works.)

  • Reindexes would go faster. Having fewer records means less work to do in a reindex, so they should finish faster.

Risks and mitigations

  • 410 Gone would become 404 Not Found for any Works that we moved. If we don't put deleted Works in the final Elasticsearch index, the API will serve them as 404 Not Found instead of 410 Gone. This is potentially annoying for anybody who has a URL to a work that was previously available.

    As a first pass, we can avoid this by only removing Works that were deleted before the new catalogue API/website launched -- and so never had public URLs under the current URL scheme. If we filter to anything deleted before 1 Jan 2017, there are 876k Works (27% of the 3.2M total), which would still be a significant saving.

    Although we could move more records to the "deleted" table if we wanted to, we don't need to do that for now. It remains a future option if we decide we can live without the 404/410 distinction.

    We could also move individual records back to the "live" table as required.

  • We don't want to lose an audit trail around source data. Different source systems track deletions differently:

    • Sierra does "soft" deletes. When a record is deleted, Sierra still keeps a stub record to say "there used to be a record XYZ which was deleted". We can retrieve a deleted record from the Sierra API.

    • Calm does "hard" deletes. When a record is deleted, Calm removes it from the database and API calls (which is why we have the Calm deletion checker). The only way to know the record ever existed is to look at our source data tables.

    Completely removing these source records might cause us unforeseen issues later -- which is why the proposal is to move them into a separate table, rather than delete them.

  • Reindex stability might take a temporary hit. The pipeline services process Works in batches, e.g. write 50 Works to Elasticsearch, read 30 Works. If the size of all the Works in a batch is too big, you get an out-of-memory error (either from our apps or Elasticsearch) -- so we've picked batch sizes that don't hit these limits.

    Deleted Works are pretty small -- when they're removed from the pipeline, the average size of a batch will go up, and we might need to tweak the batch sizes if we start hitting memory errors (see the sketch after this list).

  • Some as-yet-unforeseen issue. If this doesn't work in practice, we reverse the process: merge the "deleted" table back into the "live" table, remove the "deleted" table, and we're back at our current setup.
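
One possible mitigation for the batch-size risk above (an illustration only, not the pipeline's actual code) is to cap batches by total payload size as well as by count, so batches shrink automatically as average document size goes up:

```python
# Illustrative size-aware batching: flush when a batch hits a maximum count
# OR a maximum total payload size, so larger average documents automatically
# produce smaller batches. The limits here are made-up examples.
from typing import Iterable, Iterator


def batches(
    docs: Iterable[bytes], max_count: int = 50, max_bytes: int = 5_000_000
) -> Iterator[list[bytes]]:
    batch: list[bytes] = []
    batch_bytes = 0
    for doc in docs:
        if batch and (
            len(batch) >= max_count or batch_bytes + len(doc) > max_bytes
        ):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += len(doc)
    if batch:
        yield batch
```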

Possible future extensions

  • We could move more records to the "deleted" table if we're happy to live without the 404/410 distinction.

  • We could set up a separate pipeline that indexes deleted Works directly from the "deleted" table, bypassing the more expensive parts of the catalogue pipeline.
