
Incident retro - slow search due to 900k messages on the ingestor queue



Incident from: 2022-10-04

Incident until: 2022-10-04

Retro held: 2022-10-04

Timeline

4 October 2022

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1664871483404409

~02:00 900k messages on the ingestor queue

09:00 A user reported an issue with search and posted an image of an internal server error. Very slow search confirmed by NP.

09:25 AC noted the cluster claims to be healthy, but guessed it was under sustained load somehow.

09:36 AC found the issue: the works ingestor is hammering the cluster (see the diagnostic sketch after this timeline).

09:38 AC applied a fix. The basic issue is that there are 900k messages on the ingestor queue. The ingestor is the app that populates the API index; if we send too many writes to Elasticsearch, the cluster will struggle to respond to incoming requests (i.e. users). It had been saturated since ~02:00 this morning, which is when a Calm ~> Sierra record harvest occurs (this is to allow items catalogued in Calm to be ordered through Sierra). Most of those records are a no-op change for us: the only update is the "last synced from Calm" field, which we don't expose on the front-end, so we filter those updates out. But a tiny change, "MeSH, not MESH" (https://github.com/wellcomecollection/catalogue-pipeline/pull/2212, which lowercased the 'e' to match the NLM website), will have caused everything to get re-sent.

09:42 AC reported the API seems to be back.

09:44 NP confirmed that search is running fast again
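For the diagnosis at 09:25–09:36, one way to confirm that bulk indexing is crowding out search traffic is to look at the write thread pool queues and rejections on the cluster. This is a minimal sketch only; the endpoint and credentials are placeholders, not the real cluster details:

```python
import os

import requests

# Placeholders: the real Elastic Cloud endpoint and credentials live in the team's secrets store.
ES_ENDPOINT = os.environ["ES_ENDPOINT"]
ES_AUTH = (os.environ["ES_USER"], os.environ["ES_PASSWORD"])

# Per-node write thread pool stats: a growing queue and non-zero rejections
# are a strong sign that indexing load is starving search requests.
resp = requests.get(
    f"{ES_ENDPOINT}/_cat/thread_pool/write",
    params={"v": "true", "h": "node_name,active,queue,rejected"},
    auth=ES_AUTH,
)
resp.raise_for_status()
print(resp.text)
```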

Analysis of causes

Unexpected load from the overnight Calm ~> Sierra harvest put ~900k messages on the ingestor queue. The ingestor is the app that populates the API index, and sending that many writes to Elasticsearch left the cluster struggling to respond to incoming requests (i.e. users).
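To illustrate why the harvest is normally harmless, here is a minimal sketch of the no-op filtering described above. The field name and record shape are hypothetical, not the pipeline's real model:

```python
# Hypothetical field name: stands in for the "last synced from Calm" timestamp,
# which we don't expose on the front-end and therefore ignore when diffing.
IGNORED_FIELDS = {"calmLastSyncedAt"}

def is_noop_update(old_record: dict, new_record: dict) -> bool:
    """True if the records only differ in fields we deliberately ignore."""
    def strip(record: dict) -> dict:
        return {k: v for k, v in record.items() if k not in IGNORED_FIELDS}
    return strip(old_record) == strip(new_record)

# A change to any other field, even a cosmetic label fix like "MESH" -> "MeSH",
# makes the records differ, so every affected work gets re-sent to the ingestor queue.
```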

Actions

  • If the queue crosses a threshold, e.g. 1.5 million messages, stop the ingestor and send a message to Slack (check SQS metrics to determine the right threshold); a rough sketch of this idea follows the list below. Discussed but decided not to do.

Alex

  • Investigate why a label change caused this problem

Paul

  • Investigate the retention time on the queues when not reindexing
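For the first action (the queue-threshold circuit breaker that was discussed but not adopted), a rough sketch of what it could look like is below. The queue URL, ECS cluster and service names, Slack webhook, and the 1.5 million figure are all placeholders rather than an agreed design:

```python
import json
import os
import urllib.request

import boto3

QUEUE_URL = os.environ["INGESTOR_QUEUE_URL"]          # hypothetical: the works ingestor input queue
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]   # hypothetical: incoming webhook for the alerts channel
THRESHOLD = 1_500_000                                 # example figure from the retro; tune against SQS metrics

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

def check_queue_and_maybe_stop_ingestor() -> None:
    """If the backlog exceeds THRESHOLD, pause the ingestor and tell Slack."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    if backlog > THRESHOLD:
        # Assumption: the ingestor runs as an ECS service that can be paused by scaling to 0.
        ecs.update_service(cluster="catalogue-pipeline", service="works-ingestor", desiredCount=0)

        message = f"Ingestor queue backlog is {backlog:,} messages; ingestor paused to protect the cluster."
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": message}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```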
