
Incident retro - slow search due to 900k messages on the ingestor queue



Incident from: 2022-10-04

Incident until: 2022-10-04

Retro held: 2022-10-04

Timeline

4 October 2022

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1664871483404409

~02:00 900k messages on the ingestor queue

09:00 A user reported an issue with search and posted an image of an internal server error. Very slow search confirmed by NP.

09:25 AC noted the cluster claims to be healthy, but guessed it was under sustained load somehow.

09:36 AC found the issue: the works ingestor is hammering the cluster (see the diagnostic sketch after this timeline).

09:38 AC applied a fix. The basic issue is that there are 900k messages on the ingestor queue. The ingestor is the app that populates the API index; if we send too many writes to Elasticsearch, the cluster will struggle to respond to incoming requests (i.e. users). It had been saturated since ~02:00 this morning, which is when a Calm ~> Sierra record harvest occurs (this is to allow items catalogued in Calm to be ordered through Sierra). Most of those records are a no-op change for us: the only update is the "last synced from Calm" field, which we don't expose on the front-end, so we filter those updates out. But a tiny change, "MeSH, not MESH" (https://github.com/wellcomecollection/catalogue-pipeline/pull/2212, which lowercased the 'e' to match the NLM website), will have caused everything to get re-sent.

09:42 AC reported the API seems to be back.

09:44 NP confirmed that search is running fast again
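For the diagnosis at 09:25–09:36, one way to confirm that bulk indexing is crowding out search traffic is to look at the write thread pool queues and rejections on the cluster. This is a minimal sketch only; the endpoint and credentials are placeholders, not the real cluster details:

```python
import os

import requests

# Placeholders: the real Elastic Cloud endpoint and credentials live in the team's secrets store.
ES_ENDPOINT = os.environ["ES_ENDPOINT"]
ES_AUTH = (os.environ["ES_USER"], os.environ["ES_PASSWORD"])

# Per-node write thread pool stats: a growing queue and non-zero rejections
# are a strong sign that indexing load is starving search requests.
resp = requests.get(
    f"{ES_ENDPOINT}/_cat/thread_pool/write",
    params={"v": "true", "h": "node_name,active,queue,rejected"},
    auth=ES_AUTH,
)
resp.raise_for_status()
print(resp.text)
```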

Analysis of causes

Unexpected load from the overnight Calm ~> Sierra harvest put ~900k messages on the ingestor queue. The ingestor is the app that populates the API index, and sending that many writes to Elasticsearch left the cluster struggling to respond to incoming requests (i.e. users).
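To illustrate why the harvest is normally harmless, here is a minimal sketch of the no-op filtering described above. The field name and record shape are hypothetical, not the pipeline's real model:

```python
# Hypothetical field name: stands in for the "last synced from Calm" timestamp,
# which we don't expose on the front-end and therefore ignore when diffing.
IGNORED_FIELDS = {"calmLastSyncedAt"}

def is_noop_update(old_record: dict, new_record: dict) -> bool:
    """True if the records only differ in fields we deliberately ignore."""
    def strip(record: dict) -> dict:
        return {k: v for k, v in record.items() if k not in IGNORED_FIELDS}
    return strip(old_record) == strip(new_record)

# A change to any other field, even a cosmetic label fix like "MESH" -> "MeSH",
# makes the records differ, so every affected work gets re-sent to the ingestor queue.
```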

Actions

  • If the queue crosses a threshold, e.g. 1.5 million messages, stop the ingestor and send a message to Slack (check SQS metrics to determine the right threshold); a rough sketch of this idea follows the list below. Discussed but decided not to do.

Alex

  • Investigate why a label change caused this problem

Paul

  • Investigate the retention time on the queues when not reindexing
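For the first action (the queue-threshold circuit breaker that was discussed but not adopted), a rough sketch of what it could look like is below. The queue URL, ECS cluster and service names, Slack webhook, and the 1.5 million figure are all placeholders rather than an agreed design:

```python
import json
import os
import urllib.request

import boto3

QUEUE_URL = os.environ["INGESTOR_QUEUE_URL"]          # hypothetical: the works ingestor input queue
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]   # hypothetical: incoming webhook for the alerts channel
THRESHOLD = 1_500_000                                 # example figure from the retro; tune against SQS metrics

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

def check_queue_and_maybe_stop_ingestor() -> None:
    """If the backlog exceeds THRESHOLD, pause the ingestor and tell Slack."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    if backlog > THRESHOLD:
        # Assumption: the ingestor runs as an ECS service that can be paused by scaling to 0.
        ecs.update_service(cluster="catalogue-pipeline", service="works-ingestor", desiredCount=0)

        message = f"Ingestor queue backlog is {backlog:,} messages; ingestor paused to protect the cluster."
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": message}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```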
