Incident Retros
  • Incident retro - Internal model <=> Elastic index sync question
  • Incident retro - merging
  • Incident retro - Miro images
  • Incident retro - January downtime
  • Incident retro - Elastic Cloud
  • Incident retro - stories and home page down
  • Incident retro - search not available
  • Incident retro - ingestors
  • Incident retro - home page with json
  • Incident retro - slow search
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - cross cluster replication
  • Incident retro - 500'ing on the /images endpoint
  • Incident retro - home page and stories page not available
  • Incident retro - home page and what's on not available
  • Incident retro - requests not showing in account
  • Incident retro - requests not showing in account
  • Incident retro - works search errors
  • Incident retro - date picker
  • Incident retro - requests not showing in account
  • Incident retro - reporting cluster downtime and configuration loss
  • Incident retro - story page appearing then replaced by a 404
  • Incident retro - increased rate of errors in searches on wellcomecollection.org
  • Incident retro - slow search due to 900k messages on the ingestor queue
  • Incident retro - concept pages not available
  • Incident retro - Prismic model changes
  • Incident retro - Images search down
  • Incident retro - wc.org intermittently available
  • Incident retro - web site not available
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - digital assets not available
Powered by GitBook
On this page
  • Timeline
  • Analysis of causes
  • Actions

Incident retro - works search errors

PreviousIncident retro - requests not showing in accountNextIncident retro - date picker

Last updated 10 months ago

Incident from: 2022-05-24

Incident until: 2022-05-24

Retro held: 2022-05-26

Timeline

24 May 2022

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1653408854658319

15.03 alerts for errors in the catalogue API, followed by a continuous stream of alerts for cloudfront

15.15 (approx) CCR of works index started from pipeline cluster into rank cluster

16.00 (approx) HP noticed 500 and asked if search was broken in #wc-incident-response, later deleted that message as reran search and it worked as expected

17.14 AC https://wellcomecollection.org/works?page=1&query=conosceva looks like the API is unhappy

Saw api was flaky

Cleared index caches which didn’t help

Suggested check refresh interval seen in previous issues - wasn’t that

A rolling restart did fix the issue

PB That link works for me in production but not if I switch my toggle to staging api

17.43 AC We seem to have fixed the issue

Analysis of causes

Combination of different factors:

  • CCR was the trigger which depleted some resource

  • Increased document size was a contributing factor

  • Not being able to clear out the circuit breakers

Actions

Harrison

  • Add notes to rank documentation as reminders for:

    • Increase cluster size when doing a CCR as that is memory intensive and remember to turn down the size afterwards

    • More communication with colleagues when you’re about to use CCR

    • CCR works and images separately

Alex

  • Improve alerts in Slack so it’s easier to triage and spot new errors - DONE

Team

  • Fix known errors in Slack as they arise

Timeline
Analysis of causes
Actions