Incident Retros
  • Incident retro - Internal model <=> Elastic index sync question
  • Incident retro - merging
  • Incident retro - Miro images
  • Incident retro - January downtime
  • Incident retro - Elastic Cloud
  • Incident retro - stories and home page down
  • Incident retro - search not available
  • Incident retro - ingestors
  • Incident retro - home page with json
  • Incident retro - slow search
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - cross cluster replication
  • Incident retro - 500'ing on the /images endpoint
  • Incident retro - home page and stories page not available
  • Incident retro - home page and what's on not available
  • Incident retro - requests not showing in account
  • Incident retro - requests not showing in account
  • Incident retro - works search errors
  • Incident retro - date picker
  • Incident retro - requests not showing in account
  • Incident retro - reporting cluster downtime and configuration loss
  • Incident retro - story page appearing then replaced by a 404
  • Incident retro - increased rate of errors in searches on wellcomecollection.org
  • Incident retro - slow search due to 900k messages on the ingestor queue
  • Incident retro - concept pages not available
  • Incident retro - Prismic model changes
  • Incident retro - Images search down
  • Incident retro - wc.org intermittently available
  • Incident retro - web site not available
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - digital assets not available
Powered by GitBook
On this page
  • Timeline
  • Analysis of causes
  • Actions

Incident retro - January downtime

PreviousIncident retro - Miro imagesNextIncident retro - Elastic Cloud

Last updated 10 months ago

Incident from: 2021-01-28

Incident till: 2021-01-23

Retro held: 2021-01-29

Timeline

28 January 2021

See https://wellcome.slack.com/archives/C3TQSF63C/p1611911159323900

19.00 AC merged a change to add the license aggregation to the /images API

19.08 deployment of catalogue API

19.13 onwards works API search, works api single work, front end works search, Wellcome Images redirect down alerts

19.20 Alex rolls back to previous production release of catalogue API

19.20, 19.27 recovery notification

Analysis of causes

Expected change to not need a reindex, single PR didn’t, but deployment of master had more there than in the single PR.

The commit had a green tick so looked fine.

Fields added to internal model did need the reindex. Because that reindex hadn’t happened it caused the API to fall over.

Integration tests started failing Tue 26 January https://wellcome.slack.com/archives/C018ELHJVFE/p1611861071000400

This wasn’t seen so didn’t know there was an issue. Integration tests are in wc-platform-builds channel. Had been moved out of wc-platform because they were too noisy.

Actions

JP Have build badges in Build Kite for:

  • the main build

  • integration tests - https://github.com/wellcomecollection/catalogue/pull/1328

AC Reduce frequency of alerts so that it alerts:

  • At the start of every day

  • When state changes

  • https://github.com/wellcomecollection/platform/issues/5000

AC Move the alert into wc-platform-alerts

AFC Write RFC about internal model versioning and decoupling API and pipeline https://github.com/wellcomecollection/platform/issues/4998

Timeline
Analysis of causes
Actions