Incident retro - January downtime

Incident from: 2021-01-28

Incident till: 2021-01-23

Retro held: 2021-01-29

Timeline

28 January 2021

See https://wellcome.slack.com/archives/C3TQSF63C/p1611911159323900

19.00 AC merged a change to add the license aggregation to the /images API

19.08 deployment of catalogue API

19.13 onwards works API search, works api single work, front end works search, Wellcome Images redirect down alerts

19.20 Alex rolls back to previous production release of catalogue API

19.20, 19.27 recovery notification

Analysis of causes

Expected change to not need a reindex, single PR didn’t, but deployment of master had more there than in the single PR.

The commit had a green tick so looked fine.

Fields added to internal model did need the reindex. Because that reindex hadn’t happened it caused the API to fall over.

Integration tests started failing Tue 26 January https://wellcome.slack.com/archives/C018ELHJVFE/p1611861071000400

This wasn’t seen so didn’t know there was an issue. Integration tests are in wc-platform-builds channel. Had been moved out of wc-platform because they were too noisy.

Actions

JP Have build badges in Build Kite for:

  • the main build

  • integration tests - https://github.com/wellcomecollection/catalogue/pull/1328

AC Reduce frequency of alerts so that it alerts:

  • At the start of every day

  • When state changes

  • https://github.com/wellcomecollection/platform/issues/5000

AC Move the alert into wc-platform-alerts

AFC Write RFC about internal model versioning and decoupling API and pipeline https://github.com/wellcomecollection/platform/issues/4998

Last updated