Incident retro - January downtime
Incident from: 2021-01-28
Incident till: 2021-01-28
Retro held: 2021-01-29
Timeline
28 January 2021
See https://wellcome.slack.com/archives/C3TQSF63C/p1611911159323900
19.00 AC merged a change to add the license aggregation to the /images API
19.08 deployment of catalogue API
19.13 onwards down alerts for works API search, works API single work, front end works search and the Wellcome Images redirect
19.20 Alex rolls back to previous production release of catalogue API
19.20, 19.27 recovery notifications
Analysis of causes
We expected the change not to need a reindex. The single PR on its own didn’t, but the deployment of master included more changes than that PR.
The commit had a green tick, so it looked fine.
Fields added to the internal model did need a reindex. Because that reindex hadn’t happened, the API fell over.
Integration tests started failing Tue 26 January https://wellcome.slack.com/archives/C018ELHJVFE/p1611861071000400
Nobody saw the failures, so we didn’t know there was an issue. Integration test results are posted in the wc-platform-builds channel. They had been moved out of wc-platform because they were too noisy.
Actions
JP Have build badges in Buildkite for:
the main build
integration tests - https://github.com/wellcomecollection/catalogue/pull/1328
AC Reduce the frequency of alerts so that they are sent:
At the start of every day
When state changes
https://github.com/wellcomecollection/platform/issues/5000
AC Move the alert into wc-platform-alerts
AFC Write RFC about internal model versioning and decoupling API and pipeline https://github.com/wellcomecollection/platform/issues/4998
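The alert frequency rule in the actions above (alert at the start of every day and when state changes) could be sketched roughly as follows. This is an illustration only, not the team's implementation: the class name AlertThrottle, the method should_alert and the state strings "passing" and "failing" are all hypothetical, and it assumes "start of every day" means the first test result seen on each calendar day, whatever its state.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class AlertThrottle:
    """Hypothetical helper: decides whether a test result should be posted."""

    last_state: Optional[str] = None        # state of the previous result seen
    last_alert_day: Optional[date] = None   # calendar day of the last alert sent

    def should_alert(self, state: str, now: datetime) -> bool:
        # Alert when the state differs from the previous result...
        state_changed = state != self.last_state
        # ...or when no alert has been sent yet on this calendar day.
        first_of_day = self.last_alert_day != now.date()

        self.last_state = state
        if state_changed or first_of_day:
            self.last_alert_day = now.date()
            return True
        return False


throttle = AlertThrottle()
results = [
    throttle.should_alert("failing", datetime(2021, 1, 26, 9, 0)),   # first result seen
    throttle.should_alert("failing", datetime(2021, 1, 26, 9, 30)),  # same day, same state
    throttle.should_alert("passing", datetime(2021, 1, 26, 10, 0)),  # state changed
    throttle.should_alert("passing", datetime(2021, 1, 27, 9, 0)),   # first result of a new day
]
print(results)  # [True, False, True, True]
```

Under this rule a repeated failure within one day stays quiet, which is what makes the channel less noisy, while a persistent failure still resurfaces each morning.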