Incident retro - reporting cluster downtime and configuration loss
Incident from: 2022-09-05
Incident until: 2022-09-08
Retro held: 2022-09-12
Timeline
See https://wellcome.slack.com/archives/C3TQSF63C/p1611911159323900 and #wc-reporting-cluster-reconfiguration channel
25 August 2022
11.07: Started a rolling upgrade to 8.4.0 on the reporting cluster
13.04: Upgrade failed, leaving the cluster in an unclear “stuck” state. Everything was still working fine.
26 August 2022
14.41: Opened case with Elastic to resolve stuck state
29 August 2022
08.49: Automated maintenance (presumably to resolve state mismatch) failed, but with no adverse effects.
14.28: Advised by Elastic to delete some non-migrated indices
31 August 2022
15.48: Index migration/removal completed. The upgrade was still failing, and Elastic was informed.
4 September 2022
04.05: Another automated system maintenance event, moving nodes around. We think this is the event that started the downtime:
“Move nodes off of allocator i-0900f7512ca10119c due to routine system maintenance”
5 September 2022
08.20: Another failed automated maintenance event
09.12: High severity ticket raised with Elastic.
09.24: Elastic start manual recreation of cluster
10.27: Cluster recreation successful.
10.49: Kibana node upgrade started by JP (required manual changes in config)
11.08: Kibana upgrade complete
12.38: JP notifies Elastic that while data indices are present, Kibana saved objects have been lost
14.35: Elastic respond, noting that snapshot retention was short and that unassigned indices had been lost.
15.23: Elastic confirm configuration loss.
15.56: JP sets up #wc-reporting-cluster-reconfiguration and begins reprovisioning application credentials.
17.13: JP reports that all application credentials/roles are believed to be reprovisioned.
Analysis of causes
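The snapshot policy for the reporting cluster retained snapshots for only one hour, so by the time the configuration loss was noticed there was no snapshot left to restore from. Retention has since been increased to 30 days.
The reporting cluster started as a side project/prototype, and its configuration should have been reviewed when we came to rely on it.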
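As a rough illustration of the retention setting discussed above, the sketch below builds a request body in the shape used by Elasticsearch snapshot lifecycle management (`PUT _slm/policy/<policy_id>`) and checks that its `retention.expire_after` value covers at least 30 days. The policy name pattern, schedule, `min_count`/`max_count` values and the helper function names are all assumptions made for the example, not the reporting cluster's actual settings. `found-snapshots` is the repository name Elastic Cloud deployments normally use, but it should be confirmed against the cluster.

```python
import re

# Seconds per supported time unit. Only d/h/m/s are handled in this sketch.
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def build_slm_policy(expire_after="30d", min_count=5, max_count=100):
    """Return a body shaped like an SLM policy (PUT _slm/policy/<policy_id>).

    All values here are illustrative assumptions, not the real cluster config.
    """
    return {
        "schedule": "0 30 1 * * ?",           # hypothetical: daily at 01:30
        "name": "<reporting-snap-{now/d}>",   # hypothetical snapshot name pattern
        "repository": "found-snapshots",      # assumed repository name
        "config": {"include_global_state": True},  # also snapshot cluster state
        "retention": {
            "expire_after": expire_after,     # how long snapshots are kept
            "min_count": min_count,
            "max_count": max_count,
        },
    }


def retention_seconds(policy):
    """Convert retention.expire_after (e.g. '30d', '1h') to seconds."""
    value = policy["retention"]["expire_after"]
    match = re.fullmatch(r"(\d+)([dhms])", value)
    if match is None:
        raise ValueError(f"unsupported expire_after value: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def meets_minimum_retention(policy, minimum_days=30):
    """True if the policy keeps snapshots for at least minimum_days."""
    return retention_seconds(policy) >= minimum_days * 86400


if __name__ == "__main__":
    print(meets_minimum_retention(build_slm_policy("30d")))
    print(meets_minimum_retention(build_slm_policy("1h")))
```

A check like `meets_minimum_retention` could run against the policy returned by the cluster as part of a periodic review, so that a short retention value on a cluster we depend on is noticed before an incident rather than after.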
Actions
Jamie
Increase reporting cluster snapshot retention to 30 days (done)
For planning
Move reporting cluster config into Terraform