Incident retro - reporting cluster downtime and configuration loss
Last updated
Incident from: 2022-09-05
Incident until: 2022-09-08
Retro held: 2022-09-12
See the Slack thread at https://wellcome.slack.com/archives/C3TQSF63C/p1611911159323900 and the #wc-reporting-cluster-reconfiguration channel.
25 August 2022
11.07: Started a rolling upgrade to 8.4.0 on the reporting cluster
13.04: Upgrade failed, leaving the cluster in an unclear “stuck” state; everything was still working fine.
26 August 2022
14.41: Opened case with Elastic to resolve stuck state
29 August 2022
08.49: Automated maintenance (presumably to resolve state mismatch) failed, but with no adverse effects.
14.28: Advised by Elastic to delete some non-migrated indices
31 August 2022
15.48: Index migration/removal completed; the upgrade was still failing and Elastic were informed.
4 September 2022
04.05: Another automated system maintenance event - moving nodes around. We think this is the event which started the downtime
“Move nodes off of allocator i-0900f7512ca10119c due to routine system maintenance”
5 September 2022
08.20: Another failed automated maintenance event
09.12: High severity ticket raised with Elastic.
09.24: Elastic start manual recreation of cluster
10.27: Cluster recreation successful.
10.49: Kibana node upgrade started by JP (required manual changes in config)
11.08: Kibana upgrade complete
12.38: JP notifies Elastic that while data indices are present, Kibana saved objects have been lost
14.35: Elastic respond, noting short snapshot retention and unassigned indices being lost.
15.23: Elastic confirm configuration loss.
15.56: JP sets up #wc-reporting-cluster-reconfiguration and begins reprovisioning application credentials.
17.13: JP believes all application credentials/roles are now reprovisioned.
The snapshot retention policy for the reporting cluster was set to one hour. It has since been changed to 30 days.
The reporting cluster started as a side project/prototype, and its configuration should have been reviewed once we came to rely on it.
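As an illustration of the retention change, here is a minimal sketch of a snapshot lifecycle management (SLM) policy body with a 30-day expiry, as it might be sent to Elasticsearch's `PUT _slm/policy/<policy_id>` endpoint. The policy ID, repository name, snapshot name pattern, schedule and min/max counts are assumptions for the example and are not taken from the incident record. The script only builds and prints the request and does not contact a cluster.

```python
import json

# Hypothetical identifiers: these are illustrative placeholders,
# not the values used on the reporting cluster.
POLICY_ID = "reporting-snapshot-policy"
REPOSITORY = "reporting-snapshots"


def build_slm_policy(expire_after: str = "30d") -> dict:
    """Build an SLM policy body whose snapshots expire after `expire_after`."""
    return {
        # Snapshot name pattern using date math (assumed naming scheme).
        "name": "<reporting-snap-{now/d}>",
        # Cron schedule: 01:30 every day (assumed).
        "schedule": "0 30 1 * * ?",
        "repository": REPOSITORY,
        # Include the cluster state alongside the index data.
        "config": {"include_global_state": True},
        "retention": {
            # Keep snapshots for 30 days rather than one hour.
            "expire_after": expire_after,
            # Assumed bounds on how many snapshots are retained.
            "min_count": 5,
            "max_count": 100,
        },
    }


if __name__ == "__main__":
    body = build_slm_policy()
    print(f"PUT _slm/policy/{POLICY_ID}")
    print(json.dumps(body, indent=2))
```

Keeping the policy as data like this would also make it easier to carry the same values into Terraform later.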
Jamie
Increase reporting cluster snapshot retention to 30 days (done)
For planning
Move reporting cluster config into Terraform