Incident retro - reporting cluster downtime and configuration loss



Incident from: 2022-09-05

Incident until: 2022-09-08

Retro held: 2022-09-12

Timeline

See https://wellcome.slack.com/archives/C3TQSF63C/p1611911159323900 and #wc-reporting-cluster-reconfiguration channel

25th August 2022

11.07: Started a rolling upgrade to 8.4.0 on the reporting cluster

13.04: Upgrade failed, leaving the cluster in an unclear “stuck” state. Everything was still working fine.

26th August

14.41: Opened case with Elastic to resolve stuck state

29th August

08.49: Automated maintenance (presumably to resolve state mismatch) failed, but with no adverse effects.

14.28: Advised by Elastic to delete some non-migrated indices

31st August

15.48: Index migration/removal completed. The upgrade was still failing, and Elastic were informed.

4th September

04.05: Another automated system maintenance event, moving nodes around. We think this is the event that started the downtime.

“Move nodes off of allocator i-0900f7512ca10119c due to routine system maintenance”

5th September

08.20: Another failed automated maintenance event

09.12: High severity ticket raised with Elastic.

09.24: Elastic start manual recreation of cluster

10.27: Cluster recreation successful.

10.49: Kibana node upgrade started by JP (required manual changes in config)

11.08: Kibana upgrade complete

12.38: JP notifies Elastic that while data indices are present, Kibana saved objects have been lost

14.35: Elastic respond, noting short snapshot retention and unassigned indices being lost.

15.23: Elastic confirm configuration loss.

15.56: JP sets up #wc-reporting-cluster-reconfiguration and begins reprovisioning application credentials.

17.13: JP: “I think all application credentials/roles are now reprovisioned”

Analysis of causes

The snapshot policy for the reporting cluster was set to a retention of one hour, which Elastic noted as a factor in the configuration loss. Retention has since been increased to 30 days.

The reporting cluster started as a side project/prototype. Its configuration, including the snapshot policy, should have been reviewed once we came to rely on it.
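
The retention change described above can be sketched as a snapshot lifecycle management (SLM) policy body plus a small check that the retention window is long enough. This is a minimal sketch, not the configuration actually applied: the schedule, snapshot name, repository name and the helper functions are all assumptions made for illustration, and only the one-hour and 30-day values come from this retro.

```python
import re

# Seconds per Elasticsearch time unit (only d, h, m, s are handled in this sketch).
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(value: str) -> int:
    """Convert an Elasticsearch-style duration such as "30d" or "1h" to seconds."""
    match = re.fullmatch(r"(\d+)([dhms])", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def build_slm_policy(expire_after: str) -> dict:
    """Build a request body for `PUT _slm/policy/<policy_id>`.

    Everything except `expire_after` is an illustrative assumption.
    """
    return {
        "schedule": "0 */30 * * * ?",         # assumed: snapshot every 30 minutes
        "name": "<reporting-snap-{now/d}>",   # hypothetical snapshot name pattern
        "repository": "found-snapshots",      # assumed repository name
        "config": {"include_global_state": True},
        "retention": {"expire_after": expire_after},
    }


def retention_is_sufficient(policy: dict, minimum: str = "30d") -> bool:
    """Return True if the policy keeps snapshots for at least `minimum`."""
    return to_seconds(policy["retention"]["expire_after"]) >= to_seconds(minimum)


old_policy = build_slm_policy("1h")   # retention at the time of the incident
new_policy = build_slm_policy("30d")  # retention after the fix

print(retention_is_sufficient(old_policy))
print(retention_is_sufficient(new_policy))
```

A check like `retention_is_sufficient` could run as part of a periodic review, so that a short retention window on a cluster we depend on is flagged before it matters.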

Actions

Jamie

  • Increase reporting cluster snapshot policy to 30 days - done

For planning

  • Move reporting cluster config into terraform
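
Until the configuration lives in Terraform, one possible interim safeguard is to export Kibana saved objects to a file kept outside the cluster. The sketch below is not Terraform and is not something the team has agreed to do. It only builds the HTTP request for Kibana's saved objects export endpoint without sending it. The URL, API key and list of object types are placeholders, and `build_export_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_export_request(
    kibana_url: str, api_key: str, types: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST to Kibana's saved objects export API."""
    body = json.dumps({"type": types, "includeReferencesDeep": True}).encode("utf-8")
    return urllib.request.Request(
        url=f"{kibana_url.rstrip('/')}/api/saved_objects/_export",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "kbn-xsrf": "true",  # Kibana requires this header on API writes
            "Authorization": f"ApiKey {api_key}",
        },
    )


# Placeholder values; a real run would use the reporting cluster's Kibana URL
# and a credential read from a secrets store.
request = build_export_request(
    "https://kibana.example.invalid",
    "PLACEHOLDER_API_KEY",
    ["dashboard", "visualization", "index-pattern"],
)
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would return newline-delimited JSON, which could be committed to version control alongside the eventual Terraform configuration.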
