> For the complete documentation index, see [llms.txt](https://docs.wellcomecollection.org/incident-retros/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.wellcomecollection.org/incident-retros/2022-09-05_reporting-cluster-downtime-and-configuration-loss.md).

# Incident retro - reporting cluster downtime and configuration loss

**Incident from:** 2022-09-05

**Incident until:** 2022-09-08

**Retro held:** 2022-09-12

* [Timeline](#timeline)
* [Analysis of causes](#analysis-of-causes)
* [Actions](#actions)

## Timeline

See <https://wellcome.slack.com/archives/C3TQSF63C/p1611911159323900> and #wc-reporting-cluster-reconfiguration channel

25 August 2022

11.07: Started a rolling upgrade to 8.4.0 on the reporting cluster

13.04: Upgrade failed in unclear “stuck” state - everything still working fine

26th August

14.41: Opened case with Elastic to resolve stuck state

29th August

08.49: Automated maintenance (presumably to resolve state mismatch) failed, but with no adverse effects.

14.28: Advised by Elastic to delete some non-migrated indices

31st August

15.48: Index migration/removal completed, upgrade still failing and Elastic informed.

4th September

04.05: Another automated system maintenance event - moving nodes around. We think this is the event which started the downtime

“Move nodes off of allocator i-0900f7512ca10119c due to routine system maintenance”

5 September

08.20: Another failed automated maintenance event

09.12: High severity ticket raised with Elastic.

09.24: Elastic start manual recreation of cluster

10.27: Cluster recreation successful.

10.49: Kibana node upgrade started by JP (required manual changes in config)

11.08: Kibana upgrade complete

12.38: JP notifies Elastic that while data indices are present, Kibana saved objects have been lost

14.35: Elastic respond, noting short snapshot retention and unassigned indices being lost.

15:23: Elastic confirm configuration loss.

15.56 JP set up #wc-reporting-cluster-reconfiguration and begins reprovisioning application credentials.

17.13 JP I think all application credentials/roles are now reprovisioned

## Analysis of causes

Snapshot policy for the reporting cluster was set to one hour. This has since been changed to 30 days.

The reporting cluster started as a side project/prototype and should have been checked when we came to rely on it.

## Actions

**Jamie**

* Increase reporting cluster snapshot policy to 30 days - done

**For planning**

* Move reporting cluster config into terraform


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://docs.wellcomecollection.org/incident-retros/2022-09-05_reporting-cluster-downtime-and-configuration-loss.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
