Incident retro - cross-cluster replication

Incident from: 2021-09-28

Incident until: 2021-09-28

Retro held: 2021-09-28

Timeline

28 September 2021

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1632815554058100

08.33 AC (in #wc-platform): I’ve set up CCR between the clusters, waiting for that to happen [Started via Kibana, not a script.]
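
For context, the Dev Tools equivalent of the Kibana setup is the CCR follow API. A minimal sketch, where the remote cluster alias is a placeholder (the actual alias isn’t recorded here) and the follower is one of the four new indices:

```
# Create a follower index on the catalogue-api cluster that replicates
# a leader index from the remote (pipeline) cluster.
# <remote-cluster-alias> is a placeholder for the configured remote.
PUT /images-indexed-2021-09-27/_ccr/follow
{
  "remote_cluster": "<remote-cluster-alias>",
  "leader_index": "images-indexed-2021-09-27"
}
```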

08.52 Updown alerts: Works API Search, Works API Single Work

AC: Okay, so it looks like the catalogue-api cluster is unhappy. Presumably because we’ve just tried to set up CCR for four new indices. Works search will be broken. [Broken search not seen during this by NP] Only one of the three nodes is borked, so try again and you’ll likely hit one of the other nodes.

08.53 Recovery: Works API Search, Works API Single Work

AC: I think Updown hit a node that’s up

08.56 AC: I’ve just deleted six indexes from July which we’re no longer using, to alleviate pressure on the cluster
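
Deleting old indices from Dev Tools is a plain index delete; a sketch, with a placeholder name since the July index names aren’t recorded here:

```
# List indices with size and creation date to spot old, unused ones
GET /_cat/indices?v&h=index,store.size,creation.date.string&s=creation.date

# Delete an index we no longer need (<old-index-name> is a placeholder)
DELETE /<old-index-name>
```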

08.59 AC: I think we have to wait for the cross-cluster replication of the new images index to complete

09.04 AC: My best guess is that trying to configure CCR for the new indexes used too much memory and started knocking nodes over. Last I checked, three of the four indexes had completed their initial replication.
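
Replication progress can be checked from Dev Tools with the CCR stats API; a sketch, using the images index named later in the timeline:

```
# Cluster-wide follower stats, including any fatal read exceptions
GET /_ccr/stats

# Shard-level stats for a single follower index
GET /images-indexed-2021-09-27/_ccr/stats
```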

09.07 JG: We’ve lost our master node, meaning we might have lost data; do we have a plan to get it back up? Have we paused CCR? AC couldn’t get into Kibana to check the fourth index. JG: Elasticsearch is down.

09.08 JG: Biggy is getting Elasticsearch up and running again. AC: I’ve already deleted a handful of old indexes we aren’t using.

09.09 AC: To my eye it looks like we’re bouncing. I see an instance with high JVM memory pressure, then I reload and it’s gone, then I reload and it’s back to high pressure [followed by normal pressure].
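
The per-node heap pressure seen in the Elastic Cloud console can also be checked from Dev Tools; a sketch:

```
# Heap and RAM usage per node; a heap.percent that climbs and drops
# matches the bouncing JVM memory pressure described above
GET /_cat/nodes?v&h=name,node.role,master,heap.percent,ram.percent
```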

09.10 JG suggests killing a CCR. AC: possible, but can’t get into that part of Kibana; can get into dev tools, but the CCR UI is in Stack Management, and AC doesn’t know how to use the CCR API.
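
The CCR API equivalent of “killing” a follower, had it been usable from Dev Tools at the time, is pause follow, which stops the follower pulling from its leader without deleting anything. A sketch for the index named later:

```
# Stop this follower index fetching new operations from its leader;
# it can be resumed later with POST /<index>/_ccr/resume_follow
POST /images-indexed-2021-09-27/_ccr/pause_follow
```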

09.13 AC deleting a few more unwanted indices. Unable to issue index management commands through dev tools, presumably because there’s no master.
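
Whether the cluster currently has an elected master can be confirmed from Dev Tools, when a node responds at all; a sketch:

```
# Overall cluster status (green/yellow/red)
GET /_cluster/health

# Which node, if any, is currently the elected master
GET /_cat/master?v
```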

09.14 Updown alert: Front End Works Search (Origin)

09.15 Recovery: Front End Works Search (Origin)

09.16 Got master back [intermittently]

09.17 JG couldn’t disable [Elasticsearch] snapshots as all APIs are down. AC: lost the master node again.

09.18 Updown alert: Images API Search

AC/JG discussing:

  • Restart ES and Elastic Cloud - might be risky

  • Spin up another catalogue-api cluster to start serving requests

09.20 Agreed to temporarily increase memory on the cluster -> this would trigger a rolling restart, which gives more headroom

09.21 Couldn’t restart by applying an Elasticsearch change; chose not to report to Elastic support

09.25 AC: so I have another idea for a short-term fix and possibly a long-term fix: the snapshot generator already bypasses the API cluster and reads from the relevant pipeline cluster. What if we configure the API to do the same, and bypass the unhealthy cluster? (I am cheating here by suggesting the code solution I think we want long-term.) But we can deploy a new API as fast as Buildkite will let us.

09.25 AC: I’ve managed to get into the CCR console and unfollow the images-indexed-2021-09-27 index, which is the one index that hadn’t completed initial replication.
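
For the record, fully converting a follower back into a regular index is a four-step sequence in the CCR API (pause, close, unfollow, reopen); a sketch for this index:

```
# 1. Stop replication from the leader
POST /images-indexed-2021-09-27/_ccr/pause_follow

# 2. Close the index so the follower metadata can be removed
POST /images-indexed-2021-09-27/_close

# 3. Convert the follower into a regular index
POST /images-indexed-2021-09-27/_ccr/unfollow

# 4. Reopen it for searching and indexing
POST /images-indexed-2021-09-27/_open
```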

09.28 Looked like it was working again, but AC reported the master was unhealthy

09.29 Recovery: Images API Search

AC: I’m unfollowing all the 2021-09-27 indexes. JG: I’d like to talk about it being the long-term solution, but for now it solves the short-term one. Assuming what we need to do is update the secrets and flip the services? JG tried a rolling restart.

09.30 AC: we are only doing CCR for the indexes we’re currently serving as prod; that blocks us from deploying anything new

09.34 JG: I have set the [Elasticsearch] snapshots not to run till Saturday while we debug this (note to set it back)
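
Assuming the scheduled snapshots are managed by snapshot lifecycle management (SLM), one way to defer them (the exact mechanism used here isn’t recorded) is to stop the SLM scheduler and restart it once debugging is done; a sketch:

```
# Stop the SLM scheduler so no new scheduled snapshots are taken
POST /_slm/stop

# Confirm it has stopped ("operation_mode": "STOPPED")
GET /_slm/status

# The "set it back" step: restart scheduled snapshots afterwards
POST /_slm/start
```

Alternatively, the schedule on the relevant snapshot policy could be pushed back to Saturday via PUT /_slm/policy/<policy-id>.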

I think the restart might have done it [recovery] but we are now in hypothetical land.

09.38 AC: we’re receiving updates to prod indices, and I don’t think we can roll forward to newer indices [because it might replicate the problem]

Analysis of causes

Set up CCR for four new indices. These are larger as they now include TEI.

Poor understanding of ES search and cluster characteristics, including CCR specifically

Actions

AC/AFC

  • Change how the API behaves, spin up a new cluster, and do the CCR into the new cluster

  • Make a small change to the API to tell it to read from the new cluster dynamically

JG

  • Plan for getting better profiling for ES to find where the bottlenecks are, to take to product
