Incident retro - Elastic Cloud

Incident from: 2021-02-24

Incident till: 2021-02-24

Retro held: 2021-02-25

Timeline

24 February January 2021

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1614165343000300

11.13 First alerts from updown in wc-platform / wc-experience Front End Single Work Page Images API Search Front End Works Search Images API Single Image Works API Search Works API Single Work Wellcome Images Redirect https://wellcome.slack.com/archives/C3TQSF63C/p1614165226063400

11.15 JP suggested moving to #wc-incident-response 11.15 AC: logging.wellcomecollection.org isn’t loading; AC: The fact that logging.wc.org is busted makes me think this might be an Elastic Cloud issue

11.15 JP: nothing here yet https://status.elastic.co/

11.16 Is something broken in the catalogue and hammering the logging cluster?

11.17 Doesn’t look like anything was deployed 11.17 JP restarting one of the catalogue API tasks to see if that helps 11.17 AC: all our clusters are unhealthy

11.18 JP confirmed the issue was search and work pages are down

11.19 RK suggests an elastic cloud issue

11.20 JP: nothing on their status page yet 11.20 AC manually turned staging API down to 0 tasks 11.20 Cloud console couldn’t give any useful metrics

11.22 JG emailed Laura, elastic account manager

11.25 Showing recovery; recovery messages from updown in wc-platform / wc-experience

11.29 From Laura (for future) Please send a support ticket is [email protected] as a severity-1 and they will look into it for you.

11.33 JG sent email and RK raised support ticket in the console

11.28 11.28 The other thing I had a concern about was a security issue (someone with access to the cloud console, with malicious intent) - but I think we can rule that out.cause I can still sign in!

11.59 RK expected to see emails to [email protected] - which did eventually appear in RK’s spam

12.23 https://cloud-status.elastic.co/incidents/hnm8zyj41mmr - Elastic say they’ve recovered from whatever it was

Analysis of causes

Some problem at Elastic with a proxy layer in a single region

Actions

Robert Turn on 2FA for Elastic as root password is quite widely available? Robert to look at how to use SSO for logging into Elastic

Alex Investigate possibility of caching API responses for individual works

Have snapshots of ES clusters? Not needed as Elastic has been reliable enough so far.

Natalie to cc [email protected] on emails letting other staff know what’s happening / tell wc-incident-response

Last updated