Incident retro - Elastic Cloud
Incident from: 2021-02-24
Incident till: 2021-02-24
Retro held: 2021-02-25
Timeline
24 February January 2021
See https://wellcome.slack.com/archives/C01FBFSDLUA/p1614165343000300
11.13 First alerts from updown in wc-platform / wc-experience Front End Single Work Page Images API Search Front End Works Search Images API Single Image Works API Search Works API Single Work Wellcome Images Redirect https://wellcome.slack.com/archives/C3TQSF63C/p1614165226063400
11.15 JP suggested moving to #wc-incident-response 11.15 AC: logging.wellcomecollection.org isn’t loading; AC: The fact that logging.wc.org is busted makes me think this might be an Elastic Cloud issue
11.15 JP: nothing here yet https://status.elastic.co/
11.16 Is something broken in the catalogue and hammering the logging cluster?
11.17 Doesn’t look like anything was deployed 11.17 JP restarting one of the catalogue API tasks to see if that helps 11.17 AC: all our clusters are unhealthy
11.18 JP confirmed the issue was search and work pages are down
11.19 RK suggests an elastic cloud issue
11.20 JP: nothing on their status page yet 11.20 AC manually turned staging API down to 0 tasks 11.20 Cloud console couldn’t give any useful metrics
11.22 JG emailed Laura, elastic account manager
11.25 Showing recovery; recovery messages from updown in wc-platform / wc-experience
11.29 From Laura (for future) Please send a support ticket is support@elastic.co as a severity-1 and they will look into it for you.
11.33 JG sent email and RK raised support ticket in the console
11.28 11.28 The other thing I had a concern about was a security issue (someone with access to the cloud console, with malicious intent) - but I think we can rule that out.cause I can still sign in!
11.59 RK expected to see emails to wellcomedigitalplatform@wellcome.ac.uk - which did eventually appear in RK’s spam
12.23 https://cloud-status.elastic.co/incidents/hnm8zyj41mmr - Elastic say they’ve recovered from whatever it was
Analysis of causes
Some problem at Elastic with a proxy layer in a single region
Actions
Robert Turn on 2FA for Elastic as root password is quite widely available? Robert to look at how to use SSO for logging into Elastic
Alex Investigate possibility of caching API responses for individual works
Have snapshots of ES clusters? Not needed as Elastic has been reliable enough so far.
Natalie to cc digital@wellcomecollection.org on emails letting other staff know what’s happening / tell wc-incident-response
Last updated