Incident Retros
  • Incident retro - Internal model <=> Elastic index sync question
  • Incident retro - merging
  • Incident retro - Miro images
  • Incident retro - January downtime
  • Incident retro - Elastic Cloud
  • Incident retro - stories and home page down
  • Incident retro - search not available
  • Incident retro - ingestors
  • Incident retro - home page with json
  • Incident retro - slow search
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - cross cluster replication
  • Incident retro - 500'ing on the /images endpoint
  • Incident retro - home page and stories page not available
  • Incident retro - home page and what's on not available
  • Incident retro - requests not showing in account
  • Incident retro - requests not showing in account
  • Incident retro - works search errors
  • Incident retro - date picker
  • Incident retro - requests not showing in account
  • Incident retro - reporting cluster downtime and configuration loss
  • Incident retro - story page appearing then replaced by a 404
  • Incident retro - increased rate of errors in searches on wellcomecollection.org
  • Incident retro - slow search due to 900k messages on the ingestor queue
  • Incident retro - concept pages not available
  • Incident retro - Prismic model changes
  • Incident retro - Images search down
  • Incident retro - wc.org intermittently available
  • Incident retro - web site not available
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - digital assets not available
Powered by GitBook
On this page
  • Timeline
  • Analysis of causes
  • Actions

Incident retro - Elastic Cloud

PreviousIncident retro - January downtimeNextIncident retro - stories and home page down

Last updated 10 months ago

Incident from: 2021-02-24

Incident till: 2021-02-24

Retro held: 2021-02-25

Timeline

24 February January 2021

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1614165343000300

11.13 First alerts from updown in wc-platform / wc-experience Front End Single Work Page Images API Search Front End Works Search Images API Single Image Works API Search Works API Single Work Wellcome Images Redirect https://wellcome.slack.com/archives/C3TQSF63C/p1614165226063400

11.15 JP suggested moving to #wc-incident-response 11.15 AC: logging.wellcomecollection.org isn’t loading; AC: The fact that logging.wc.org is busted makes me think this might be an Elastic Cloud issue

11.15 JP: nothing here yet https://status.elastic.co/

11.16 Is something broken in the catalogue and hammering the logging cluster?

11.17 Doesn’t look like anything was deployed 11.17 JP restarting one of the catalogue API tasks to see if that helps 11.17 AC: all our clusters are unhealthy

11.18 JP confirmed the issue was search and work pages are down

11.19 RK suggests an elastic cloud issue

11.20 JP: nothing on their status page yet 11.20 AC manually turned staging API down to 0 tasks 11.20 Cloud console couldn’t give any useful metrics

11.22 JG emailed Laura, elastic account manager

11.25 Showing recovery; recovery messages from updown in wc-platform / wc-experience

11.29 From Laura (for future) Please send a support ticket is support@elastic.co as a severity-1 and they will look into it for you.

11.33 JG sent email and RK raised support ticket in the console

11.28 11.28 The other thing I had a concern about was a security issue (someone with access to the cloud console, with malicious intent) - but I think we can rule that out.cause I can still sign in!

11.59 RK expected to see emails to wellcomedigitalplatform@wellcome.ac.uk - which did eventually appear in RK’s spam

12.23 https://cloud-status.elastic.co/incidents/hnm8zyj41mmr - Elastic say they’ve recovered from whatever it was

Analysis of causes

Some problem at Elastic with a proxy layer in a single region

Actions

Robert Turn on 2FA for Elastic as root password is quite widely available? Robert to look at how to use SSO for logging into Elastic

Alex Investigate possibility of caching API responses for individual works

Have snapshots of ES clusters? Not needed as Elastic has been reliable enough so far.

Natalie to cc digital@wellcomecollection.org on emails letting other staff know what’s happening / tell wc-incident-response

Timeline
Analysis of causes
Actions