
Incident retro - wc.org intermittently available


Last updated 10 months ago

Incident from: 2023-08-10

Incident until: 2023-08-10

Retro held: 2023-08-11

Timeline

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1691655733784049

7 August 2023

Prismic model changes

UI change that included removing query for imageList model - merged to main 16.49

Remove imageList from the model - merged to main 16.53

? Model applied c16.55

Nothing published - needs something published for the model to apply. Web site broke after something was published on 8 August

8 August 2023

09.17 Report in #wc-platform-feedback “I'm getting a server error when I try to navigate to the homepage”

09.20 RC It seems to be back up now - is it for you?

MD Hmm, no

MD Ok, cleared my cache again. We're back

RC we'll still investigate, thanks for flagging!

Then the web site intermittently showed server errors

09.22 RC The website was down for a bit, as we can see in the alerts channel, seems to be back up now? I'm going to say from 8:48. Still looks down on my 4G though, but up on my wifi, different servers issue? Edit: Looks fine on my 4G in incognito, so maybe just cache I couldn't see anything on

09.25 AG There's still an error even though the page loads. Not sure if it's one of these that just exist: "Hydration failed because the initial UI does not match what was rendered on the server."

RC That's a React error, I'm thinking more of a warning?

AG Down again. Same: "A client-side exception has occurred"

09.33 AC "Cannot find slice of type imageList"

RC I deleted that yesterday, they were all gone. Let me deploy to prod

09.34 AC fourth line of the application logs, have we changed something here?

RC Then we know the fix. I'll deploy the latest changes to production. We made a lot of changes to the Prismic model yesterday, that one's on me though

09.36 RC https://buildkite.com/wellcomecollection/wc-dot-org-deployment/builds/2646 Should be < 10mins

09.40 RC End to ends are running but prod has been deployed so it should be fixed now

09.48 NP I've just got the error again.

RC I can't see that error in our logs since 9:38

09.49 RC Maybe I'll try a lil cache clear

09.51 RC Right, cache cleared, and still no logs since 9:38 that were related to that problem

09.54 RC We are still getting errors in the alerts channel though but I can't understand why, just looks like login logs.

09.55 RC It's the only thing I can see /account/api/auth/login?returnTo=[redacted] and it's not even an error, just a log

09.56 AC more likely the log link is funky

09.58 AC so the list of failing errors comes from the CloudFront logs and then it makes a best-guess attempt at application logs

Analysis of causes

  • Prismic model changed (imageList slice removed) before the matching UI change was deployed to prod; the production site was still querying for a field that no longer existed (slice of type imageList)

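  • CloudFront errors confusing? The alert's log link only makes a best-guess attempt at finding the application logs, and it pointed at unrelated login logs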

Actions

Paul/Alex

  • Widen time window of Kibana log link by adding an hour either side
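
The widening itself is simple date arithmetic. A minimal sketch, assuming the alert has the start and end of the error period as ISO 8601 strings; `widenWindow`, `TimeWindow` and the parameter names are hypothetical and not taken from the alerting code, and the timestamps are illustrative only:

```typescript
// Hypothetical helper: names and shapes are illustrative assumptions,
// not the real alerting code.
type TimeWindow = { from: string; to: string };

const HOUR_MS = 60 * 60 * 1000;

// Returns a window that starts `paddingMs` before `start` and ends
// `paddingMs` after `end`, both as ISO 8601 UTC strings.
function widenWindow(
  start: string,
  end: string,
  paddingMs: number = HOUR_MS
): TimeWindow {
  const startMs = Date.parse(start);
  const endMs = Date.parse(end);
  if (Number.isNaN(startMs) || Number.isNaN(endMs)) {
    throw new Error(`Invalid timestamp: ${start} / ${end}`);
  }
  return {
    from: new Date(startMs - paddingMs).toISOString(),
    to: new Date(endMs + paddingMs).toISOString(),
  };
}

// Illustrative timestamps only.
const widened = widenWindow("2023-08-10T08:48:00Z", "2023-08-10T09:38:00Z");
console.log(widened.from, widened.to);
```

The resulting `from` and `to` values would then replace the narrower range currently used when the log link is built.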

Raphaëlle

  • Create a Prismic model change log page in Prismic. Publishing it with a change log entry provides the publish that's needed for a model change to apply

  • Modify tool to automatically update the Prismic model change log page

  • Add a check to the script: if your change contains queries, is it in production?

  • Investigate using fetch links only and removing graph queries
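
The "is it in production?" check could be sketched as a comparison between the slice types a model change removes and the query text currently deployed. This is only a sketch under stated assumptions: `findStaleQueryReferences`, the `DeployedQuery` shape, and the idea that production queries are available to the script as plain strings are all hypothetical, and the query text in the example is made up rather than copied from the site.

```typescript
// Hypothetical shape: the real tool may store queries differently.
type DeployedQuery = { name: string; text: string };

// For each removed slice type, returns the names of deployed queries
// whose text still mentions it as a whole word.
function findStaleQueryReferences(
  removedSliceTypes: string[],
  deployedQueries: DeployedQuery[]
): Map<string, string[]> {
  const stale = new Map<string, string[]>();
  for (const sliceType of removedSliceTypes) {
    // Escape regex metacharacters so the slice type is matched literally.
    const escaped = sliceType.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const pattern = new RegExp(`\\b${escaped}\\b`);
    const hits = deployedQueries
      .filter((q) => pattern.test(q.text))
      .map((q) => q.name);
    if (hits.length > 0) {
      stale.set(sliceType, hits);
    }
  }
  return stale;
}

// Made-up example data: one query still mentions the removed slice type.
const staleRefs = findStaleQueryReferences(
  ["imageList"],
  [
    { name: "homepage", text: "{ page { body { ...on imageList { items } } } }" },
    { name: "stories", text: "{ article { title } }" },
  ]
);
console.log(staleRefs.get("imageList"));
```

A non-empty result would mean the model change should wait until the code that drops those queries has been deployed.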
