Incident retro - wc.org intermittently available

Incident from: 2023-08-10

Incident until: 2023-08-10

Retro held: 2023-08-11

Timeline

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1691655733784049

7 August 2023

Prismic model changes

UI change that included removing query for imageList model - merged to main 16.49

Remove imageList from the model - merged to main 16.53

? Model applied c16.55

Nothing published - needs something published for the model to apply.

Web site broke after something was published on 8 August

8 August 2023

09.17 Report in #wc-platform-feedback “I'm getting a server error when I try to navigate to the homepage”

09.20 RC It seems to be back up now - is it for you?

MD Hmm, no

MD Ok, cleared my cache again. We're back

RC we'll still investigate, thanks for flagging!

Then the web site intermittently showed server errors

09.22 RC The website was down for a bit, as we can see in the alerts channel, seems to be back up now? I'm going to say from 8:48. Still looks down on my 4G though, but up on my wifi, different servers issue? Edit: Looks fine on my 4G in incognito, so maybe just cache

I couldn't see anything on Prismic status

09.25 AG There's still an error even though the page loads. Not sure if it's one of these that just exist: "Hydration failed because the initial UI does not match what was rendered on the server."

RC That's a React error, I'm thinking more of a warning?

AG Down again
Same
A client-side exception has occurred

09.33 AC Cannot find slice of type imageList

RC I deleted that yesterday, they were all gone. Let me deploy to prod

09.34 AC fourth line of the application logs, have we changed something here

RC Then we know the fix. I'll deploy the latest changes to production. We made a lot of changes to the Prismic model yesterday, that one's on me though

09.36 RC https://buildkite.com/wellcomecollection/wc-dot-org-deployment/builds/2646 Should be < 10mins

09.40 RC End to ends are running but prod has been deployed so it should be fixed now

09.48 NP I've just got the error again.

RC I can't see that error in our logs since 9:38

09.49 RC Maybe I'll try a lil cache clear

09.51 RC Right cache cleared, and still no logs since 9:38 that were related to that problem

09.54 RC We are still getting errors in the alerts channel though but I can't understand why, just looks like login logs.

09.55 RC It's the only thing I can see /account/api/auth/login?returnTo=[redacted] and it's not even an error, just a log

09.56 AC more likely the log link is funky

09.58 AC so the list of failing errors comes from the CloudFront logs and then it makes a best-guess attempt at application logs

Analysis of causes

  • Prismic model changed but the matching UI change was not pushed to prod; prod was still querying something that no longer existed in the model (slice of type imageList)

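  • CloudFront errors confusing? The list of failing requests comes from the CloudFront logs and the linked application logs are only a best guess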

Actions

Paul/Alex

  • Widen time window of Kibana log link by adding an hour either side
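
The widening itself is simple date arithmetic. A minimal sketch, assuming the alert tool has a start and end timestamp for the failing requests; the function name, the TimeWindow shape and the way the result is fed into the Kibana link are all hypothetical and not taken from the real tooling:

```typescript
// Hypothetical helper: names and shapes are illustrative, not the real alert code.
interface TimeWindow {
  from: string; // ISO 8601 timestamp (UTC)
  to: string; // ISO 8601 timestamp (UTC)
}

const HOUR_MS = 60 * 60 * 1000;

// Pad the window by `paddingMs` on each side (one hour by default, per the action above).
function widenWindow(
  start: Date,
  end: Date,
  paddingMs: number = HOUR_MS
): TimeWindow {
  return {
    from: new Date(start.getTime() - paddingMs).toISOString(),
    to: new Date(end.getTime() + paddingMs).toISOString(),
  };
}
```

With an illustrative start and end of 2023-08-08T08:48:00Z, this gives a window from 07:48 to 09:48 UTC, so application logs slightly outside the CloudFront error window would still appear behind the link.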

Raphaëlle

  • Create a model change log page in Prismic; publishing it with each change log entry provides the publish that's needed when you change the model

  • Modify tool to automatically update the Prismic model change log page

  • Add a prompt to the script: if your change contains queries, is it in production?

  • Investigate using fetch links only and removing graph queries
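
The "is it in production?" check could be sketched as a comparison between what a model change removes and what production still queries. This is only an illustration: the function name is hypothetical, the 'text' slice type is made up for the example, and how the list of slice types queried in production would be obtained is left open.

```typescript
// Hypothetical pre-flight check; none of these names come from the real script.
// Returns the slice types that are being removed from the model but are
// still queried by the code currently deployed to production.
function sliceTypesStillQueried(
  removedFromModel: string[],
  queriedInProd: string[]
): string[] {
  const queried = new Set(queriedInProd);
  return removedFromModel.filter(type => queried.has(type));
}

// Example: imageList is removed while prod still queries it ('text' is a made-up slice type).
const blockers = sliceTypesStillQueried(['imageList'], ['imageList', 'text']);
if (blockers.length > 0) {
  console.warn(
    `Still queried in production: ${blockers.join(', ')}. Deploy the query change first.`
  );
}
```

In this incident such a check would have flagged imageList, since the query removal was merged to main but not yet deployed when the model change was applied.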
