Incident retro - wc.org intermittently available
Incident from: 2023-08-10
Incident until: 2023-08-10
Retro held: 2023-08-11
Timeline
See https://wellcome.slack.com/archives/C01FBFSDLUA/p1691655733784049
7 August 2023
Prismic model changes
UI change that included removing the query for the imageList model - merged to main 16.49
Remove imageList from the model - merged to main 16.53
? Model applied c16.55
Nothing published - needs something published for the model to apply.
Web site broke after something was published on 8 August
8 August 2023
09.17 Report in #wc-platform-feedback “I'm getting a server error when I try to navigate to the homepage”
09.20 RC It seems to be back up now - is it for you?
MD Hmm, no
MD Ok, cleared my cache again. We're back
RC we'll still investigate, thanks for flagging!
Then the web site intermittently showed server errors
09.22 RC The website was down for a bit, as we can see in the alerts channel, seems to be back up now? I'm going to say from 8:48. Still looks down on my 4G though, but up on my wifi, different servers issue? Edit: Looks fine on my 4G in incognito, so maybe just cache.
RC I couldn't see anything on Prismic status
09.25 AG There's still an error even though the page loads. Not sure if it's one of these that just exist: "Hydration failed because the initial UI does not match what was rendered on the server."
RC That's a React error, I'm thinking more of a warning?
AG Down again Same A client-side exception has occurred
09.33 AC Cannot find slice of type imageList
RC I deleted that yesterday, they were all gone. Let me deploy to prod
09.34 AC fourth line of the application logs, have we changed something here
RC Then we know the fix. I'll deploy the latest changes to production. We made a lot of changes to the prismic model yesterday, that one's on me though
09.36 RC https://buildkite.com/wellcomecollection/wc-dot-org-deployment/builds/2646 Should be < 10mins
09.40 RC End to ends are running but prod has been deployed so it should be fixed now
09.48 NP I've just got the error again. RC I can't see that error in our logs since 9:38
09.49 RC Maybe I'll try a lil cache clear
09.51 RC Right cache cleared, and still no logs since 9:38 that were related to that problem
09.54 RC We are still getting errors in the alerts channel though but I can't understand why, just looks like login logs.
09.55 RC It's the only thing I can see /account/api/auth/login?returnTo=[redacted] and it's not even an error, just a log
09.56 AC more likely the log link is funky
09.58 AC so the list of failing errors comes from the CloudFront logs and then it makes a best-guess attempt at application logs
Analysis of causes
Prismic model changed (imageList removed) but the matching UI change was only merged to main, not deployed to prod; prod was still querying for a field that no longer existed (slice of type imageList)
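CloudFront errors confusing? The alert's list of failing requests comes from the CloudFront logs and the linked application logs are only a best guess, so the link showed unrelated login logs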
Actions
Paul/Alex
Widen time window of Kibana log link by adding an hour either side
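A minimal sketch of the widening step, assuming the alert knows the start and end of the failing period as dates. The names TimeRange, widenRange and kibanaTimeParam are hypothetical and not taken from the alerting code; the `_g=(time:(from:'…',to:'…'))` shape is how Kibana usually encodes the time range in its URLs, but the real link format should be checked against the existing alert. The example times are illustrative only.

```typescript
// Hypothetical types and helpers; not taken from the real alerting code.
interface TimeRange {
  from: Date;
  to: Date;
}

const ONE_HOUR_MS = 60 * 60 * 1000;

// Move the start earlier and the end later by the same margin (default: one hour).
function widenRange(range: TimeRange, marginMs: number = ONE_HOUR_MS): TimeRange {
  return {
    from: new Date(range.from.getTime() - marginMs),
    to: new Date(range.to.getTime() + marginMs),
  };
}

// Assumed Kibana URL fragment: the time range lives in the `_g` parameter.
function kibanaTimeParam(range: TimeRange): string {
  const from = range.from.toISOString();
  const to = range.to.toISOString();
  return `_g=(time:(from:'${from}',to:'${to}'))`;
}

// Illustrative values only.
const alertRange: TimeRange = {
  from: new Date("2023-08-08T08:48:00Z"),
  to: new Date("2023-08-08T09:38:00Z"),
};
console.log(kibanaTimeParam(widenRange(alertRange)));
```

With a wider window, application logs written shortly before or after the CloudFront errors still show up when the link is followed.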
Raphaëlle
Create a Prismic model change log page in Prismic; publishing it with a change log entry is the publish that's needed when you change the model
Modify tool to automatically update the Prismic model change log page
Add a check to the script: if your change contains queries, is it in production?
Investigate using fetch links only and removing graph queries
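The "is it in production?" check from the actions above could look roughly like this. It is a sketch under assumptions: sliceTypesStillQueried is a hypothetical name, and how the script would obtain the query strings that are currently deployed to production is not shown. The example query is illustrative, not copied from the site.

```typescript
// Hypothetical pre-flight check before applying a Prismic model change.
// Escape regex metacharacters so a slice type is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the removed slice types that production queries still mention.
function sliceTypesStillQueried(
  removedSliceTypes: string[],
  productionQueries: string[]
): string[] {
  return removedSliceTypes.filter((sliceType) => {
    const wholeWord = new RegExp(`\\b${escapeRegExp(sliceType)}\\b`);
    return productionQueries.some((query) => wholeWord.test(query));
  });
}

// Illustrative query string; the real queries would come from the deployed build.
const exampleProdQueries = ["{ pages { body { ...on imageList { non-repeat { title } } } } }"];
const blocked = sliceTypesStillQueried(["imageList"], exampleProdQueries);
if (blocked.length > 0) {
  console.log(`Still queried in production: ${blocked.join(", ")}`);
}
```

If the list is non-empty, the UI change that removes those queries should be deployed to production before the model change is applied.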