Incident retro - Prismic model changes
Incident from: 2023-06-12
Incident until: 2023-06-12
Retro held: 2023-06-14
Timeline
See https://wellcome.slack.com/archives/C01FBFSDLUA/p1686565392141259
12 June 2023
11.18 RC deployed Prismic model
11.21 Someone on Content team published new content
11.23 Updown alerts for Homepage (origin), Homepage (cached), Stories (origin), Stories (cached)
JP Home page down: Cannot find slice of type soundcloudEmbed; Cannot find slice of type vimeoVideoEmbed; Cannot find slice of type instagramEmbed; Cannot find slice of type twitterEmbed; Cannot find slice of type youtubeVideoEmbed
11.23 RC I've just changed the model to remove those. I wasn't sure which order was required
11.24 AC “WARNING: If you are removing fields from a custom type, you must remove any queries for those fields from the content app and deploy the changes to the content app first, before deploying the changes to Prismic.” https://github.com/wellcomecollection/wellcomecollection.org/tree/main/prismic-model
We’ve been bitten by this before, I should have remembered when I reviewed the PR
RC It's deploying as we speak, I'll have it go to prod asap
[Rolling forward; est 13-15 mins until back up]
11.26 AC Note: you can fix immediately by checking out a version of prismic-model prior to your change and deploying the custom types to put those queried fields back, so the currently-executing queries will start working again
RC yeah it's all of them though. or can you update them all in one go?
AC I don’t think so. It’s a bit fiddly, but I think worth doing?
11.30 Still returning 500s although RC thought it should be back up; a new publish of content in Prismic was needed (see 11.34)
11.34 AC I’ve published a change in Prismic (fixing a lint error) and now prod is back up
11.35 RC triggering prod deploy now
11.35 Updown "up" alerts: Stories (cached), Homepage (cached), Homepage (origin), Stories (origin) restored
11.42 AC so now prod is deployed, I think you should be safe to re-deploy the custom Prismic types
11.43 RC yeah, was just waiting for e2es just in case, will deploy asap
11.46 RC re-ran the type deploys and republished a random page in Prismic, still up
Analysis of causes
We had an outage that started at about 11:21 on 12 June, caused jointly by [1] deploying some changes to a custom model at 11:18 and [2] somebody publishing a change in Prismic at 11:21 (when the site went down).
We identified the issue quickly in the logs, rolled back the changes to the Prismic model, and published another change at 11:34 to bring the site back up.
Once we’d rolled forward the front-end apps, we redeployed the changes to the Prismic model.
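Root cause: unsafe deployment of changes to the model. Slice types that the content app was still querying were removed from the Prismic model before the app change that stopped querying them had reached prod.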
Actions
Alex & Raphaëlle
Add a warning to the deploy tool when fields are being deleted, including a prompt to publish
Change the deploy tool to update all types in one go
Raphaëlle & David
Improve the README documentation on how to remove fields
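The deploy-tool warning in the first action could look roughly like the sketch below. It is written in Python for brevity and is not the existing deploy tool. The function names (field_paths, removed_fields, confirm_deploy) are hypothetical, and the input shape is a simplified assumption loosely modelled on Prismic custom type JSON (tabs under "json", slice choices under "config.choices"). It compares the currently deployed model with the local one and refuses to proceed unless the operator explicitly confirms any removals.

```python
from __future__ import annotations

from typing import Any


def field_paths(custom_type: dict[str, Any]) -> set[str]:
    """Collect 'Tab.field' paths, plus 'Tab.field.slice' for slice choices.

    Assumes a simplified custom type shape: {"json": {tab: {field: {...}}}}.
    """
    paths: set[str] = set()
    for tab_name, fields in custom_type.get("json", {}).items():
        for field_name, field in fields.items():
            paths.add(f"{tab_name}.{field_name}")
            if field.get("type") == "Slices":
                choices = field.get("config", {}).get("choices", {})
                for slice_name in choices:
                    paths.add(f"{tab_name}.{field_name}.{slice_name}")
    return paths


def removed_fields(deployed: dict[str, Any], local: dict[str, Any]) -> list[str]:
    """Return paths present in the deployed model but missing from the local one."""
    return sorted(field_paths(deployed) - field_paths(local))


def confirm_deploy(removed: list[str], answer: str) -> bool:
    """Allow the deploy if nothing is removed, or if the operator typed 'yes'."""
    if not removed:
        return True
    return answer.strip().lower() == "yes"


# Illustrative data only: a deployed type with two slice choices, and a
# local version that drops one of them.
deployed = {
    "id": "articles",
    "json": {
        "Main": {
            "title": {"type": "StructuredText"},
            "body": {
                "type": "Slices",
                "config": {"choices": {"text": {}, "twitterEmbed": {}}},
            },
        }
    },
}
local = {
    "id": "articles",
    "json": {
        "Main": {
            "title": {"type": "StructuredText"},
            "body": {"type": "Slices", "config": {"choices": {"text": {}}}},
        }
    },
}

removed = removed_fields(deployed, local)
print(removed)  # ['Main.body.twitterEmbed']
```

In a real tool the answer would come from an interactive prompt, and the warning text could repeat the README advice: deploy the content app change that stops querying these fields first, then deploy the model.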