
Incident retro - home page and what's on not available


Last updated 10 months ago

Incident from: 2022-02-24

Incident until: 2022-02-24

Retro held: 2022-03-01

Timeline

23 February 2022

PR merged: Remove place from event (Prismic model and app types)

The model change happened at about 17:00 on the 23rd Feb

24 February 2022

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1645692506693739

08.47 Alert down for:

  • Experience: Content: Homepage (cached)

  • Experience: Content: Homepage (origin)

  • Experience: Content: What’s on (cached)

  • Experience: Content: What’s on (origin)

08.48 AC I reckon this is probably @dmc’s change to the Prismic model: the deployed code is still looking for “place”, but it can’t find it because we’ve removed it. And deploying the front-end is blocked because it’s broken in a different way. I believe this fix will get the deployment unblocked: https://github.com/wellcomecollection/wellcomecollection.org/pull/7704

08.50 AC I’m going to merge #7704 by fiat and we can review it later, to unblock the deployment and get the homepage working again. Specifically: we check a set of URLs for 200'ing as part of the deployment process, which you can see here: https://buildkite.com/wellcomecollection/experience-deployment/builds/657#9cf35e09-cf96-467b-becf-2fdaf1125598 One of those URLs is 500'ing, which means we shouldn’t be allowed to deploy… but it’s not as bad as the prod front-end, which is totally broken.

08.52 Statuspage update created for home page and events not available

09.06 AC: still broken https://buildkite.com/wellcomecollection/experience-deployment/builds/658#7e887723-5ecf-4ac7-8a1b-8bba6add23c4

DMc but I’m wondering if putting place back into the Prismic model would make sense

09.10 AC depends how long it would take to propagate. I suspect the issue is that Prismic has only just propagated the model change you applied yesterday

09.25 PB Is it relevant that place is mentioned in this graphQuery?

09.26 DMc I think this is another possible issue https://github.com/wellcomecollection/wellcomecollection.org/pull/7705/files#diff-0d61a4c2769aa8f2991eec03d5ea6[…]ff8cd8a6b02a6def9ff904185f3befR221

I had thought that putting place back on the Prismic model would be fairly instant, but perhaps Alex was right about propagation. In any event, this change brings the site back up locally for me.

PB’s review of the above isn’t powerful enough

09.32 DMc couldn’t add PB to https://github.com/orgs/wellcomecollection/teams/js-ts-reviewers

09.37 DMc couldn’t force the PR or push to main

09.38 Then I suppose we must wait for @Gareth Estchild or for the interviews to finish

09.40 PB asked JT to add him to https://github.com/orgs/wellcomecollection/teams/js-ts-reviewers / done at 09.44

09.49 PB While we wait for that to go through - Is the underlying cause of this down to the delay in Prismic propagation? i.e. that this all seemed fine and smoke tested perfectly because Prismic was still using the old model, so the queries using the deleted fields completed successfully, but then once it was deployed, and Prismic updated, it all went a bit wrong

10.07 Recovery for:

  • Experience: Content: What’s on (origin)

  • Experience: Content: What’s on (cached)

  • Experience: Content: Homepage (cached)

  • Experience: Content: Homepage (origin)

10.11 Statuspage update: resolved
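
The deployment check described at 08.50 (a set of URLs must return 200 before the front-end can be deployed) amounts to a simple gate. A minimal sketch follows, assuming the status codes have already been collected; the `UrlCheck` type, the function names and the example paths are hypothetical and are not taken from the real Buildkite step.

```typescript
// Hypothetical sketch of an "every URL must return 200" deployment gate.
// None of these names come from the real deployment pipeline.

type UrlCheck = { url: string; status: number };

// Return the checks that did not respond with HTTP 200.
function failingChecks(checks: UrlCheck[]): UrlCheck[] {
  return checks.filter((check) => check.status !== 200);
}

// A deploy is allowed only when every checked URL returned 200.
function canDeploy(checks: UrlCheck[]): boolean {
  return failingChecks(checks).length === 0;
}

// Made-up example: one healthy page and one page returning a 500.
const exampleChecks: UrlCheck[] = [
  { url: "/", status: 200 },
  { url: "/whats-on", status: 500 },
];

// With the example above the gate is closed, because one URL is 500'ing.
const exampleAllowed: boolean = canDeploy(exampleChecks);
```

In this incident the gate blocked the fix itself, which is why #7704 was merged by fiat: a strict gate protects production, but it also needs an escape route when production is already in a worse state than the candidate build.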

Analysis of causes

Is the underlying cause of this down to the delay in Prismic propagation? i.e. that this all seemed fine and smoke tested perfectly because Prismic was still using the old model, so the queries using the deleted fields completed successfully, but then once it was deployed, and Prismic updated, it all went a bit wrong

Reading between the lines in the Prismic docs and an old Prismic support thread, and considering the timing of content being published and the site going down (see below), it looks like model updates are only reflected in the API response once a piece of content has been published.

N.B. The model change happened at about 17:00 on the 23rd Feb

We need to remember to publish something whenever we change the model.

A quicker fix in this type of scenario would therefore be to change the model back to its previous state and publish something, so that those changes show up in the API.
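
One way to tell whether anything has been published since a model change is to compare the master ref reported by the Prismic API (the `refs` array returned by the repository's `/api/v2` endpoint) before and after. The sketch below assumes that a publish moves the master ref; whether a model change only becomes visible at that point is the hypothesis above, not something confirmed by Prismic. The function names are illustrative, and fetching the API response is left out so the logic stays self-contained.

```typescript
// Only the parts of the Prismic /api/v2 response that this sketch reads.
type PrismicRef = { id: string; ref: string; label: string; isMasterRef?: boolean };
type PrismicApiResponse = { refs: PrismicRef[] };

// Return the current master ref, or undefined if the response has none.
function masterRef(api: PrismicApiResponse): string | undefined {
  const masters = api.refs.filter((r) => r.isMasterRef === true);
  return masters.length > 0 ? masters[0].ref : undefined;
}

// True when the master ref differs from the one recorded at the time of the
// model change, i.e. something has been published since then (assumption).
function publishedSince(refAtModelChange: string, api: PrismicApiResponse): boolean {
  const current = masterRef(api);
  return current !== undefined && current !== refAtModelChange;
}
```

Recording the master ref when the model is changed, and checking it again before declaring the change safe, would have shown that the deployed code had not yet been exercised against the new model.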

PB couldn’t approve DMc’s PR

DMc couldn’t add PB to https://github.com/orgs/wellcomecollection/teams/js-ts-reviewers

DMc couldn’t force the PR

Actions

Alex

  • All developers should be able to merge pull requests

  • Add dev permissions to onboarding/leaving checklist

  • Investigate splitting out the experience build

  • Make the monitoring lambdas vend a prefilled link to the logging cluster

David

Gareth

09.23 DMc PR:

Make the e2e tests go faster: If they were faster, we'd be able to recover from broken builds a lot faster.

Think about having a staging version of Prismic, and talk to other developers about what has been found out

Document how to re-add deleted fields in Prismic, and update a piece of content arbitrarily.

Add a message to the diff tools script about updating content after deleting a field from the model.

Make a ticket: Investigate all uses of graph query, and make sure we’re only using it where needed
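
The last two actions could be supported by small checks along the following lines. This is a sketch only: it assumes a custom type model is a JSON object of tabs, each mapping field names to field definitions, and it treats a graphQuery as plain text to search. The type, the function names and the wording of the reminder are all hypothetical and are not taken from the existing diff tools script.

```typescript
// Assumed model shape: { tabName: { fieldName: fieldDefinition } }.
type CustomTypeModel = { [tab: string]: { [field: string]: unknown } };

// Collect every field name across all tabs of a model.
function fieldNames(model: CustomTypeModel): string[] {
  const names: string[] = [];
  Object.keys(model).forEach((tab) => {
    Object.keys(model[tab]).forEach((field) => {
      names.push(field);
    });
  });
  return names;
}

// Fields present in the old model but missing from the new one.
function deletedFields(oldModel: CustomTypeModel, newModel: CustomTypeModel): string[] {
  const remaining = fieldNames(newModel);
  return fieldNames(oldModel).filter((field) => remaining.indexOf(field) === -1);
}

// True when the graphQuery text mentions the field as a whole word.
function graphQueryMentions(graphQuery: string, field: string): boolean {
  const escaped = field.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("\\b" + escaped + "\\b").test(graphQuery);
}

// Reminder to print after a model diff; undefined when nothing was deleted.
function deletionReminder(deleted: string[]): string | undefined {
  if (deleted.length === 0) {
    return undefined;
  }
  return (
    "Fields deleted from the model: " + deleted.join(", ") +
    ". Publish a piece of content and check that no deployed query still uses them."
  );
}
```

For the change in this incident, `deletedFields` would report `place`, and `graphQueryMentions` run against each graphQuery in the codebase would flag the one PB spotted at 09.25.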

Links:

  • #7705 Stop 500ing

  • #7706

  • development environment for Prismic

  • #7735

  • #7737

  • #7736

  • #7697