Incident Retros
  • Incident retro - Internal model <=> Elastic index sync question
  • Incident retro - merging
  • Incident retro - Miro images
  • Incident retro - January downtime
  • Incident retro - Elastic Cloud
  • Incident retro - stories and home page down
  • Incident retro - search not available
  • Incident retro - ingestors
  • Incident retro - home page with json
  • Incident retro - slow search
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - cross cluster replication
  • Incident retro - 500'ing on the /images endpoint
  • Incident retro - home page and stories page not available
  • Incident retro - home page and what's on not available
  • Incident retro - requests not showing in account
  • Incident retro - requests not showing in account
  • Incident retro - works search errors
  • Incident retro - date picker
  • Incident retro - requests not showing in account
  • Incident retro - reporting cluster downtime and configuration loss
  • Incident retro - story page appearing then replaced by a 404
  • Incident retro - increased rate of errors in searches on wellcomecollection.org
  • Incident retro - slow search due to 900k messages on the ingestor queue
  • Incident retro - concept pages not available
  • Incident retro - Prismic model changes
  • Incident retro - Images search down
  • Incident retro - wc.org intermittently available
  • Incident retro - web site not available
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - digital assets not available
Powered by GitBook
On this page
  • Timeline
  • Analysis of causes
  • Actions

Incident retro - home page and stories page not available

PreviousIncident retro - 500'ing on the /images endpointNextIncident retro - home page and what's on not available

Last updated 10 months ago

Incident from: 2021-12-06

Incident until: 2021-12-06

Retro held: 2021-12-07

Timeline

6 December 2021

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1638805397000300

15.42 updown alerts for Experience: Content: Stories (origin) Experience: Content: Stories (cached) Experience: Content: Homepage (cached) Experience: Content: Homepage (origin)

15.43 JP last deploy was 4 hours ago. Anyone terraforming?

15.44 JG Might be that we aren’t catering for certain prismic shapes, turns out there’s some oddities in the Prismic types

15.45 JP Lots of 400s from Prismic

15.46 JG confirmed it was a model change … It’s a grapQL thing https://wellcomecollection.cdn.prismic.io/api/v2/documents/search?page=1&pageSize=20&ordering[…]D%5Bnot(document.tags%2C%20%5B%22delist%22%5D)%5D%5D The application is asking for excerpt but we removed it. So a deploy should fix it.

15.47 JP why is stage working? Suggested rolling forward.

15.48 AC did the prod deploy succeed? JP yes JG It’s looking for the excerpt field. That doesn’t exist. Normally that’s not an issue using fetchLinks, apparently it is using GraphQL, which the homepage uses.

15.49 JP I’m going to deploy stage -> prod unless anyone objects

15.51 JG way to have avoided this would be:

  • remove the query for excerpts

  • Deploy

  • Remove the prop from the model

  • Deploy to prismic

15.53 updown recovery for Experience: Content: Stories (origin) Experience: Content: Homepage (cached) Experience: Content: Homepage (origin) Experience: Content: Stories (cached)

15.54 Home and stories page back and stable

Analysis of causes

The application is asking for the excerpt field. That doesn’t exist. Normally that’s not an issue using fetchLinks, apparently it is using GraphQL, which the homepage uses.

Also, deploy to prod that AC asked about didn’t happen.

It is possible for an application that’s running to become inconsistent if the model is updated. Model has to be updated first, then the app has to be updated.

Actions

Gareth E

  • Document correct order of steps for updating the Prismic model so we don’t break the site.

Timeline
Analysis of causes
Actions