Incident Retros
  • Incident retro - Internal model <=> Elastic index sync question
  • Incident retro - merging
  • Incident retro - Miro images
  • Incident retro - January downtime
  • Incident retro - Elastic Cloud
  • Incident retro - stories and home page down
  • Incident retro - search not available
  • Incident retro - ingestors
  • Incident retro - home page with json
  • Incident retro - slow search
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - cross cluster replication
  • Incident retro - 500'ing on the /images endpoint
  • Incident retro - home page and stories page not available
  • Incident retro - home page and what's on not available
  • Incident retro - requests not showing in account
  • Incident retro - requests not showing in account
  • Incident retro - works search errors
  • Incident retro - date picker
  • Incident retro - requests not showing in account
  • Incident retro - reporting cluster downtime and configuration loss
  • Incident retro - story page appearing then replaced by a 404
  • Incident retro - increased rate of errors in searches on wellcomecollection.org
  • Incident retro - slow search due to 900k messages on the ingestor queue
  • Incident retro - concept pages not available
  • Incident retro - Prismic model changes
  • Incident retro - Images search down
  • Incident retro - wc.org intermittently available
  • Incident retro - web site not available
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - digital assets not available
Powered by GitBook
On this page
  • Timeline
  • Analysis of causes
  • Actions

Incident retro - story page appearing then replaced by a 404

PreviousIncident retro - reporting cluster downtime and configuration lossNextIncident retro - increased rate of errors in searches on wellcomecollection.org

Last updated 10 months ago

Incident from: 2022-09-09

Incident until: 2022-09-09

Retro held: 2022-09-16

Timeline

See https://wellcome.slack.com/archives/C8X9YKM5X/p1662729661355199

14.21 DB identifies the initial issue

14.21 DB this is an odd one, via digital@. this page https://wellcomecollection.org/articles/YdQ6AhAAAJMQ5mEg (a story page, from 2022) is appearing, and then disappearing, to be replaced by a 404

14.31 AC made an initial guess at the fix, created PR #8434 Remember to serialise API results using superjson

14.55 PR is reviewed and being deployed

15.14 AC may need to do a CloudFront invalidation (JP said they would do it)

15.14 Initial fix is deployed and on staging + prod; similar issue identified

16.11 DB still seeing the error on a different page; this time an

16.13 JP patch needs to be applied to exhibitions too

16.15 JP confirms he’s run the CloudFront invalidation

16.22 AC I think the bug is somewhere in the interaction between fetchExhibitionRelatedContentClientSide in https://github.com/wellcomecollection/wellcomecollection.org/blob/main/content/webapp/services/prismic/fetch/exhibitions.ts and the API In particular I think we want to be using superjson to parse the response

16.31 AC The specific bug here is in the client-side fetchers we have for getting related content, eg the list of “read more” on an article That gets fetched asynchronously so as not to block the main page load, which means it doesn't go through getServerSideProps If the value includes a Date which gets serialised as a string, it may break the downstream components when they try to call date related methods on it We've fixed this in getServerSideProps with our superjson plugin, which knows how to serialise/deserialise Dates properly And when that was working, I removed all the casting of Date values to Date, because I thought it was redundant Thus exposing these API endpoints as a place where it wasn't being handled correctly

16.35 AC One thing we should capture as a result if this bug (next week!) is a list of URLs where these api endpoints return non-empty lists There’s been a comment for a while that says “I think this might be broken”, but lack of known examples made it hard to test

16.35 Issues deploying to prod because of failing e2e tests, due to a bug in the initial fix

16.39 JP pushed PR to prod

16.43 JP e2e failing still a JSON error; this is not as bad as the page disappearing so going to override e2e tests and push to prod fyi - the URL checker is noticing this error because it is uncaught; the other one is “helpfully” caught by Next.js and turned into a readable error message

16.45 Better fix opened that will fix e2e tests

This is an error in the original patch

16.47 JP oh my god, weco-deploy erroring locally “Expected maxsize to be an integer or None”

16.48 AC weco-deploy bug needs a newer Python

16.48 Better fix is merged and enters deployment process

16.48 JP merges the second PR #8436, which starts rolling out to stage

16.59 PR #8436 is deployed and fixed in stage 17.03 prod deployment of #8436 is happening

17.07 JP Also fixed for me

17.07 Incident done

Tuesday 13 September https://wellcome.slack.com/archives/C01FBFSDLUA/p1663075909007449 Enquirer reported they were having the same “Not the Wellcome you were expecting” error page. JP saw the same using Safari v 13. MR web site working in Safari v14; RC working in v15.

15.48 JP reported fixed

Analysis of causes

Bug in datetime handling logic, and not caught by tests.

We don’t have good visibility on client side errors.

Actions

Alex

  • Look at how to have more resilience around related content? API failure is handled correctly

  • Have a way for e2e tests to run and a person can override them for deployment

Alex and Danny

  • Make a note of example pages where related content appears so we can do manual testing

Gareth

  • Investigate an approach for client side errors

16.45 AC identified error: superjson expects text, not an already-parsed JSON object

#8436
exhibition page
Timeline
Analysis of causes
Actions