Incident retro - 500'ing on the /images endpoint


Last updated 10 months ago

Incident from: 2021-11-15

Incident until: 2021-11-15

Retro held: 2021-11-17

Timeline

15 November 2021

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1636992374000200

PR to add thumbnails to API

16:04 catalogue-api-gateway-5xx-alarm There were 4.0 errors in the API

Alert repeated every 2-3 minutes until 16.57

16.06 AC We are reliably 500'ing on the /images endpoint: https://api.wellcomecollection.org/catalogue/v2/images?aggregations=locations.license,source.genres.label,source.contributors.agent.label&pageSize=25&query=aids

,Attempt to decode value on failed cursor, List(DownField(type), DownField(thumbnail), DownField(data), DownField(canonicalWork), DownField(source)))
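The error above can be read as a path into the JSON document. The sketch below is only an illustration, not code from the incident: it assumes the decoder lists its cursor steps most recent first (as circe, the Scala JSON library that produces this message, does), so reversing the DownField entries gives the field the API failed to read. The helper name `failed_path` is hypothetical.

```python
import re

# Error text copied from the timeline entry above (its leading part was not captured).
ERROR = (
    "Attempt to decode value on failed cursor, "
    "List(DownField(type), DownField(thumbnail), DownField(data), "
    "DownField(canonicalWork), DownField(source))"
)


def failed_path(message: str) -> str:
    """Return the dotted JSON path the decoder was reading when it failed.

    Assumption: the cursor history is listed most recent step first,
    so it is reversed here to read from the document root downwards.
    """
    fields = re.findall(r"DownField\(([^)]+)\)", message)
    return ".".join(reversed(fields))


print(failed_path(ERROR))  # source.canonicalWork.data.thumbnail.type
```

Read this way, the failure is on the `type` field of `thumbnail`, which matches the explanation that follows in the timeline.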

I suspect this is my change in internal model to make thumbnail a DigitalLocation – the API still expects that value to be a Location, so it’s looking for a type discriminator in type – but the ingestors are no longer setting that, because they see the type as unambiguous

The fix is to roll the version of internal model in the API. A change to add the license aggregation to the /images API had been merged.

16.07 AFC I deployed the pipeline thinking it wasn’t a breaking change. AC also didn’t think it was a breaking change.

16.08 JG We’re still serving OK - assuming it’s because it’s only on some records?

AFC I think it’s on everything that has a thumbnail, so quite a lot

Couldn’t roll back because it’s on the pipeline side; have to roll the API forward

16.10 AC I would suggest:

  1. Roll the internal model in the API

  2. Deploy that

I’m 98% sure that will fix it. And I can explain the issue in a bit more detail in a bit.

16.26 AC / AFC Queries for some images, or queries that include those images in their results, would return a persistent error

16.31 AC We used to model thumbnail: Location, so it would be serialised as {…, "type": "DigitalLocation"} – this is how the API knows which flavour of Location it should deserialise. Now we model thumbnail: DigitalLocation, so it doesn’t get serialised by the ingestor with the "type" value. Then the API code doesn’t know how to interpret it.
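The mismatch described above can be reproduced in a few lines. This is a minimal Python sketch of the idea, not the actual Scala code: the `url` field, the function names and the `DECODERS` table are all hypothetical, and only the "type" discriminator and the DigitalLocation name come from the explanation above.

```python
from dataclasses import asdict, dataclass


@dataclass
class DigitalLocation:
    url: str  # hypothetical field, for illustration only


def encode_as_location(loc: DigitalLocation) -> dict:
    """Old ingestor behaviour: the field is typed as the general Location,
    so a "type" discriminator is written alongside the data."""
    return {**asdict(loc), "type": "DigitalLocation"}


def encode_as_digital_location(loc: DigitalLocation) -> dict:
    """New ingestor behaviour: the field is typed as DigitalLocation,
    so the type is unambiguous and no discriminator is written."""
    return asdict(loc)


# Decoder as the API would see it while still on the old model:
# it needs "type" to choose which flavour of Location to build.
DECODERS = {"DigitalLocation": lambda doc: DigitalLocation(url=doc["url"])}


def decode_location(doc: dict) -> DigitalLocation:
    kind = doc.get("type")
    if kind not in DECODERS:
        raise ValueError(f"cannot decode Location: missing or unknown type {kind!r}")
    return DECODERS[kind](doc)


thumbnail = DigitalLocation(url="https://example.org/thumb.jpg")

# Old pipeline output decodes fine.
assert decode_location(encode_as_location(thumbnail)) == thumbnail

# New pipeline output has no "type", so the old decoder rejects it.
try:
    decode_location(encode_as_digital_location(thumbnail))
except ValueError as err:
    print(err)
```

In this sketch the data itself is unchanged between the two encoders; only the discriminator is missing, which is why updating the reader to the new model is enough to resolve the error.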

16:31 It’s not fixed yet, but it will be shortly

16.49 AC Fix is rolling out now, alerts should be silencing shortly

16.57 catalogue-api-prod-5xx-alarm There were 2.0 errors in the API

This was the final alert

17:04 NP No alerts for 5 mins ... all okay now?

17:25 AFC Sorry Natalie, it’s deployed so I think it’s fixed

Analysis of causes

The pipeline model had moved ahead of the API. A change in internal model made thumbnail a DigitalLocation. The API still expected that value to be a Location, so it looked for a type discriminator in type, but the ingestors were no longer setting that, because they saw the type as unambiguous.

Actions

No actions.
