Incident retro - 500'ing on the /images endpoint
Incident from: 2021-11-15
Incident until: 2021-11-15
Retro held: 2021-11-17
Timeline
15 November 2021
See https://wellcome.slack.com/archives/C01FBFSDLUA/p1636992374000200
PR to add thumbnails to API
16:04 catalogue-api-gateway-5xx-alarm There were 4.0 errors in the API
Alert repeated every 2-3 minutes until 16.57
16.06 AC We are reliably 500'ing on the /images endpoint: https://api.wellcomecollection.org/catalogue/v2/images?aggregations=locations.license,source.genres.label,source.contributors.agent.label&pageSize=25&query=aids
,Attempt to decode value on failed cursor, List(DownField(type), DownField(thumbnail), DownField(data), DownField(canonicalWork), DownField(source)))
I suspect this is my change in internal model to make thumbnail a DigitalLocation – the API still expects that value to be a Location, so it’s looking for a type discriminator in type – but the ingestors are no longer setting that, because they see the type as unambiguous
The fix is to roll the version of internal model in the API merged a change to add the license aggregation to the /images API
16.07 AFC I deployed the pipeline thinking it wasn’t a breaking change AC also didn’t think it was a breaking change
16.08 JG We’re still serving OK - assuming it’s because it’s only on some records? AFC I think it’s on everything that has a thumbnail, so quite a lot (edited)
Couldn’t roll back because it’s on the pipeline side; have to roll the API forward
16.10 AC I would suggest:
Roll the internal model in the API
Deploy that
I’m 98% sure that will fix it And I can explain the issue in a bit more detail in a bit
16.26 AC / AFC Some image queries would return a persistent error or queries that include those images in the results
16.31 AC We used to model thumbnail: Location
, so it would be serialised as {…, "type": "DigitalLocation"}
– this is how the API knows which flavour of Location it should deserialise Now we model thumbnail: DigitalLocation
so it doesn’t get serialised by the ingestor with the "type"
value. Then the API code doesn’t know how to interpret it.
16:31 It’s not fixed yet, but it will be shortly
16.49 AC Fix is rolling out now, alerts should be silencing shortly
16.57 catalogue-api-prod-5xx-alarm There were 2.0 errors in the API
This was the final alert
17:04 NP No alerts for 5 mins ... all okay now?
17:25 AFC sorry natalie, it’s deployed so I think it’s fixed
Analysis of causes
The pipeline model had moved ahead of API: Change in internal model to make thumbnail a DigitalLocation – the API still expects that value to be a Location, so it’s looking for a type discriminator in type – but the ingestors are no longer setting that, because they see the type as unambiguous
Actions
No actions.
Last updated