Incident retro - Internal model <=> Elastic index sync question
Incident from: 2020-01-07
Incident till: 2020-01-07
Retro held: 2020-01-15
Summary
An application (the API in this case) was deployed with an internal model that was incompatible with the index that the application's environment was pointing to.
e.g. where JSON from the v2-20191115 index cannot be deserialised into model v3.
This led the API to start returning 500s. It was particularly tricky because the piece of the model that changed wasn't present in many documents, so errors only occurred when a work containing the offending data was surfaced as a result and the app attempted to deserialise it.
For context, this was the change to the model.
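To make the failure mode concrete, here is a minimal sketch (the model shape and field names are hypothetical, not the real catalogue model): a document written by the v2 pipeline is missing a field that the v3 model requires, so deserialisation fails at read time, long after the deployment itself succeeded.

```python
import json
from dataclasses import dataclass

# Hypothetical v3 model: requires a field that older documents don't have.
@dataclass
class WorkV3:
    id: str
    title: str
    license: dict  # new in v3 (illustrative only)

def deserialise_v3(raw: str) -> WorkV3:
    data = json.loads(raw)
    # Raises KeyError for documents written by the v2 pipeline, which is
    # why the 500s only appeared when such a work was returned in results.
    return WorkV3(id=data["id"], title=data["title"], license=data["license"])

# A document shaped like those in the v2-20191115 index, missing the new field.
v2_doc = '{"id": "abc123", "title": "Dinosaurs"}'

try:
    deserialise_v3(v2_doc)
except KeyError as err:
    print(f"Cannot deserialise v2 document into model v3: missing field {err}")
```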
This error also flagged up a number of other parts of the system that are not currently working at the level we need them to, namely error reporting, error logging, and deployment transparency and rollback (each covered under Analysis of causes below).
Timeline
Tue 7 Jan 2020
Change merged that added licences and changed one licence (the change couldn't be parsed)
Wed 8 Jan 2020 (daytime)
Fix for relevancy deployed
Checks went through and seemed fine
23.20
Try to search “dinosaurs” on /works, and get an error page back.
Go to Slack, and check #wc-platform, #wc-data, #wc-experience and #wc-platform-alerts. No messages or alerts that would indicate an error.
Try to find the logs in the platform account. I can’t find the API at all.
Log into Elastic Cloud at cloud.elastic.co. Click on the logging cluster, try to launch Kibana … boom, no logs for me!
Realise the API is in the catalogue account now, go to terraform to hunt issues.
Get into the catalogue account, search the logs. Search for “500”, see a lot of 500 errors accumulating in the API Gateway logs and nginx.
Check the ECS tasks. Oh look, it got redeployed when the bugs started.
Write this all up in Slack and a GitHub ticket.
Thu 9 Jan 2020
Went through the above, tried rolling back to a potentially previously working version
Guessed that it may be the model breaking; pointed the app at the staging index, which fixed it
Put the fix into code and committed it
Analysis of causes
The Model-Index-Sync question
Currently we index into a new index when we start up a new pipeline, which is often triggered by a transformer or model change.
We then need to remember that any new changes to the API will have to reference this new index, as it will be using the new model.
This has caused issues in the past when the two have gone out of sync.
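One way to make the sync explicit is a startup check. This is only a sketch under the assumption that index names embed the model version they were built for; that naming convention, and the constants below, are illustrative, not the real pipeline configuration.

```python
# Assumed convention: index names embed the model version, e.g. "works-v3-20200107".
MODEL_VERSION = "v3"                     # model version this API build understands
CONFIGURED_INDEX = "works-v2-20191115"   # index the environment points at

def check_model_index_sync(index_name: str, model_version: str) -> None:
    # Fail fast at startup rather than returning 500s at query time
    # when an incompatible document is eventually surfaced.
    if f"-{model_version}-" not in index_name:
        raise RuntimeError(
            f"Index {index_name!r} was not built for model {model_version!r}; refusing to start."
        )

check_model_index_sync(CONFIGURED_INDEX, MODEL_VERSION)  # raises RuntimeError here
```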
Error reporting
CloudWatch alarms were triggered in the Catalogue AWS account from API gateway.
These then tried to post to a topic in the Platform AWS account, to which they had no permissions. The topic has a Lambda subscribed to it that then posts to Slack.
The Slack notification never happened due to the permission issue.
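In practice this permission lives in Terraform, but as a hedged sketch of the missing piece (account IDs, region, and topic ARN are placeholders): the topic in the Platform account needs a resource policy allowing CloudWatch in the Catalogue account to publish to it.

```python
import json
import boto3

# Placeholder identifiers, not the real accounts or topic.
PLATFORM_TOPIC_ARN = "arn:aws:sns:eu-west-1:111111111111:platform-alerts"
CATALOGUE_ACCOUNT_ID = "222222222222"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow CloudWatch alarms in the Catalogue account to publish
            # to the alerting topic that lives in the Platform account.
            "Effect": "Allow",
            "Principal": {"Service": "cloudwatch.amazonaws.com"},
            "Action": "sns:Publish",
            "Resource": PLATFORM_TOPIC_ARN,
            "Condition": {"StringEquals": {"aws:SourceAccount": CATALOGUE_ACCOUNT_ID}},
        }
    ],
}

sns = boto3.client("sns", region_name="eu-west-1")
sns.set_topic_attributes(
    TopicArn=PLATFORM_TOPIC_ARN,
    AttributeName="Policy",
    AttributeValue=json.dumps(policy),
)
```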
Thoughts
Should we decrease the number of steps from CloudWatch -> us? (The Lambda step is sketched after this list.) The chain is currently:
CloudWatch
Topic
Lambda
Slack
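For reference, the Lambda step in that chain is roughly this shape: a minimal sketch assuming an SNS-triggered function posting the alarm to a Slack incoming webhook (the webhook URL is a placeholder, and the real function may differ).

```python
import json
import os
import urllib.request

# Placeholder: the real webhook URL would come from config or a secret.
SLACK_WEBHOOK_URL = os.environ.get(
    "SLACK_WEBHOOK_URL", "https://hooks.slack.com/services/XXX/YYY/ZZZ"
)

def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in each record.
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: {alarm.get('AlarmName', 'unknown alarm')} is "
            f"{alarm.get('NewStateValue', '?')}: {alarm.get('NewStateReason', '')}"
        )
        payload = json.dumps({"text": text}).encode("utf-8")
        request = urllib.request.Request(
            SLACK_WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(request)
```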
Error logging
The logs were non-descriptive and held in multiple places. This made it hard to work out what the problem was.
Thoughts
The logs should describe the error, especially if we know that this error can occur
We should have the logs reporting into the Elasticsearch logging cluster
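As a sketch of what "describe the error" could look like (the field names are illustrative, not an agreed schema): logging the index, document id, and the deserialisation error as one structured JSON object makes the failure searchable once the logs reach the logging cluster.

```python
import json
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("api")

def log_deserialisation_error(index: str, document_id: str, err: Exception) -> None:
    # One JSON object per error, so the logging cluster can index the fields
    # instead of a bare stack trace split across multiple log lines.
    logger.error(json.dumps({
        "event": "deserialisation_failure",
        "index": index,
        "document_id": document_id,
        "error": repr(err),
    }))

log_deserialisation_error("works-v2-20191115", "abc123", KeyError("license"))
```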
Deployment transparency and rollback
When looking to roll back, it was not clear where to roll back to. This information is available in the release tool infrastructure (Dynamo). When releases are made we also know through ECS events that they are occurring, which is how @alexwlchan debugged this problem.
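As a sketch of the ECS side of this (cluster and service names are placeholders): a service's deployment list shows which task definitions are running and when they started, which is the information needed when deciding where to roll back to.

```python
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# Placeholder names; the real cluster/service live in the catalogue account.
response = ecs.describe_services(cluster="catalogue-api", services=["search-api"])

for deployment in response["services"][0]["deployments"]:
    # PRIMARY is the deployment currently rolling out; ACTIVE entries are
    # older ones still running, i.e. candidate rollback targets.
    print(deployment["status"], deployment["taskDefinition"], deployment["createdAt"])
```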
Actions