Incident retro - merging
Incident from: 2021-01-05
Incident till: 2021-01-05
Retro held: 2021-01-06
Timeline
See https://wellcome.slack.com/archives/C8X9YKM5X/p1609857867098400
5 January 2021
11.57 Deployment of API to stage, then prod. Checked some works on site and API, especially relations and hierarchies. Hard to know whether the correct data is there, because some data is present.
Output from weco-deploy:
release ID | environment ID | deployed date | request by | description
[snip]
90dffb14-3c8b-42f3-a312-e866f7a191b0 | prod | yesterday @ 11:57 | n.ward@Wellcomecloud.onmicrosoft.com | -
14.44 All of the 'View' buttons on archives & manuscripts digitised material seem to have vanished from wc.org/collections?
15.02 Jonathan mentioned Nick on Slack; suggested rolling back
15.10 James joined thread
15.12 Gareth identified problem with API output missing digitalLocation
15.17 Nick pinged Jamie, James and Alice
15.20 James noticed lack of merging
15.22 Jamie said that was due to ID minting
15.23 Nick suggests rolling back
15.41 James and Jamie agree to rolling back
15.44 Index changed back to older index
15.45 James posted a message in main #platform channel
15.59 Jamie suggests using old image, which was done. Old build redeployed
16.32 Correct image on staging
16.34 Pushed to prod
16.44 Nick messaged Alexandra to say everything is back to how it was
Analysis of causes
API data was checked on stage, but it was difficult to know whether the correct data was there.
An interface test on the front end would have caught e.g. the missing button.
Also use the existing diff tool on output from the API.
Both should run as part of deployment.
Need to codify the critical things so we can test for them, e.g. come up with a list of what we need to check for. Run those checks before deploying a reindex.
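One way the list of critical things could be codified is as a small set of named checks run against example works from the API. The sketch below is illustrative only: the response shape (items, locations, and a type of "DigitalLocation") and the names has_digital_location, CRITICAL_CHECKS and run_critical_checks are assumptions made for the example, not the team's actual code or a confirmed API contract.

```python
from typing import Callable, Dict, List

Work = dict
Check = Callable[[Work], bool]


def has_digital_location(work: Work) -> bool:
    """True if any item on the work has a location of type DigitalLocation.

    Assumed response shape: work["items"][i]["locations"][j]["type"].
    Adjust to whatever the real API returns.
    """
    return any(
        location.get("type") == "DigitalLocation"
        for item in work.get("items", [])
        for location in item.get("locations", [])
    )


# Hypothetical registry of critical checks, keyed by a human-readable name.
CRITICAL_CHECKS: Dict[str, Check] = {
    "digitised work has a digital location": has_digital_location,
}


def run_critical_checks(works: List[Work]) -> List[str]:
    """Return one failure message per (work, check) pair that does not pass."""
    failures = []
    for work in works:
        for name, check in CRITICAL_CHECKS.items():
            if not check(work):
                failures.append(f"{work.get('id', '<no id>')}: {name}")
    return failures


if __name__ == "__main__":
    # Made-up example works, standing in for responses fetched from stage.
    works = [
        {"id": "a", "items": [{"locations": [{"type": "DigitalLocation"}]}]},
        {"id": "b", "items": [{"locations": [{"type": "PhysicalLocation"}]}]},
    ]
    for failure in run_critical_checks(works):
        print(failure)
```

An empty list from run_critical_checks would mean every example work passed every check. A non-empty list could block the deployment to prod.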
Process
https://github.com/wellcomecollection/docs/blob/master/INCIDENTS.md
Trying to assess impact upfront would have helped us see that this was an incident.
One person to organise efforts.
Actions
How do we identify an incident?
Work out what is critical and list it so it can be checked. How do we keep this up to date? Define acceptance criteria for a release with representation from Product. Have those run as automated tests before releasing to prod. RK/JT/JG
We need a list of examples of works to look at. Put examples of what’s needed into e.g. Gitbook, or maybe integrate into the diff tool. JP/NW
Speed up ECS deployment of new tasks for the API JP
Migration needed to fix the ontology-type issue NW/AFC
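For the list of example works, comparing the API output for the same work before and after a release could look roughly like the sketch below. It is a stand-in written for illustration, not the existing diff tool. The function name diff_json and the sample payloads are hypothetical.

```python
from typing import Any, List


def diff_json(old: Any, new: Any, path: str = "") -> List[str]:
    """Recursively compare two JSON-like values and describe each difference."""
    if isinstance(old, dict) and isinstance(new, dict):
        changes: List[str] = []
        for key in sorted(set(old) | set(new)):
            child = f"{path}.{key}" if path else key
            if key not in new:
                changes.append(f"removed {child}")
            elif key not in old:
                changes.append(f"added {child}")
            else:
                changes.extend(diff_json(old[key], new[key], child))
        return changes
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        # Same-length lists are compared element by element.
        changes = []
        for i, (o, n) in enumerate(zip(old, new)):
            changes.extend(diff_json(o, n, f"{path}[{i}]"))
        return changes
    # Scalars, mismatched types, or lists of different length: report as a whole.
    if old == new:
        return []
    return [f"changed {path or '<root>'}: {old!r} -> {new!r}"]


if __name__ == "__main__":
    # Made-up payloads for one example work, before and after a release.
    before = {"id": "w1", "items": [{"locations": [{"type": "DigitalLocation"}]}]}
    after = {"id": "w1", "items": [{"locations": []}]}
    for line in diff_json(before, after):
        print(line)
```

Running this over each example work would flag a change such as a location disappearing from an item, which is the kind of difference that went unnoticed in this incident.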