Incident retro - merging

Incident from: 2021-01-05

Incident till: 2021-01-05

Retro held: 2021-01-06

Timeline

See https://wellcome.slack.com/archives/C8X9YKM5X/p1609857867098400

5 January 2021

11.57 Deployment of API to stage, then prod Checked some works on site and API, especially relations and hierarchies. Hard to know if the correct data is there as there is some data.

Output from weco-deploy:

release ID environment ID deployed date request by description


[snip] 90dffb14-3c8b-42f3-a312-e866f7a191b0 prod yesterday @ 11:57 n.ward@Wellcomecloud.onmicrosoft.com -

14.44 All of the 'View' buttons on archives & manuscripts digitised material seem to have vanished from wc.org/collections?

15.02 Jonathan mentioned Nick on Slack; suggested rolling back

15.10 James joined thread

15.12 Gareth identified problem with API output missing digitalLocation 15.17 Nick pinged Jamie, James and Alice

15.20 James noticed lack of merging

15.22 Jamie said that was due to ID minting

15.23 Nick suggests rolling back

15.41 James and Jamie agree to rolling back

15.44 Index changed back to older index

15.45 James posted a message in main #platform channel

15.59 Jamie suggests using old image, which was done. Old build redeployed

16.32 Correct image on staging

16.34 Pushed to prod

16.44 Nick messaged Alexandra to say everything is back to how it was

Analysis of causes

API data was checked on stage but difficult to know if the correct data is there.

An interface test on the front end would have been able to catch e.g. missing button Also use existing diff tool on output from API Both in deployment

Need to codify the critical things so you can test for those, e.g come up with a list of what we need to check for. Run those checks before a deployment of a reindex

Process

https://github.com/wellcomecollection/docs/blob/master/INCIDENTS.md Trying to assess impact upfront would have helped see this was an incident One person to organise efforts

Actions

How do we identify an incident?

Work out what is critical and list it so they can be checked. How do we keep this up to date? Define acceptance criteria for a release with representation from Product. Have that run as automated tests before releasing to prod. RK/JT/JG

We need a list of examples of works to look at. Put examples of what’s needed into e.g. Gitbook, or maybe integrate into dif tool. JP/NW

Speed up ECS deployment of new tasks for the API JP

Migration needed to fix the ontology-type issue NW/AFC

Last updated