Incident Retros
  • Incident retro - Internal model <=> Elastic index sync question
  • Incident retro - merging
  • Incident retro - Miro images
  • Incident retro - January downtime
  • Incident retro - Elastic Cloud
  • Incident retro - stories and home page down
  • Incident retro - search not available
  • Incident retro - ingestors
  • Incident retro - home page with json
  • Incident retro - slow search
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - cross cluster replication
  • Incident retro - 500'ing on the /images endpoint
  • Incident retro - home page and stories page not available
  • Incident retro - home page and what's on not available
  • Incident retro - requests not showing in account
  • Incident retro - requests not showing in account
  • Incident retro - works search errors
  • Incident retro - date picker
  • Incident retro - requests not showing in account
  • Incident retro - reporting cluster downtime and configuration loss
  • Incident retro - story page appearing then replaced by a 404
  • Incident retro - increased rate of errors in searches on wellcomecollection.org
  • Incident retro - slow search due to 900k messages on the ingestor queue
  • Incident retro - concept pages not available
  • Incident retro - Prismic model changes
  • Incident retro - Images search down
  • Incident retro - wc.org intermittently available
  • Incident retro - web site not available
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - digital assets not available
Powered by GitBook
On this page
  • Timeline
  • Analysis of causes
  • API data was checked on stage but difficult to know if the correct data is there.
  • Process
  • Actions

Incident retro - merging

PreviousIncident retro - Internal model <=> Elastic index sync questionNextIncident retro - Miro images

Last updated 10 months ago

Incident from: 2021-01-05

Incident till: 2021-01-05

Retro held: 2021-01-06

Timeline

See https://wellcome.slack.com/archives/C8X9YKM5X/p1609857867098400

5 January 2021

11.57 Deployment of API to stage, then prod Checked some works on site and API, especially relations and hierarchies. Hard to know if the correct data is there as there is some data.

Output from weco-deploy:

release ID environment ID deployed date request by description


[snip] 90dffb14-3c8b-42f3-a312-e866f7a191b0 prod yesterday @ 11:57 n.ward@Wellcomecloud.onmicrosoft.com -

14.44 All of the 'View' buttons on archives & manuscripts digitised material seem to have vanished from wc.org/collections?

15.02 Jonathan mentioned Nick on Slack; suggested rolling back

15.10 James joined thread

15.12 Gareth identified problem with API output missing digitalLocation 15.17 Nick pinged Jamie, James and Alice

15.20 James noticed lack of merging

15.22 Jamie said that was due to ID minting

15.23 Nick suggests rolling back

15.41 James and Jamie agree to rolling back

15.44 Index changed back to older index

15.45 James posted a message in main #platform channel

15.59 Jamie suggests using old image, which was done. Old build redeployed

16.32 Correct image on staging

16.34 Pushed to prod

16.44 Nick messaged Alexandra to say everything is back to how it was

Analysis of causes

API data was checked on stage but difficult to know if the correct data is there.

An interface test on the front end would have been able to catch e.g. missing button Also use existing diff tool on output from API Both in deployment

Need to codify the critical things so you can test for those, e.g come up with a list of what we need to check for. Run those checks before a deployment of a reindex

Process

https://github.com/wellcomecollection/docs/blob/master/INCIDENTS.md Trying to assess impact upfront would have helped see this was an incident One person to organise efforts

Actions

How do we identify an incident?

Work out what is critical and list it so they can be checked. How do we keep this up to date? Define acceptance criteria for a release with representation from Product. Have that run as automated tests before releasing to prod. RK/JT/JG

We need a list of examples of works to look at. Put examples of what’s needed into e.g. Gitbook, or maybe integrate into dif tool. JP/NW

Speed up ECS deployment of new tasks for the API JP

Migration needed to fix the ontology-type issue NW/AFC

Timeline
Analysis of causes
Actions