Incident retro - users cannot login to their accounts on wellcomecollection.org

Incident from: 2024-10-07

Incident until: 2024-10-10

Retro held: 2024-10-11

Timeline

Monday 7 October 2024

See https://wellcome.slack.com/archives/C02ANCYL90E/p1728486497236829

14.10 A catalogue-pipeline PR updating the Flyway package to version 7 is merged to main, triggering a deploy process. The id_minter service stops working shortly after.

Wednesday 9 October 2024

11.03 NC mentioned in #wc-platform-feedback that edits to a Sierra record from 7 October were not yet showing up on our site

RK checked dashboard and saw no problem with DL queue

~ 14.00 AR also reported updates not showing

RK checked his previous work with id minter

16.09 RK: id_minter has 1.09k messages on the queue - oldest message was from ~14.00 on Mon 7 Oct

16.13 RK: updates are piling up at the id_minter [as a result of bumping flywaydb from 4.2 to 10.18]

RK increased the window for the queue to 14 days from 24 hours

Thu 10 October 2024

10.30ish Deploy delayed by failing ingestor images tests

11.50 Deployed PR to reconfigure Flyway to use the previous schema version table name

DL queue messages consumed and processed

Analysis of causes

Flyway stores its own table in the database for tracking schema changes. In version 4, the default name for this table was schema_version. However, in version 5, the default changed to flyway_schema_history.

We recently updated to Flyway 7. Our existing table still had the old name, schema_version, so Flyway could not find it (it was looking for the new default name, flyway_schema_history).
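
The mismatch described above can be checked for directly. The following is a minimal sketch, not the code used in id_minter: it uses Python's built-in sqlite3 as a stand-in for whatever database id_minter actually uses, and the function names are hypothetical. It reports which of the two Flyway history table names exists, and whether the old name needs to be configured explicitly.

```python
import sqlite3
from typing import List, Optional

# Default names Flyway has used for its schema history table:
# "schema_version" before version 5, "flyway_schema_history" from version 5.
LEGACY_TABLE = "schema_version"
CURRENT_TABLE = "flyway_schema_history"


def existing_history_tables(conn: sqlite3.Connection) -> List[str]:
    """Return which of the two known history table names exist (sqlite stand-in)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name IN (?, ?)",
        (LEGACY_TABLE, CURRENT_TABLE),
    ).fetchall()
    return sorted(name for (name,) in rows)


def table_setting_needed(conn: sqlite3.Connection) -> Optional[str]:
    """Return the table name to configure explicitly, or None if the default is fine."""
    if existing_history_tables(conn) == [LEGACY_TABLE]:
        # Only the pre-version-5 table exists, so a newer Flyway looking for
        # the new default name would not find the migration history.
        return LEGACY_TABLE
    return None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schema_version (version TEXT, description TEXT)")
    print(table_setting_needed(conn))  # schema_version
```

When only the old table exists, the fix is to point Flyway at it through its table setting (flyway.table in configuration files), which matches what the PR deployed on Thursday did.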

PR deploy delayed by failing tests

Messages had been there for a day which shouldn’t happen

ID minter service wasn’t stable

Actions

RK

  • Manually re-run updates to pick up changes from Mon 7 afternoon for specific items if requested (in progress)

TBC - take to next planning

  • Run a reindex to pick up changes from Mon 7 afternoon

  • Update monitoring so this isn’t missed again

    • Surface date of oldest message

    • Show the status of deployment service

  • Extend the message retention period, e.g. to 3 or 4 days, to save us if something like this happens on a Friday and we don't catch it till Monday
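
The monitoring and retention actions above come down to simple arithmetic on the age of the oldest message. The sketch below is illustrative only: the function names and the one-hour alert threshold are assumptions, not existing monitoring code. If the queue is on SQS, the CloudWatch metric ApproximateAgeOfOldestMessage could supply the same age.

```python
from datetime import datetime, timedelta


def should_alert(oldest_sent: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the oldest message has been waiting longer than max_age."""
    return now - oldest_sent > max_age


def retention_margin(oldest_sent: datetime, now: datetime, retention: timedelta) -> timedelta:
    """Time left before the oldest message expires (negative if already expired)."""
    return retention - (now - oldest_sent)


if __name__ == "__main__":
    # This incident: oldest message ~14.00 Mon 7 Oct, noticed 16.09 Wed 9 Oct.
    # The one-hour threshold is a hypothetical value for illustration.
    print(should_alert(datetime(2024, 10, 7, 14, 0),
                       datetime(2024, 10, 9, 16, 9),
                       timedelta(hours=1)))  # True

    # Friday scenario: failure at 14.00 on Friday, noticed 09.00 on Monday.
    friday = datetime(2024, 10, 11, 14, 0)
    monday = datetime(2024, 10, 14, 9, 0)
    for days in (3, 4):
        margin = retention_margin(friday, monday, timedelta(days=days))
        print(days, margin.total_seconds() / 3600)  # 3 5.0, then 4 29.0
```

In the Friday scenario the oldest message is 67 hours old by Monday morning, so a 3-day retention period leaves only 5 hours of margin while 4 days leaves 29.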
