# Incident retro - updates not getting through to works on wellcomecollection.org

**Incident from:** 2024-10-07

**Incident until:** 2024-10-10

**Retro held:** 2024-10-11

* [Timeline](#timeline)
* [Analysis of causes](#analysis-of-causes)
* [Actions](#actions)

## Timeline

### Monday 7 October 2024

See <https://wellcome.slack.com/archives/C02ANCYL90E/p1728486497236829>

14.10 A catalogue-pipeline PR updating the Flyway package to version 7 is merged to main, triggering a deploy process. The id\_minter service stops working shortly after.

### Wednesday 9 October 2024

11.03 NC mentioned in #wc-platform-feedback that edits made to a Sierra record on 7 October were not yet showing up on our site

RK checked the dashboard and saw no problem with the DL queue

\~ 14.00 AR also reported updates not showing

RK checked his previous work on the id\_minter

16.09 RK found that the id\_minter had 1.09k messages on its queue; the oldest message was from \~14.00 on Mon 7 Oct

16.13 RK reported that updates were piling up at the id\_minter \[as a result of bumping flywaydb from 4.2 to 10.18]

RK increased the window for the queue to 14 days from 24 hours

### Thursday 10 October 2024

10.30ish Deploy delayed by failing ingestor images tests

11.50 Deployed PR to reconfigure Flyway to use the previous schema version table name\
DL queue messages consumed and processed

## Analysis of causes

Flyway stores its own table in the database for tracking schema changes. In version 4, the default name for this table was schema\_version. However, in version 5, the default changed to flyway\_schema\_history.

We recently updated to Flyway 7, which resulted in Flyway not being able to find this table (because it was looking for the new default name).
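The mismatch can be illustrated with a toy model. This is a minimal sketch in Python using an in-memory SQLite database, not Flyway's own code: the helper `history_table_exists` is hypothetical, and only the two table names come from the analysis above. It shows why a lookup under the new default name misses a table created under the old one, and why pointing the lookup back at the old name (Flyway exposes a `table` setting for this) finds it again.

```python
import sqlite3

OLD_DEFAULT = "schema_version"         # Flyway 4 default table name
NEW_DEFAULT = "flyway_schema_history"  # default table name from Flyway 5 onwards


def history_table_exists(conn: sqlite3.Connection, table_name: str) -> bool:
    """Hypothetical helper: return True if a table with this name exists."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


# A database whose history table was created under the old default name.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {OLD_DEFAULT} (version TEXT, description TEXT)")

# Looking under the new default name misses the existing table...
found_with_new_default = history_table_exists(conn, NEW_DEFAULT)
# ...while looking under the previous name finds it.
found_with_old_name = history_table_exists(conn, OLD_DEFAULT)

print(found_with_new_default, found_with_old_name)
```

How the real service sets the table name (configuration file or API call) is not stated in this retro, so the sketch deliberately stops at the lookup itself.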

The deploy of the fix PR was delayed by failing tests.

Messages had been on the queue for a day, which shouldn’t happen.

The id\_minter service wasn’t stable.

## Actions

**RK**

* Manually re-run updates to pick up changes from Mon 7 afternoon for specific items if requested (in progress)

**TBC - take to next planning**

* Run a reindex to pick up changes from Mon 7 afternoon
* Update monitoring so this isn’t missed again
  * Surface date of oldest message
  * Show the status of deployment service
* Extend the message retention period, e.g. to 3 or 4 days, to save us if something like this happens on a Friday and we don't catch it until Monday
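
The "surface date of oldest message" idea above could be sketched as follows. This is an illustrative Python example only: the function names and the one-hour threshold are assumptions, and the timestamps are taken from the timeline (oldest message from \~14.00 on Mon 7 Oct, checked at 16.09 on Wed 9 Oct). If the queues are on SQS (an assumption; this retro does not say), CloudWatch already publishes an `ApproximateAgeOfOldestMessage` metric that could feed the same kind of check.

```python
from datetime import datetime, timedelta


def oldest_message_age(sent_times, now):
    """Return the age of the oldest message, or None if the queue is empty."""
    if not sent_times:
        return None
    return now - min(sent_times)


def should_alert(sent_times, now, max_age=timedelta(hours=1)):
    """Alert when the oldest message has waited longer than max_age.

    The one-hour default is an assumed threshold, not an agreed value.
    """
    age = oldest_message_age(sent_times, now)
    return age is not None and age > max_age


# Timestamps from the timeline above (local time, approximate).
sent_times = [datetime(2024, 10, 7, 14, 0), datetime(2024, 10, 9, 11, 3)]
checked_at = datetime(2024, 10, 9, 16, 9)

age = oldest_message_age(sent_times, checked_at)
print(age)                                   # 2 days, 2:09:00
print(should_alert(sent_times, checked_at))  # True
```

Alerting on the age of the oldest message, rather than on queue depth alone, would flag a stalled consumer even when few messages are arriving.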

