# Incident retro - users cannot login to their accounts on wellcomecollection.org

**Incident from:** 2024-07-25

**Incident until:** 2024-07-25

**Retro held:** 2024-07-26

* [Timeline](#timeline)
* [Analysis of causes](#analysis-of-causes)
* [Actions](#actions)

## Timeline

### Thursday 25 July

Wellcome firewall migration

See <https://wellcome.slack.com/archives/C01FBFSDLUA/p1721903759248609>

10.00 NP mentioned Sierra and CALM issues in stand up, that users couldn’t log in to the desktop apps. But didn’t think it would affect our teams

10.14 Email to various library staff from Collections Systems Support (CSS) to restart laptops/refresh VPN connection if you have problems accessing Sierra and CALM

11.14 #wc-plaform-alerts:\
platform-lambda-error-alarm: sierra-adapter-20200604-sierra-reader-orders\
platform-lambda-error-alarm: sierra-adapter-20200604-sierra-reader-holdings\
platform-lambda-error-alarm: sierra-adapter-20200604-sierra-reader-items\
platform-lambda-error-alarm: sierra-adapter-20200604-sierra-reader-bibs

11.15 #wc-plaform-alerts: Catalogue API errors

11.19 #wc-plaform-alerts: identity-api-prod-5xx-alarm

11.20 #wc-plaform-alerts: Catalogue API errors\
11.22 #wc-plaform-alerts: Catalogue API errors until last at 12.17

Failed login errors until last at 12.11

11.24 CSS email saying still working on Sierra desktop client issue but users can log in to the browser version

11.33 RC reported users couldn’t login on the web site

11.36 RC @weco\_devs starting a chat here; Logging in is down, and we're having issues with the Catalogue API

11.37 RC It is, in turn, what fails our e2es so we can not worry about that for now. I reverted the FE code back a few days and it seems to behave the same + the code passed without problem locally, in the e2e env, as well as in staging.\
If we find out it is causing issue though, we can always revert.

11.40 NP reported to CSS users can’t login

11.41 NP It looked like it was just an issue with accessing the Sierra desktop app (people have been logging in to the Sierra web app). I've asked for more info on Sierra now.

11.42 RC reverted last PR in case it affected API errors

11.44 SB We're working on fixing the logging issue, it's probably caused by the logging cluster running out of storage.\
Not sure if the logging issue is related to the issues with the Catalogue API, but I can't see how those two would be connected

11.45 RC so there is the logging issue (kibana) and the log in issue (logging into the website), which I'm wondering if it's related to Sierra's issues.

11.45 NP opened statuspage incident

12.03 NP confirmed there’s no status page for Sierra and we have to hear from CSS for updates\
RC suggested a banner on the site to tell users of no requesting. NP confirmed on-site users know of the issue from library staff

12.06 RC debugging out loud\
<https://api.wellcomecollection.org/catalogue/v2/works/xswz3swa/items\\>
is it a valid url to query in the first place?\
it's getting

```
{
"errorType": "http",
"httpStatus": 403,
"label": "Invalid API key",
"description": "Forbidden",
"type": "Error"
}
```

12.09 AG this is fine for the same work/item <https://wellcomecollection.org/works/xswz3swa/items\\>
The url you're trying to hit needs an API key to access, so can't be done freely through the browser

12.10 RC We're getting similar ones repeatedly at the moment. The page is accessible <https://wellcomecollection.org/works/xswz3swa/items>

12.12 GE my understanding is that Auth0 needs to talk to Sierra

12.13 RC On the identity/aut0 page, we currently ask people to email the Library email when there's a problem with them logging in, that's interesting.

12.18 CSS report also having an issue with Sentry/Sierra so web site isn’t the only Sierra client with an issue. CSS have a call open with Sierra’s supplier

12.18 AG reported logging into library account.\
NP confirmed and could make a successful request.

12.20 RC The last \[API] alerts errors were fewer

12.22 AG I checked a few of the works that are throwing cloudfront errors, they're all findable/accessible online so could it just be some bots that are trying to go straight to api.wellcomecollection.org and failing for lack of api key (as they should!)

12.24 AG the /items endpoint does involve Sierra but I thought we were only making the call if the user was logged in

12.25 RC I'm relaunching my e2es and they pass, so I'll un-revert my PR.

12.37 CSS report the API issue appears to be resolved

12.39 NP closed statuspage incident

13.06 SB Logging cluster is back up and running. It's still deleting lots of documents in the background so it might be slower than usual, but I'm able to log in and use it normally now

## Analysis of causes

### What happened that we didn’t anticipate?

Sierra issue (via CSS): “D\&T Networks team made a change on the network earlier this morning to resolve another issue, and it appears that III needed to restart the API junction to pick up the change.” and “apparently a knock-on from a Wellcome firewall migration yesterday.” i.e. Thursday 25 July

### Why didn't our safeguards catch this?

We did get alerts re Sierra adapter and the catalogue API - but not until much later than the incident starting

Logging not available for looking at the issue - we were already aware of this and working to restore access.

Banner alert to tell site users that requesting/logging in wasn’t available

* Who decides when to do it? LE\&E managers? Us?
* Who puts it up? Editorial? Us?
* LE\&E managers unclear that Editorial can/should put up the alert

## Actions

**LB**

* Review current and desired behaviour for requestable items
* Review for optimisation of code/calls to Sierra

**SB**

* Investigate Slack alerts/dashboard for the logging cluster becoming unhealthy
* Document in GitBook how to address logging issues using the web-based console

**AG**

* Check logs (if they exist)
* What was wrong with Sierra and when - talk to Collections Systems Support

**NP**

* Talk to Digital Content Manager and LE\&E manager re putting up banners on the site for incidents
* Create incident reference page in Notion e.g. Prismic or non-Prismic banner (for if NP is not around/so everyone can see what to do)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.wellcomecollection.org/incident-retros/2024-07-25_users_cannot_login_to_wellcomcollection_org.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
