Incident Retros
  • Incident retro - Internal model <=> Elastic index sync question
  • Incident retro - merging
  • Incident retro - Miro images
  • Incident retro - January downtime
  • Incident retro - Elastic Cloud
  • Incident retro - stories and home page down
  • Incident retro - search not available
  • Incident retro - ingestors
  • Incident retro - home page with json
  • Incident retro - slow search
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - cross cluster replication
  • Incident retro - 500'ing on the /images endpoint
  • Incident retro - home page and stories page not available
  • Incident retro - home page and what's on not available
  • Incident retro - requests not showing in account
  • Incident retro - requests not showing in account
  • Incident retro - works search errors
  • Incident retro - date picker
  • Incident retro - requests not showing in account
  • Incident retro - reporting cluster downtime and configuration loss
  • Incident retro - story page appearing then replaced by a 404
  • Incident retro - increased rate of errors in searches on wellcomecollection.org
  • Incident retro - slow search due to 900k messages on the ingestor queue
  • Incident retro - concept pages not available
  • Incident retro - Prismic model changes
  • Incident retro - Images search down
  • Incident retro - wc.org intermittently available
  • Incident retro - web site not available
  • Incident retro - search not available
  • Incident retro - search not available
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - users cannot login to their accounts on wellcomecollection.org
  • Incident retro - digital assets not available
Powered by GitBook
On this page
  • Timeline
  • Analysis of causes
  • Actions

Incident retro - search not available

PreviousIncident retro - slow searchNextIncident retro - search not available

Last updated 10 months ago

Incident from: 2021-07-27

Incident until: 2021-07-27

Retro held: 2021-07-29

Timeline

27 July 2021

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1627393042010200

14.37 Updown alert DOWN for:

  • Front End Works Search (Origin)

  • Works API Single Work

  • Works API Search

  • Images API Single Image

  • Images API Search

  • Front End Single Work Page (Origin)

  • Front End Single Work Page (Cached)

  • Front End Works Search (Cached)

  • Wellcome Images Redirect

JP asked if anyone knew what was going on

JG: I just ran the user script RK: which one?

14:38 Does that update existing users ./scripts/create_elastic_users_catalogue.py https://github.com/wellcomecollection/catalogue-api/blob/main/terraform/shared/scripts/create_elastic_users_catalogue.py

14.39 JP: it does [update existing users]; it’s an elastic auth error

Sending HTTP 500 from ElasticsearchErrorHandler$ (Unknown error; ElasticError(security_exception,unable to authenticate user [search] for REST request

RK: cycling the api services will cause them to come back with the new secrets?

14.40 JG: That should work Or we can try update the passowrd, but I am not sure where from. Happy to unless someone else is doing it>?

Done by RK?

14.43 RK: the logs look ok have to wait for the new tasks to register JP: I think we need to wait for the target groups to become healthy, no?

14.44 Statuspage alert sent

14: 45 RK did the same for items and staging

14.44 Updown alert UP for:

  • Front End Single Work Page (Origin)

  • Images API Search

  • Front End Works Search (Origin)

  • Works API Single Work

  • Works API Single Image

  • Works API Search

  • Wellcome Images Redirect

  • Front End Single Work Page (Cached)

  • Front End Works Search (Cached)

14.46 JG: Do we know why stage-search is still 3?

RK: because it never got scaled back down?

14.47 Statuspage alert sent - issue resolved

RK: related discussion: https://wellcome.slack.com/archives/C3TQSF63C/p1625487082309500

RK: I’m very tempted to remove the terraform script provisioner for the catalogue-api work, as I think there are some situations where it is dangerous. We could for example update the password for search inadvertently by updating a dependent TF resource - that would trigger a password update that running services would not pick up automatically.

that script should prevent you from dropping the old passwords if they exist, or at least prompt you.

Analysis of causes

  • We ran a user script that updated users without fully understanding what the script did, and had stuff in it that would break

Actions

JG

  • https://github.com/wellcomecollection/catalogue-api/pull/212 - DONE

  • https://github.com/wellcomecollection/catalogue-api/pull/208 - DONE

  • https://github.com/wellcomecollection/catalogue-api/pull/211 - DONE

  • Continue to work on the user provisioning so that it works safely as expected, including password rotation

RK

  • Alerts for requesting and items

JP

  • Scale down the stage catalogue API ECS services

  • Investigate why desired_count in ECS services does not update ECS

Timeline
Analysis of causes
Actions