Incident retro - requests not showing in account


Last updated 10 months ago

Incident from: 2022-05-13

Incident until: 2022-05-13

Retro held: 2022-05-13

Timeline

13 May 2022

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1652434124540789

01.33 Viewing requests, making requests, and checking whether an item is requestable are all unavailable

08.13 AC: errors in the identity (frontend) app

10.24 AC: 500s from the identity API (can’t view item requests)

10.28 JP: can’t view item requests on prod, nor make a request

10.31 JP checked the API and the authorizer; neither looked suspicious

10.37 An SSL certificate problem identified as the root cause. Attempting to fix by creating a new certificate in the console.

10.47 The cert validation record exists but AWS does not accept it

10.51 AWS Certificate Manager cert validation most likely the underlying cause. JP created the record set.

10.52 “One or more domain names has failed validation due to a certificate authority authentication (CAA) error. Learn more.”

10.52 AC: At some point AWS “forgets” your validation records and stops renewing certs

10.59 JP Fixed. Confirmed by AC/NP
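
The CAA error quoted at 10.52 refers to DNS CAA records, which list the certificate authorities allowed to issue for a domain. As a rough illustration of the rule involved, the sketch below checks whether a set of CAA records would permit an Amazon CA. The function name and the tuple layout are hypothetical, the check is simplified (it ignores `issuewild` records and the lookup of parent domains), and the retro does not record which records were actually present.

```python
# CAA issuer domains that AWS documents for Amazon-issued certificates.
AMAZON_CAA_DOMAINS = {"amazon.com", "amazontrust.com", "awstrust.com", "amazonaws.com"}


def acm_allowed_by_caa(caa_records):
    """Return True if the given CAA records would let an Amazon CA issue.

    caa_records is a list of (flags, tag, value) tuples (hypothetical layout),
    e.g. [(0, "issue", "amazon.com")].
    Simplified: only "issue" tags are considered.
    """
    issuers = [
        value.split(";")[0].strip().lower()  # drop optional parameters after ';'
        for _flags, tag, value in caa_records
        if tag.lower() == "issue"
    ]
    if not issuers:
        # No "issue" records at all: any CA may issue.
        return True
    return any(issuer in AMAZON_CAA_DOMAINS for issuer in issuers)
```

A domain with no CAA records passes this check, while a domain whose only `issue` record names a different CA does not.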

Analysis of causes

The SSL certificate had expired

SSL certificate wasn’t automatically renewed
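
One way to catch a missed renewal before it causes an outage is to check how many days remain on the certificate a host is serving. The following is a minimal sketch using only the Python standard library; the function names and the idea of running it as a scheduled check are assumptions, not something the retro describes.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter time.

    not_after uses the format returned by ssl's getpeercert(),
    e.g. "May 13 00:00:00 2022 GMT". A negative result means expired.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter string of the certificate served by host:port.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError here instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))` and alert when the result falls below a chosen threshold.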

Also to be looked at: noisy alerts channel

Actions

Jamie

Alex

  • Add a CloudWatch log URL to alerts that takes you to the right account, with text added to help with debugging

  • Turn on CloudFront logging (with filtering)

Handle identity API proxy errors which don’t have a response - DONE

#7970