
Incident retro - search not available



Incident from: 2024-02-25

Incident until: 2024-02-26

Retro held: 2024-02-26

Timeline

See https://wellcome.slack.com/archives/C01FBFSDLUA/p1708937288732099

25 February 2024

17.11 Updown alert down: What’s on, Exhibition, Concept

17.12 Up: What’s on, Exhibition, Concept

22.09 Updown alert down: What’s on

Up: What’s on, plus numerous CloudFront 5xx errors detected

26 February

08.48 RK Getting 50x errors for the collection search, and there are lots of errors in the alerts channel. [confirmed by DM and NP]

08.51 RK Lots of load on the API this morning. These are log events only, so that might just be the errors. Edit: it is; not especially high load, just a lot of errors starting around 1am.

08.59 RK Lots of errors, and they all look like they are coming from the same container id: 58877c4d38434478a490e29e745f360b. I'm going to kill the bad task and see if that clears things up. A bit suspicious that the task all the errors are coming from is 4 days old while the others are 7 hours old.

09.03 RK Also worth repeating that bytespider traffic is back somehow. This is a search for the user-agent across all services, but the logs are all from frontend-prod.

09.07 RK I killed the task that was throwing errors and it looks like that may have resolved the problem. NP/JC confirm no errors when searching.
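For reference, stopping a misbehaving ECS task can also be scripted rather than done by hand. A minimal Scala sketch using the AWS SDK v2, where the cluster name is a hypothetical placeholder and the container id seen in the logs stands in for the task identifier:

```scala
import software.amazon.awssdk.services.ecs.EcsClient
import software.amazon.awssdk.services.ecs.model.StopTaskRequest

object StopBadTask {
  def main(args: Array[String]): Unit = {
    val ecs = EcsClient.create()
    try {
      // Stop the single bad task; the ECS scheduler then starts a
      // replacement to satisfy the service's desired count.
      ecs.stopTask(
        StopTaskRequest.builder()
          .cluster("catalogue-api") // hypothetical cluster name
          .task("58877c4d38434478a490e29e745f360b") // id taken from the error logs
          .reason("Stuck task returning ES errors (2024-02-26 incident)")
          .build()
      )
    } finally ecs.close()
  }
}
```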

09.08 RK Events from the search-api service: you can see 2 of the 3 tasks throwing errors drop off at about 1am, to be replaced by healthy tasks, and me killing the last one just now.

09.10 RK Working hypothesis is that high load caused the search service to get into a bad state, and our health-checks are not good enough to recognise that this last task needed booting (though I suspect recent changes made them better).

09.17 RK To recap: over the weekend there were updown notifications yesterday at ~5pm and ~10pm (see the alerts channel) that are in line with the increased traffic from the bytespider bot. Then from approximately 1:30am we start seeing errors in the search API, and at about 2am 2 of the 3 tasks associated with the search service restarted.

09.26 RK Digging into the errors on the troubled task:

com.sksamuel.elastic4s.http.JavaClientExceptionWrapper: java.util.concurrent.CancellationException: Request execution cancelled

Looks like an issue talking to ES.

Analysis of causes

  • High load caused the search service to get into a bad state, and our health-checks are not good enough to recognise that this last task needed booting

  • Three tasks were serving search. Two were restarted after failing a load balancer health check, but the third wasn't (it was healthy enough to look as if it was alright).
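A minimal sketch of the kind of deeper health check discussed in the actions below, assuming a check endpoint wired into the load balancer's target group; the plain-HTTP ping of Elasticsearch's _cluster/health endpoint stands in for whatever the search API's elastic4s client would actually do, and the URL handling and names are illustrative:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration

// Deep health check: only report healthy if the task can actually reach
// Elasticsearch, so a task that is up but unable to serve searches fails
// its checks and gets recycled by the scheduler.
object DeepHealthCheck {
  private val http = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(2))
    .build()

  // esBaseUrl is an assumption; the real search API talks to ES via elastic4s.
  def isHealthy(esBaseUrl: String): Boolean = {
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$esBaseUrl/_cluster/health"))
      .timeout(Duration.ofSeconds(2))
      .GET()
      .build()
    try {
      val response = http.send(request, HttpResponse.BodyHandlers.ofString())
      response.statusCode() == 200
    } catch {
      // Connection failures and timeouts mean this task cannot serve searches.
      case _: Exception => false
    }
  }
}
```

If the load balancer health check hits an endpoint backed by something like this, a task that is running but cannot reach ES is replaced automatically rather than needing to be killed by hand.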

Actions

Robert

  • Investigate why bot traffic is still reaching our service

Natalie

  • Take to planning: extend load balancer health checks, or make the search API fail if it can’t connect to ES (including investigating “Elasticsearch timeout can be fatal to the ingestor” #2268)

Agnes

  • Check if updown checks the catalogue API and reports that in Slack

Related to: Elasticsearch timeout can be fatal to the ingestor (#2268)