Incident retro - slow search

Incident from: 2021-06-28

Incident until: 2021-06-28

Retro held: 2021-06-29

Timeline

28 June 2021

11.51 AC kicked off reindex into 06-28 index

13.28 Reindex kicked off a second time

Restarted later for a third time

15.16 Alert Down: Front End Works Search (Origin) Recovery: Front End Works Search (Origin)

CCR was paused, which reduced load on API cluster, which caused the issue to stop

15.21 Comms sent out via status page

Waited 5 mins to check the issue was resolved

15.33 Incident resolved

Analysis of causes

  • Hard to keep an eye on reindexing

  • Looked okay in the morning, because the ingestor wasn’t doing anything, which wasn’t obvious

  • Cross-cluster replication whilst reindexing

  • How easy is it to find performance metrics?

Actions

AC

  • Improve the following/unfollowing process

RK

  • Document the process

  • Include links to performance metrics that work for anyone who needs them

JG

  • Propagate Elastic alerts to Slack:

    • Alert on CPU load

    • Alert on CCR

Last updated