
Incident retro - slow search



Incident from: 2021-06-28

Incident until: 2021-06-28

Retro held: 2021-06-29

Timeline

28 June 2021

11.51 AC kicked off reindex into 06-28 index

13.28 Reindex kicked off a second time

The reindex was later restarted a third time

15.16 Alerts: "Down: Front End Works Search (Origin)" and "Recovery: Front End Works Search (Origin)"

Cross-cluster replication (CCR) was paused, which reduced load on the API cluster and stopped the issue

15.21 Comms sent out via status page

Waited 5 mins to check the issue was resolved

15.33 Incident resolved

Analysis of causes

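  • It is hard to keep an eye on a reindex while it is running

  • Things looked okay in the morning, but only because the ingestor wasn’t doing anything, which wasn’t obvious at the time

  • Cross-cluster replication (CCR) was running whilst reindexing, which added load to the API cluster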

  • How easy is it to find performance metrics?

Actions

AC

  • Improve the process for following/unfollowing indices in cross-cluster replication
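
As a rough illustration of what a scripted pause/resume step could look like, the sketch below builds requests against Elasticsearch's CCR `pause_follow` and `resume_follow` endpoints. It is not the team's actual tooling: the function name `ccr_request`, the host and the follower index name are hypothetical, and authentication is left out.

```python
import urllib.request

# Only the two CCR actions relevant to pausing replication during a reindex.
ALLOWED_ACTIONS = ("pause_follow", "resume_follow")


def ccr_request(base_url: str, follower_index: str, action: str) -> urllib.request.Request:
    """Build (but do not send) a CCR request for one follower index.

    base_url and follower_index are placeholders; a real call would also
    need credentials for the follower cluster.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported CCR action: {action}")
    url = f"{base_url.rstrip('/')}/{follower_index}/_ccr/{action}"
    return urllib.request.Request(url, method="POST")


# Hypothetical usage: pause before a reindex, resume once it has finished.
pause = ccr_request("https://example.invalid:9243", "works-follower", "pause_follow")
resume = ccr_request("https://example.invalid:9243", "works-follower", "resume_follow")
# urllib.request.urlopen(pause) would actually send the request.
```

Keeping request construction separate from sending makes the step easy to review and test before it is pointed at a real cluster.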

RK

  • Document the process

  • Include links to performance metrics that work for anyone who needs them

JG

  • Propagate Elastic alerts to Slack:

    • Alert on CPU load

    • Alert on CCR
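
One possible shape for the CPU alert is sketched below, assuming a small script that reads the response of Elasticsearch's `GET /_nodes/stats/os` API and posts to a Slack incoming webhook. The retro does not say how the alerts were actually wired up (Elastic's built-in alerting may well be the better route), so the threshold, function names and webhook URL here are all hypothetical.

```python
import json
import urllib.request
from typing import List

CPU_THRESHOLD_PERCENT = 85  # hypothetical threshold, not taken from the retro


def cpu_alerts(node_stats: dict, threshold: int = CPU_THRESHOLD_PERCENT) -> List[str]:
    """Return one message per node whose OS CPU usage is at or above the threshold.

    node_stats is expected to have the shape of a `GET /_nodes/stats/os` response.
    """
    alerts = []
    for node_id, node in node_stats.get("nodes", {}).items():
        percent = node.get("os", {}).get("cpu", {}).get("percent")
        if percent is not None and percent >= threshold:
            name = node.get("name", node_id)
            alerts.append(f"Elastic node {name}: CPU at {percent}% (threshold {threshold}%)")
    return alerts


def slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a Slack incoming-webhook request carrying `text`."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

A CCR alert could follow the same pattern using the `GET /_ccr/stats` API, with whatever lag or failure condition the team decides is worth paging on.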
