# Incident retro - slow search due to 900k messages on the ingestor queue

**Incident from:** 2022-10-04

**Incident until:** 2022-10-04

**Retro held:** 2022-10-04

* [Timeline](#timeline)
* [Analysis of causes](#analysis-of-causes)
* [Actions](#actions)

## Timeline

4 October 2022

See <https://wellcome.slack.com/archives/C01FBFSDLUA/p1664871483404409>

\~02.00 900k messages on the ingestor queue

09.00 User reported issue with search and posted an internal server error image. Very slow search confirmed by NP.

09.25 AC: cluster claims healthy, but I’d guess it’s under sustained load somehow

09.36 AC: issue found. The works ingestor is hammering the cluster

09.38 AC: okay, think I've applied a fix\
the basic issue is that there are 900k messages on the ingestor queue\
the ingestor is the app that populates the API index\
if we send too many writes to Elasticsearch, the cluster will struggle to respond to incoming requests (aka users)\
and it's been saturated since \~2am this morning, which is when a Calm \~> Sierra record harvest occurs (this is to allow items catalogued in Calm to be ordered through Sierra)\
most of those are a no-op change for us, the only update is the "last synced from Calm" field which we don't expose on the front-end, so we filter out the errors\
but this tiny change will have caused everything to get re-sent: <https://github.com/wellcomecollection/catalogue-pipeline/pull/2212>\
\#2212 MeSH, not MESH\
If you look at the NLM website, it's a lowercase 'e'

09.42 AC: API seems to be back for me

09.44 NP confirmed that search is running fast again

## Analysis of causes

Unexpected load from the overnight Sierra harvest put 900k messages on the ingestor queue.\
The ingestor is the app that populates the API index.\
If we send too many writes to Elasticsearch, the cluster struggles to respond to incoming requests (i.e. users), which is why search was slow.

## Actions

* If the queue crosses a threshold (e.g. 1.5 million messages), stop the ingestor and send a message to Slack (check SQS metrics to determine the threshold). Discussed, but we decided not to do this.
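
For reference, the queue guard that was discussed (and not adopted) could be sketched roughly as below. This is an illustration only: `check_ingestor_queue`, `GuardResult`, and the three callables are hypothetical names, and the threshold is the example figure from the action rather than a value derived from SQS metrics. In practice the queue size would come from something like the SQS `ApproximateNumberOfMessages` queue attribute.

```python
from dataclasses import dataclass
from typing import Callable

# Example figure from the retro; the real value would be chosen from SQS metrics.
QUEUE_THRESHOLD = 1_500_000


@dataclass
class GuardResult:
    queue_size: int
    paused: bool


def check_ingestor_queue(
    get_queue_size: Callable[[], int],
    stop_ingestor: Callable[[], None],
    notify_slack: Callable[[str], None],
    threshold: int = QUEUE_THRESHOLD,
) -> GuardResult:
    """Stop the ingestor and alert Slack if the queue is over the threshold."""
    size = get_queue_size()
    if size > threshold:
        stop_ingestor()
        notify_slack(
            f"Ingestor stopped: {size:,} messages on the queue "
            f"(threshold {threshold:,})"
        )
        return GuardResult(queue_size=size, paused=True)
    return GuardResult(queue_size=size, paused=False)


# Usage with stand-in callables that just record what happened.
events: list[str] = []
over = check_ingestor_queue(
    get_queue_size=lambda: 2_000_000,
    stop_ingestor=lambda: events.append("stop"),
    notify_slack=events.append,
)
under = check_ingestor_queue(
    get_queue_size=lambda: 900_000,
    stop_ingestor=lambda: events.append("stop"),
    notify_slack=events.append,
)
```

Note that with the example threshold of 1.5 million, the 900k backlog from this incident would not have tripped the guard, which is one reason the threshold would need to be set from observed metrics.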

**Alex**

* Investigate why a label change caused this problem

**Paul**

* Investigate the retention time on the queues when not reindexing

