APM

Some guidance on what you can do with the catalogue's APM (Application Performance Monitoring).

Some things you can track with APM:

  • Errors, with useful stack traces. You can see an incident (an elevated rate of 500s) that we had here. There's an alerting tool (Watcher) which can post to a Slack webhook; we should probably use this!

  • JVM stats: good for discovering memory leaks, and also fun for garbage collection enthusiasts. Here's an example.
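As a starting point for the Watcher idea, here is a minimal sketch of a watch body that checks for 5xx responses and posts to Slack. Several things here are assumptions, not our current setup: the `apm-*` index pattern, the `#hypothetical-alerts` channel, the threshold and window, and the helper name `build_5xx_watch` are all hypothetical. It counts 5xx transactions in a time window rather than computing a true error rate, which is a simplification. The Slack webhook URL itself is configured as a Slack account in the cluster's Watcher settings, not inside the watch. The `compare` condition assumes `hits.total` is rendered as a plain number in the watch payload, which depends on the Elasticsearch version.

```python
import json


def build_5xx_watch(channel, threshold=10, window="5m", index_pattern="apm-*"):
    """Build a Watcher watch body that posts to Slack when 5xx responses spike.

    Hypothetical helper: the index pattern, channel and thresholds are
    assumptions to adjust for our cluster. The Slack webhook URL is expected
    to be configured as a Watcher Slack account on the cluster side.
    """
    return {
        # Run the check once per `window`.
        "trigger": {"schedule": {"interval": window}},
        # Count APM transactions that returned a 5xx status within `window`.
        "input": {
            "search": {
                "request": {
                    "indices": [index_pattern],
                    "body": {
                        "size": 0,
                        "query": {
                            "bool": {
                                "filter": [
                                    {"term": {"processor.event": "transaction"}},
                                    {"range": {"http.response.status_code": {"gte": 500}}},
                                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                                ]
                            }
                        },
                    },
                }
            }
        },
        # Fire only when the 5xx count exceeds the threshold (a count, not a rate).
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": threshold}}},
        "actions": {
            "notify_slack": {
                # Avoid re-posting the same alert on every run.
                "throttle_period": "30m",
                "slack": {
                    "message": {
                        "to": [channel],
                        # Doubled braces in the f-string emit a literal {{...}} Mustache tag.
                        "text": f"Catalogue: {{{{ctx.payload.hits.total}}}} 5xx responses in the last {window}",
                    }
                },
            }
        },
    }


print(json.dumps(build_5xx_watch("#hypothetical-alerts"), indent=2))
```

The printed JSON could then be registered with `PUT _watcher/watch/<watch_id>`. Building it in code keeps the threshold and window in one place if we want to tune them later.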

The meat of APM is transaction monitoring, which for us means monitoring the performance of our endpoints. For example, for /works we can see quite a lot of data:

  • The average duration of a request is about 50ms

  • The 99th percentile is usually around 150ms, but is quite noisy

  • We have some outlier requests where it looks like the API stalls and/or the network connection to Elastic was very slow. These might be worth looking into!
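To make the difference between the average and the 99th percentile concrete, here is a minimal sketch that computes both from a list of request durations using the nearest-rank percentile method. The sample durations are made up for illustration, not taken from our APM data. Elasticsearch's own percentiles aggregation is approximate (TDigest by default), so its figures can differ slightly from an exact calculation like this one.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least p% of values at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarise(durations_ms):
    """Mean and p99 of request durations, in milliseconds."""
    return {
        "mean_ms": mean(durations_ms),
        "p99_ms": nearest_rank_percentile(durations_ms, 99),
    }


# Hypothetical sample: 98 typical requests plus two slow ones.
sample = [50.0] * 98 + [150.0, 4000.0]
print(summarise(sample))
```

This prints a mean of 90.5 ms and a p99 of 150.0 ms. With only 100 requests, p99 is effectively the second-slowest request, so one more 4000 ms outlier moves p99 from 150 ms to 4000 ms while the mean only rises to 130 ms. Small sample sizes like this can make a p99 line look noisy.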

All APM data is stored in Elastic, so we can do our own analyses; here's a dashboard for comparing the performance of aggregations and "normal" queries.
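For ad-hoc analyses outside a dashboard, here is a minimal sketch of an Elasticsearch search body that summarises durations for a single endpoint. It assumes the standard APM transaction fields (`processor.event`, `transaction.name`, `transaction.duration.us`, which is in microseconds). The transaction name `GET /works` and the helper name `transaction_duration_query` are hypothetical; the actual transaction names depend on how our APM agent names them, and the body would be sent to whatever index pattern our APM data lives under.

```python
import json


def transaction_duration_query(transaction_name, since="now-1d", percents=(50, 95, 99)):
    """Search body summarising durations for one APM transaction (hypothetical helper).

    Durations come back in microseconds, because that is the unit of
    transaction.duration.us; divide by 1000 for milliseconds.
    """
    return {
        # We only want the aggregations, not individual documents.
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"processor.event": "transaction"}},
                    {"term": {"transaction.name": transaction_name}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "aggs": {
            "avg_duration_us": {"avg": {"field": "transaction.duration.us"}},
            "duration_percentiles_us": {
                "percentiles": {
                    "field": "transaction.duration.us",
                    "percents": list(percents),
                }
            },
        },
    }


print(json.dumps(transaction_duration_query("GET /works"), indent=2))
```

Running the same body for two different transactions, for example one that uses aggregations and one that does not, is a simple way to reproduce the kind of comparison the dashboard makes.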
