RFC 022: Logging
Last updated
Last updated
Last updated: 20 March 2020.
We use logs to monitor running services following the Log aggregation pattern.
Logs are human readable semi-structured text output. Having visibility of what our services are up lets us more easily make decisions about how to react to their behaviour.
A logging service should be easy to query, reliable, low cost and simple to configure for clients.
In order to see what's going on easily across the estate, logs from all applications should be searchable in one place.
We should annotate logs with:
The originating product, service and environment e.g. [catalogue-pipeline, id_minter, stage]
A release reference e.g. the container image tag items_api:31982ae519bc6a5856cf8c078573a320aaca69bf
Previously we have used the awslogs
Log Driver in ECS which pushes logs into CloudWatch Logs.
This mechanism allows the application to care very little about how logging is done. Configuration can be made part of a service infrastructure template.
We don't need to worry about running any of our own log collection and searching infrastructure.
We have found CloudWatch Logs to be expensive for us when publishing millions of log lines. We haven't found an easy way to query logs across services and over periods of time longer than a few days.
In order to reduce costs and allow us to use ElasticSearch a familiar and well suited log querying solution, we have moved to using an "ELK" (Elasticsearch, Logstash, Kibana) stack.
Each application incorporates the Logback library and must carry the configuration about how to log to Logstash. Each cluster has a logstash-transit
service to allow logs to be relayed to Elasticsearch running in Elastic Cloud.
The logstash-transit
service provides log collection for a particular cluster and handles Elastic Cloud authentication and holds that configuration in one place for that cluster.
Elasticsearch and Kibana provide an effective mechanism for aggregating and searching Logs. We have found that querying logs in this way is much faster and more effective than using CloudWatch.
We are managing a significant portion of our own logging infrastructure, using language specific libraries and running multiple log collectors.
Each application must contain the Logback library and be configured to send logs to the local logstash-transit
service. We don't have an effective mechanism for standardising the configuration of both applications and logstash services. As a consequence of inconsistent configuration, services logs are not reliably available across the platform.
There is also proliferation of logstash-transit
services in the platform adding cost and complexity.
If we can take advantage of the container level logging drivers available in ECS we can return to a position where our applications have no configuration for transporting logs elsewhere. This would eliminate some of the current configuration and infrastructure complexity.
At time of writing the logging drivers available for the AWS Fargate containers that make up the vast majority of our services are limited to awslogs
, splunk
, and awsfirelens
.
awsfirelens
is a solution for custom log routing that makes use of a Fluent Bit or Fluentd container as a sidecar to your application container.
AWS provides pre-built images for running both Fluent Bit and Fluentd in ECS and provide instructions for setting up custom log routing.
Firelens requires you to add a container to your task definition, though that could be provided with fluent bit configuration from the existing terraform-aws-ecs-service
module allowing us to standardise that configuration.
We may still need to provide a mechanism to forward logs collected to Elasticsearch.
AWS recommends using Fluent Bit as a log collector over Fluentd due to performance and resource considerations.
Fluent Bit is similar to Logstash in function and is capable of relaying logs to Logstash if required. See the Fluent Bit manual for details of input & output plugins.
At present we route logs through a cluster-local Logstash service that is configured to authenticate with a Elastic Cloud hosted Elasticsearch instance and forward logs there with several logstash-transit
services across the estate.
There are a few options for routing to Elasticsearch.
Directly to Elasticsearch from the Fluent Bit sidecar.
(pro) No other log forwarding infrastructure.
(con) Elasticsearch config would be required in each service.
Forwarding logs to a cluster-local collector as we do now.
(???) We can use the Fluent Bit forward
input and so replace Logstash with Fluent Bit which might be more reliable.
(con) Multiple ???-transit
services would still be required.
Provide a single collector service across the platform utilising AWS PrivateLink.
VPC Endpoint Services (AWS PrivateLink) allows you to share an NLB (Network Load Balancer) fronted service between VPCs and between AWS accounts. We have made use of AWS PrivateLink to share on-premises VPN accessible services (e.g. our Workflow service "Goobi") between accounts with success in the past.
Endpoint services would be provisioned alongside network and account configuration in the platform-infrastructure terraform stack.
(pro) One service with central configuration for Elasticsearch authentication.
(pro) We have experience with successfully using PrivateLink.
(con) Requires some log collector infrastructure and network configuration though centralised and with a single instance of the collector service.