
RFC 022: Logging



This RFC describes a proposal for how we log from our services, and how we collect and search those logs.

Last modified: 2020-06-29T18:14:44+01:00

We use logs to monitor running services, following the Log aggregation pattern.

Logs are human readable, semi-structured text output. For example:

2020-03-13 17:55:23.372 [apm-remote-config-poller] ERROR co.elastic.apm.agent.report.HttpUtils - Exception when closing input stream of HttpURLConnection.
2020-03-13 17:50:18.315 [apm-remote-config-poller] ERROR co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Read timed out

Having visibility of what our services are up to lets us more easily make decisions about how to react to their behaviour.

What we want

A logging service should be easy to query, reliable, low cost and simple to configure for clients.

In order to see what's going on easily across the estate, logs from all applications should be searchable in one place.

We should annotate logs with:

  • The originating product, service and environment e.g. [catalogue-pipeline, id_minter, stage]

  • A release reference e.g. the container image tag items_api:31982ae519bc6a5856cf8c078573a320aaca69bf
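As an illustration, an annotated log event stored in Elasticsearch might look like the following. The field names here are hypothetical, not an agreed schema; the values are taken from the examples above:

```json
{
  "@timestamp": "2020-03-13T17:55:23.372Z",
  "level": "ERROR",
  "message": "Exception when closing input stream of HttpURLConnection.",
  "product": "catalogue-pipeline",
  "service": "id_minter",
  "environment": "stage",
  "release": "id_minter:31982ae519bc6a5856cf8c078573a320aaca69bf"
}
```

Annotating every event with these fields would let us filter a single shared index by product, service, environment or release.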

What we've done so far

CloudWatch

Previously we used the awslogs log driver in ECS, which pushes logs into CloudWatch Logs.
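For reference, the awslogs driver is configured per container in the ECS task definition; a typical snippet (log group name illustrative) looks like:

```json
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "ecs/id_minter",
    "awslogs-region": "eu-west-1",
    "awslogs-stream-prefix": "ecs"
  }
}
```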

Pros

We don't need to worry about running any of our own log collection and searching infrastructure.

Cons

We have found CloudWatch Logs to be expensive for us when publishing millions of log lines. We haven't found an easy way to query logs across services and over periods of time longer than a few days.

Logback

In order to reduce costs and allow us to use a familiar and well-suited log querying solution, we moved to using an "ELK" (Elasticsearch, Logstash, Kibana) stack.

Each application incorporates the Logback library and must carry the configuration for how to log to Logstash. Each cluster has a logstash-transit service to allow logs to be relayed to Elasticsearch running in Elastic Cloud.

The logstash-transit service provides log collection for a particular cluster and handles Elastic Cloud authentication, holding that configuration in one place for that cluster.

Pros

Elasticsearch and Kibana provide an effective mechanism for aggregating and searching logs. We have found that querying logs in this way is much faster and more effective than using CloudWatch.

Cons

We are managing a significant portion of our own logging infrastructure, using language-specific libraries and running multiple log collectors.

Each application must contain the Logback library and be configured to send logs to the local logstash-transit service. We don't have an effective mechanism for standardising the configuration of both the applications and the logstash services. As a consequence of inconsistent configuration, service logs are not reliably available across the platform.

There is also a proliferation of logstash-transit services in the platform, adding cost and complexity.

Proposed solution

At the time of writing, the logging drivers available for the containers that make up the vast majority of our services (AWS Fargate) are limited to awslogs, splunk and awsfirelens.

If we can take advantage of the container-level logging drivers available in ECS, we can return to a position where our applications have no configuration for transporting logs elsewhere. This would eliminate some of the current configuration and infrastructure complexity.

Firelens

awsfirelens is a solution for custom log routing that makes use of a Fluent Bit or Fluentd container as a sidecar to your application container.

AWS provides pre-built images for running both Fluent Bit and Fluentd in ECS, along with instructions for setting up custom log routing.

Firelens requires you to add a container to your task definition, though that container could be provided, with its Fluent Bit configuration, by the existing terraform-aws-ecs-service module, allowing us to standardise that configuration.

Fluent Bit

AWS recommends using Fluent Bit as a log collector over Fluentd due to performance and resource considerations.

Fluent Bit is similar to Logstash in function and is capable of relaying logs to Logstash if required. See the Fluent Bit manual for details of its input and output plugins.

This mechanism allows the application to care very little about how logging is done: configuration can be made part of a service infrastructure template.

Routing logs to Elasticsearch

We may still need to provide a mechanism to forward the collected logs to Elasticsearch.

At present we route logs through a cluster-local Logstash service that is configured to authenticate with an Elastic Cloud-hosted Elasticsearch instance and forward logs there; there are several of these logstash-transit services across the estate.

There are a few options for routing logs to Elasticsearch:

  • Directly to Elasticsearch from the Fluent Bit sidecar.

    • (pro) No other log forwarding infrastructure.

    • (con) Elasticsearch config would be required in each service.

  • Forwarding logs to a cluster-local collector, as we do now.

    • (con) Multiple cluster-local transit services would still be required.

  • Providing a single collector service across the platform, utilising AWS PrivateLink.

    • (pro) One service with central configuration for Elasticsearch authentication.

    • (pro) We have experience with successfully using PrivateLink.

    • (con) Requires some log collector infrastructure and network configuration, though centralised and with a single instance of the collector service.

We can use the Fluent Bit forward input, and so replace Logstash with Fluent Bit, which might be more reliable.
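A minimal sketch of what a Fluent Bit collector could look like in place of logstash-transit, assuming the forward input on its standard port and Elastic Cloud credentials injected as environment variables (the host and index names are illustrative, not our actual configuration):

```ini
# Accept logs forwarded from the Fluent Bit sidecars in each service
[INPUT]
    Name    forward
    Listen  0.0.0.0
    Port    24224

# Write everything to the Elastic Cloud hosted Elasticsearch cluster
[OUTPUT]
    Name        es
    Match       *
    Host        our-deployment.es.eu-west-1.aws.found.io
    Port        9243
    HTTP_User   ${ES_USER}
    HTTP_Passwd ${ES_PASSWORD}
    Index       service-logs
    tls         On
```

Keeping the Elasticsearch authentication in this one place is what lets the per-service sidecars stay free of Elastic Cloud configuration.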

VPC Endpoint Services (AWS PrivateLink) allow you to share an NLB (Network Load Balancer) fronted service between VPCs and between AWS accounts. We have successfully used AWS PrivateLink in the past to share on-premises, VPN-accessible services (e.g. our Workflow service "Goobi") between accounts.

Endpoint services would be provisioned alongside network and account configuration in the platform-infrastructure Terraform stack.
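As a sketch (resource and variable names here are hypothetical), the Terraform for sharing the collector via PrivateLink might look like:

```hcl
# Expose the central log collector's NLB as a VPC Endpoint Service.
resource "aws_vpc_endpoint_service" "logging" {
  acceptance_required        = false
  network_load_balancer_arns = [aws_lb.log_collector.arn]

  # AWS accounts allowed to create interface endpoints for this service.
  allowed_principals = ["arn:aws:iam::123456789012:root"]
}

# In each consuming VPC/account: an interface endpoint to reach the collector.
resource "aws_vpc_endpoint" "logging" {
  vpc_id             = var.vpc_id
  service_name       = aws_vpc_endpoint_service.logging.service_name
  vpc_endpoint_type  = "Interface"
  subnet_ids         = var.private_subnet_ids
  security_group_ids = [aws_security_group.logging_endpoint.id]
}
```

Each consuming account would point its Fluent Bit sidecars at the DNS name of its interface endpoint, so log traffic never leaves the AWS network.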
