Merikanto

A flute and a sword, the ambitions of a lifetime; fifteen years of wild repute, all squandered

AWS - 08 Monitoring Services

In the eighth post of the AWS series, we’re going to talk about three monitoring services today:

  • CloudWatch (Metrics & Logs)
  • X-Ray (Traces)
  • CloudTrail (Audit)


General

Observability:

  • CloudWatch (Metrics & Logs)
  • X-Ray (Traces)

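CloudTrail only records API calls made in your AWS account (who called what, and when). X-Ray traces requests as they travel through your application and the AWS services it uses.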


When errors occur intermittently, it’s better to collect and aggregate the results at regular intervals and then send the aggregated data to CloudWatch, rather than publishing every occurrence individually.
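
A minimal sketch of that aggregate-then-publish idea, assuming Python with boto3 for the actual upload. The metric name `ErrorCount` and the namespace `MyApp/Errors` are made up for illustration; `put_metric_data` and the `StatisticValues` fields are the real CloudWatch API shape.

```python
from typing import Dict, List


def build_statistic_set(samples: List[float], metric_name: str = "ErrorCount") -> Dict:
    """Collapse the samples gathered during one interval into a single datum."""
    if not samples:
        raise ValueError("need at least one sample to aggregate")
    return {
        "MetricName": metric_name,  # hypothetical metric name
        "Unit": "Count",
        "StatisticValues": {
            "SampleCount": float(len(samples)),
            "Sum": float(sum(samples)),
            "Minimum": float(min(samples)),
            "Maximum": float(max(samples)),
        },
    }


def publish(datum: Dict, namespace: str = "MyApp/Errors") -> None:
    """Upload one aggregated datum. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the aggregation works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=[datum])


# Error counts observed in five consecutive polling intervals.
datum = build_statistic_set([0, 3, 0, 1, 0])
print(datum["StatisticValues"])
```

One call to `publish(datum)` per reporting period then replaces five separate uploads.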


X-Ray:

  • Trace and analyze user requests as they travel through API Gateway APIs to the underlying services
  • API Gateway supports AWS X-Ray tracing for all API Gateway endpoint types (regional, edge-optimized, and private)
  • X-Ray gives you an end-to-end view of an entire request, so you can analyze latencies in your APIs and their backend services

You can use an X-Ray service map to view the latency of an entire request, and latency of the downstream services integrated with X-Ray. And you can configure sampling rules to tell X-Ray which requests to record, at what sampling rates, according to criteria that you specify.
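
To make the sampling idea concrete, here is a rough sketch of how a rule with a per-second reservoir plus a percentage rate could decide which requests to record. This is a simplified model written for these notes, not the X-Ray SDK's implementation; the class and field names are assumptions.

```python
import random
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SamplingRule:
    fixed_target: int  # requests recorded per second before the rate applies
    rate: float        # fraction of the remaining requests to record
    _used: Dict[int, int] = field(default_factory=dict, repr=False)

    def should_sample(self, now_second: int, rng: random.Random) -> bool:
        used = self._used.get(now_second, 0)
        if used < self.fixed_target:
            # Reservoir for this second is not exhausted yet: always record.
            self._used[now_second] = used + 1
            return True
        # Reservoir is used up: record only a random fraction of requests.
        return rng.random() < self.rate


rng = random.Random(0)
rule = SamplingRule(fixed_target=1, rate=0.05)
decisions = [rule.should_sample(now_second=0, rng=rng) for _ in range(100)]
print(sum(decisions), "of 100 requests recorded")
```

The reservoir guarantees at least some traces even at low traffic, while the rate keeps cost bounded when traffic is high.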


Differences:

  • CloudTrail is primarily used for API logging of all of your AWS resources

  • CloudWatch is a monitoring and management service. It does not have the capability to trace and analyze user requests as they travel through APIs

  • VPC flow logs enable you to capture information about the IP traffic going to and from network interfaces in your entire VPC

Although VPC Flow Logs can capture some details about incoming user requests, X-Ray is the better choice for debugging and analyzing microservices applications: request tracing lets you find the root cause of errors and performance problems.



CloudWatch (Metrics & Logs)

In essence, CW is a metric repository


  • Monitoring tool for your AWS resources and applications
  • CW metrics are not shared across regions
  • Display metrics & create alarms that watch the metrics and send notifications or automatically make changes to the resources you are monitoring, when a threshold is breached

Concepts

  • Namespaces: Container for CW metrics
  • Metrics: ordered time-series data
    • Cannot be deleted, but auto expire after 15 months
    • Each metric data point is marked with a timestamp
    • Custom metrics: publish your own application metrics (detailed monitoring is a different feature: it publishes EC2 metrics at 1-minute instead of 5-minute intervals)
    • EC2 metrics: CW does not collect memory utilization or disk space usage automatically. Install the CloudWatch Agent on your instances first to collect these metrics
  • Dimension: Name-value pair that is part of the identity of a metric
  • Statistics: metric data aggregation
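
A small in-memory sketch of how these concepts fit together: a metric is identified by namespace + name + dimensions, holds timestamped data points, and statistics are aggregations over them. The store and helper names are invented for illustration, and the instance IDs are placeholders; `AWS/EC2`, `CPUUtilization` and `InstanceId` are real CloudWatch names.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

MetricKey = Tuple[str, str, Tuple[Tuple[str, str], ...]]


def metric_key(namespace: str, name: str, dimensions: Dict[str, str]) -> MetricKey:
    """Namespace, metric name and dimensions together identify one metric."""
    return (namespace, name, tuple(sorted(dimensions.items())))


# Each metric maps to a list of (timestamp, value) data points.
store: Dict[MetricKey, List[Tuple[int, float]]] = defaultdict(list)


def put(namespace: str, name: str, dimensions: Dict[str, str], timestamp: int, value: float) -> None:
    store[metric_key(namespace, name, dimensions)].append((timestamp, value))


def statistics_for(key: MetricKey) -> Dict[str, float]:
    """Aggregate all data points of one metric."""
    values = [value for _, value in sorted(store[key])]
    return {
        "SampleCount": len(values),
        "Sum": sum(values),
        "Average": mean(values),
        "Minimum": min(values),
        "Maximum": max(values),
    }


put("AWS/EC2", "CPUUtilization", {"InstanceId": "i-aaa"}, 0, 10.0)
put("AWS/EC2", "CPUUtilization", {"InstanceId": "i-aaa"}, 60, 30.0)
put("AWS/EC2", "CPUUtilization", {"InstanceId": "i-bbb"}, 0, 90.0)

key_a = metric_key("AWS/EC2", "CPUUtilization", {"InstanceId": "i-aaa"})
stats_a = statistics_for(key_a)
print(len(store), "distinct metrics;", stats_a)
```

The same metric name with a different dimension value ends up as a separate metric, which is why the two instances do not share data points.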

CW Events (now Amazon EventBridge)

  • Deliver near real-time stream of system events that describe changes in AWS resources
  • Events: change in the AWS environment
  • Targets: process events
  • Rules: Match incoming events and route them to targets for processing
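
The event / rule / target relationship can be sketched with a deliberately simplified pattern matcher. It only supports exact-value matching, where each pattern field lists the accepted values; real event patterns support more operators. The function names and the instance IDs are assumptions for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """True if every field in the pattern is present in the event with an accepted value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


handled: List[Dict[str, Any]] = []

# One rule: EC2 state changes to "stopped" or "terminated" go to one target.
rules: List[Tuple[Dict[str, Any], Callable[[Dict[str, Any]], None]]] = [
    ({"source": ["aws.ec2"], "detail": {"state": ["stopped", "terminated"]}}, handled.append),
]


def route(event: Dict[str, Any]) -> None:
    for pattern, target in rules:
        if matches(pattern, event):
            target(event)


route({"source": "aws.ec2", "detail": {"instance-id": "i-aaa", "state": "stopped"}})
route({"source": "aws.ec2", "detail": {"instance-id": "i-bbb", "state": "running"}})
print(len(handled), "event(s) delivered to the target")
```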

CW Logs

  • Monitor logs from EC2 instances in near real time
  • Monitor events logged by CloudTrail
  • By default, logs are kept indefinitely and never expire (retention is configurable per log group)
  • CloudWatch Logs Insights: interactively search and analyze your log data in CloudWatch Logs using queries

CW Agent

  • Collect more logs and system-level metrics from EC2 instances and your on-premises servers
  • Needs to be installed first

Security

  • IAM users / roles
  • Dashboard permissions, IAM identity-based policies, service-linked roles


X-Ray (Performance Monitoring)

  • X-Ray analyzes and debugs apps, such as those built using a microservices architecture. With X-Ray, you can identify performance bottlenecks, edge case errors, and other hard to detect issues
  • X-Ray daemon buffers segments in a queue, and uploads them to X-Ray in batches
    • Listens for UDP traffic (port 2000)
    • Gathers raw segment data
    • Relays to X-Ray API

X-Ray SDK does not send data directly to X-Ray!

  • To avoid calling the service every time your application serves a request, the SDK sends the trace data to a daemon, which collects segments for multiple requests and uploads them in batches.

  • To instrument an application hosted on an EC2 instance, the X-Ray daemon must be running on that instance. A user data script can install and start the daemon automatically when you launch the instance.

To use the daemon on Amazon EC2, create a new instance profile role or add the managed policy (AWSXRayDaemonWriteAccess) to an existing one. This grants the daemon permission to upload trace data to X-Ray.
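
To illustrate the SDK-to-daemon hop, here is a rough sketch of what handing a segment to a local daemon over UDP port 2000 could look like. In practice the X-Ray SDK does this for you; the service name `checkout-api` is made up, and the sketch assumes the daemon listens on 127.0.0.1.

```python
import json
import os
import socket
import time

DAEMON_ADDRESS = ("127.0.0.1", 2000)          # assumed local daemon address
HEADER = '{"format": "json", "version": 1}'   # header line that precedes the segment document


def new_trace_id(now: float) -> str:
    """Trace ID: version, epoch seconds as 8 hex digits, 24 random hex digits."""
    return f"1-{int(now):08x}-{os.urandom(12).hex()}"


def build_segment(name: str, start: float, end: float) -> dict:
    return {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": new_trace_id(start),
        "start_time": start,
        "end_time": end,
    }


def to_datagram(segment: dict) -> bytes:
    """Header line, newline, then the segment as JSON."""
    return (HEADER + "\n" + json.dumps(segment)).encode("utf-8")


def send(segment: dict, address=DAEMON_ADDRESS) -> None:
    """Fire-and-forget UDP send; the daemon batches and uploads to X-Ray."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(to_datagram(segment), address)


now = time.time()
segment = build_segment("checkout-api", start=now - 0.2, end=now)  # hypothetical service name
datagram = to_datagram(segment)
print(datagram.decode("utf-8"))
```

Because the send is a local UDP datagram, the application never waits on a network call to the X-Ray API while serving a request.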


Amazon Inspector: Automated security assessment service that helps improve the security and compliance of applications deployed on AWS


Concepts

  • Segment: Provides the name of the compute resources running your application logic, details about the request sent by your application, and details about the work done

  • X-Ray uses the data that your application sends to generate a service graph (JSON document). Each AWS resource that sends data to X-Ray appears as a service in the graph

  • Trace collects all the segments generated by a single request

    The request is typically an HTTP GET or POST request that travels through a load balancer, hits your application code, and generates downstream calls to other AWS services or external web APIs

  • Use filter expression for advanced tracing

  • Groups are a collection of traces that are defined by a filter expression (identified by name or ARN)

  • 🧡 Annotations are simple key-value pairs that are indexed for use with filter expressions. Use annotations to record data that you want to use to group traces

    • A segment can contain multiple annotations
    • System-defined annotations include data added to the segment by AWS services, whereas user-defined annotations are metadata added to a segment by a developer
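
A sketch of how annotations on a segment relate to filter expressions. The helper names and the `customer_tier` key are invented for illustration; the key rule (letters, digits and underscores only) and the `annotation.key = "value"` form follow the X-Ray documentation as I understand it.

```python
import json
import re
from typing import Dict, Union

AnnotationValue = Union[str, int, float, bool]
_VALID_KEY = re.compile(r"^[A-Za-z0-9_]+$")


def add_annotations(segment: dict, annotations: Dict[str, AnnotationValue]) -> dict:
    """Return a copy of the segment with validated annotations merged in."""
    for key, value in annotations.items():
        if not _VALID_KEY.match(key):
            raise ValueError(f"annotation key {key!r} must be alphanumeric or underscore")
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"annotation {key!r} must be a string, number or boolean")
    merged = dict(segment)
    merged["annotations"] = {**segment.get("annotations", {}), **annotations}
    return merged


def filter_expression(key: str, value: str) -> str:
    """Filter expression selecting traces whose annotation equals a string value."""
    return f"annotation.{key} = {json.dumps(value)}"


segment = {"name": "checkout-api"}  # hypothetical, trimmed-down segment
annotated = add_annotations(segment, {"customer_tier": "gold", "retry_count": 2})
expression = filter_expression("customer_tier", "gold")
print(annotated["annotations"])
print(expression)
```

A group defined with such an expression would then collect every trace carrying that annotation value.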

Features

  • X-Ray can be used with Lambda, EC2, ECS, Beanstalk (integrate the X-Ray SDK in the application, and install the X-Ray daemon)
  • Provide end-to-end, cross-service, application-centric view of requests flowing through your application, by aggregating the data gathered from individual services of the application into a single unit called trace
  • X-Ray SDK captures metadata for requests made to RDS & DynamoDB, and SQS & SNS
  • Set trace sampling rate: X-Ray continually traces requests made to the application, and stores a sampling of the requests for analysis
  • X-Ray creates a map of services used by your application with trace data

Workflow

  • X-Ray receives data from services as segments
  • X-Ray then groups segments that have a common request into traces
  • X-Ray processes the traces to generate a service graph that provides a visual representation of your application.
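
The three steps above can be sketched in a few lines. This is a simplified model: it links services through a `parent_id` on each segment, whereas real X-Ray also uses subsegments to infer downstream calls. The service names and IDs are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def group_into_traces(segments: List[dict]) -> Dict[str, List[dict]]:
    """Step 2: segments that share a trace ID belong to the same request."""
    traces: Dict[str, List[dict]] = defaultdict(list)
    for segment in segments:
        traces[segment["trace_id"]].append(segment)
    return dict(traces)


def service_graph(traces: Dict[str, List[dict]]) -> Set[Tuple[str, str]]:
    """Step 3: derive caller -> callee edges between services."""
    edges: Set[Tuple[str, str]] = set()
    for segments in traces.values():
        by_id = {segment["id"]: segment for segment in segments}
        for segment in segments:
            parent = by_id.get(segment.get("parent_id"))
            if parent is not None:
                edges.add((parent["name"], segment["name"]))
    return edges


# Step 1: segments as they arrive from individual services.
received = [
    {"trace_id": "t1", "id": "a", "name": "api"},
    {"trace_id": "t1", "id": "b", "parent_id": "a", "name": "orders"},
    {"trace_id": "t2", "id": "c", "name": "api"},
    {"trace_id": "t2", "id": "d", "parent_id": "c", "name": "payments"},
]

traces = group_into_traces(received)
graph = service_graph(traces)
print(sorted(graph))
```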

Types of X-Ray integration

  • Active instrumentation: Samples and instruments incoming requests
  • Passive instrumentation: Instrument requests that have been sampled by another service
  • Request tracing: Adds a tracing header to all incoming requests and propagates it downstream
  • Tooling: Runs the X-Ray daemon to receive segments from the X-Ray SDK

X-Ray integration with AWS services

  • Lambda

    • Active and passive instrumentation of incoming requests on all runtimes
    • Lambda adds two nodes to your service map, one for the AWS Lambda service, and one for the function
  • API Gateway

    • Active and passive instrumentation.
    • API Gateway uses sampling rules to determine which requests to record, and adds a node for the gateway stage to your service map
  • ELB

    • Request tracing on ALBs
    • ALB adds the trace ID to the request header before sending it to a target group
  • Beanstalk

    • Tooling
  • EC2

    • Use a user data script to install the X-Ray daemon
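
Request tracing relies on the `X-Amzn-Trace-Id` header being passed along with the request. A small sketch of reading that header and producing the value to forward downstream; the helper names are assumptions, and the sample value follows the format shown in the AWS documentation.

```python
from typing import Dict


def parse_trace_header(value: str) -> Dict[str, str]:
    """Split 'Root=...;Parent=...;Sampled=1' into its fields."""
    fields: Dict[str, str] = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    return fields


def downstream_header(fields: Dict[str, str], segment_id: str) -> str:
    """Keep the root trace ID, set this service's segment as the parent."""
    parts = [f"Root={fields['Root']}", f"Parent={segment_id}"]
    if "Sampled" in fields:
        parts.append(f"Sampled={fields['Sampled']}")
    return ";".join(parts)


incoming = "Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=1"
fields = parse_trace_header(incoming)
outgoing = downstream_header(fields, segment_id="53995c3f42cd8ad8")
print(outgoing)
```

Because every hop keeps the same `Root` value, the segments produced along the way can later be stitched into one trace.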


CloudTrail (Log Management)

CT: logs of API activity (CT can deliver its events to CW Logs)

CW: metrics


View events in Event History (actions taken by a user / role / service over the past 90 days)

CT Trails:

  • One region
  • all regions (default)
  • Organization trail

  • By default, CloudTrail event log files are encrypted using S3 server-side encryption. You can also encrypt log files with KMS
  • Use SNS to get notified when log files are delivered; enable log file validation to verify their integrity
  • CT publishes log files about every 5 minutes

Events

  • Management events

    • Logged (default)
    • Provide insight into control plane operations
  • Data events

    • Not logged (default)
    • Data plane ops
    • High-volume activities

Monitoring

  • CW Logs to monitor log data

    CT does not capture error logs from EC2 instances; you need CW Logs for this.

  • CT events that are sent to CW Logs can trigger alarms according to the metric filters you define

  • CT log file integrity validation: Determine whether a log file was modified / deleted after CT delivers it
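
The idea behind integrity validation can be sketched as a hash comparison. This is a heavily simplified model: it assumes a digest that maps each log file name to its SHA-256 hash, whereas the real feature also signs the digest files and chains them together (the `aws cloudtrail validate-logs` CLI command performs the full check). File names and contents here are placeholders.

```python
import hashlib
from typing import Dict


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def check_log_files(digest: Dict[str, str], delivered: Dict[str, bytes]) -> Dict[str, str]:
    """Compare each hash recorded at delivery time with the file found now."""
    report: Dict[str, str] = {}
    for name, expected_hash in digest.items():
        if name not in delivered:
            report[name] = "deleted"
        elif sha256_hex(delivered[name]) != expected_hash:
            report[name] = "modified"
        else:
            report[name] = "unchanged"
    return report


# Hashes recorded when the log files were delivered (simplified digest).
original = {"log-1.json": b'{"Records": [1]}', "log-2.json": b'{"Records": [2]}', "log-3.json": b'{"Records": [3]}'}
digest = {name: sha256_hex(content) for name, content in original.items()}

# Later: one file was tampered with and one was removed.
current = {"log-1.json": original["log-1.json"], "log-2.json": b'{"Records": []}'}
report = check_log_files(digest, current)
print(report)
```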