Merikanto


AWS Well-Architected Framework

Today we will walk through the AWS Well-Architected Framework:

  • Operational Excellence (CloudFormation)
  • Security (IAM)
  • Reliability (CloudWatch)
  • Performance Efficiency (CloudWatch)
  • Cost Optimization (Cost Explorer)


Intro

Definitions

The Five Pillars of the AWS Well-Architected Framework:

  • Operational Excellence
    • Run & monitor systems
    • Continually improve supporting processes & procedures
  • Security
    • Protect information, systems, and assets via risk assessments & mitigation strategies
  • Reliability
    • Recover from infrastructure / service disruptions
    • Dynamically acquire computing resources to meet demand
    • Mitigate disruptions (misconfigs, transient network issues)
  • Performance Efficiency
    • Use computing resources efficiently to meet system requirements
    • Maintain this efficiency as demand changes
  • Cost Optimization
    • Run at lowest cost

Terms used:

  • Component: Code, configuration & AWS resources that together deliver a requirement; decoupled from other components

  • Workload: Collection of components

  • Technology Portfolio: Collection of workloads

  • Architecture: How components work together in a workload

  • Milestone: Key changes in the architecture, as it evolves in the product lifecycle

    Product Lifecycle:

    Design -> Test -> Go Live -> In Production


When architecting workloads, we make trade-offs between pillars based on business needs (e.g. reducing cost at the expense of reliability). Security & Operational Excellence are generally not traded off against the other pillars.


On Architecture

Enterprise architecture capability:

AWS distributes architectural capabilities across teams, rather than having a single centralized team manage everything.


Design Principles

1. Stop guessing capacity needs

  • Use & pay on-demand

2. Test systems at production scale

  • Create a production-scale test environment on demand (to simulate the real live environment)
  • Decommission the resource after complete testing

3. Automate to make architectural experimentation easier

  • Track changes
  • Audit impact
  • Revert to previous parameters

4. Allow for evolutionary architecture

  • Infrastructure as code
  • Constantly change architecture based on demand

5. Use data-driven architectures

  • Collect data on how architectural choices affect the workload behavior
  • Make fact-based decisions
  • Data-driven: Cloud infrastructure is code, so use that data to make informed decisions

6. Improve through game days

  • Game Days: Simulating events in production
  • E.g. load testing, test to trigger failures

Operational Excellence

Run & monitor systems

Continually improve supporting processes & procedures


Principles (5)

1. Operations as code

  • Define entire workload (application / infrastructure) as code
  • Automate operations executions by triggering them in response to events
  • Limit human error, enable consistent response

2. Annotate Documentation

  • Automate doc creation after each build

3. Make frequent, small, reversible changes

  • Rollbacks are more difficult in large changes

4. Refine operation procedures frequently

  • Set up regular game days to review & evaluate

5. Anticipate failures & learn from them

  • Test failure scenarios
  • Share lessons across teams

Best Practices (3)

Three Best Practices:

  • Prepare
  • Operate
  • Evolve

Prepare

Before transitioning a workload into production, test responses to operational events & failures. Practice responses in supported environments through failure injection & game day events.

Operations as code: Use CloudFormation to build templates for the entire infrastructure, and maintain versioning.
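As a minimal sketch of operations as code, a CloudFormation template can be built and version-controlled like any other source file. The bucket logical name and tag values below are hypothetical placeholders:

```python
import json

# A minimal CloudFormation template built as a Python dict, so it can be
# generated programmatically and kept under version control. The logical
# resource name and tags are hypothetical placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Logging bucket managed as code",
    "Resources": {
        "LogBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
                "Tags": [{"Key": "Environment", "Value": "production"}],
            },
        }
    },
}

# Serialize to the JSON CloudFormation consumes.
print(json.dumps(template, indent=2))
```

The resulting JSON can then be deployed through CloudFormation and diffed, reviewed & rolled back like any code change.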

Visibility into workloads at all layers: log collection & monitoring.

  • Data on resource usage, application performance, API calls & network flows: collected with CloudWatch, CloudTrail & VPC Flow Logs.

Design workloads so that you can understand their state:

  • Metrics
  • Logs
  • Traces

Operate

Define expected outcomes, determine how success will be measured, and identify the workload & operations metrics.

Operational health

  • Workload health
  • Health of the operations acting upon the workload (e.g. deployment & incident response)

Use established runbooks for well-understood events, and playbooks to aid in the resolution.

  • If an alert is triggered by an event, make sure there’s an associated process to be executed with a specifically identified owner

Communicate workloads’ operational status via dashboards & notifications tailored to the target audience (e.g. customers, business, developers, operations).


Determine the root cause of unplanned events & unexpected impacts, then use this information to mitigate future occurrences of such events.


In AWS, generate dashboard views of the metrics collected from workloads & AWS-native services. Gain workload insights via logging tools: X-Ray, CloudWatch, CloudTrail & VPC Flow Logs.


Routine operations & responses to unplanned events should be automated; avoid manual processes.

Align metrics to business needs, so that responses are effective at maintaining business continuity.


Understand the health of workload: Define, capture & analyze workload metrics.

Manage workload & operations events: Prepare & validate procedures for responding to events.


Evolve

Dedicate work cycles to making continuous incremental improvements.

Regularly evaluate & prioritize opportunities for improvement

  • Feature requests
  • Issue remediation
  • Compliance requirements

Include feedback loops within procedures to rapidly identify areas for improvement.


Successful evolution of operations is founded in:

  • Frequent & small improvements
  • A safe environment & time to experiment, develop & test
  • Learning from failures

Key AWS Services

Most important: CloudFormation

Prepare

  • AWS Config & AWS Config rules: Create standards for workloads
  • Determine if environments are compliant with those standards

Operate

  • CloudWatch: Monitor operational health

Evolve

  • Amazon Elasticsearch Service (Amazon ES): Analyze log data to gain quick insights

Security

Protect information, systems, and assets via risk assessments & mitigation strategies


Principles (7)

1. Implement a strong identity foundation

  • Principle of least privilege
  • Enforce separation of duties
  • Centralize privilege management

2. Enable traceability

  • Monitor, alert & audit actions / changes to the environment in real time
  • Integrate logs & metrics

3. Apply security at all layers

  • Defense-in-depth
  • All layers
    • Edge network
    • VPC, Subnet
    • Load balancer
    • Every instance, OS, and application

4. Automate security best practices

  • Implementation of controls that are defined & managed as code, in version-controlled templates

5. Protect data in transit & at rest

  • Classify data sensitivity levels
  • Use encryption, tokenization & access control

6. Keep people away from data

  • Reduce / eliminate the need for direct access & manual processing of data
  • Hence reduce risk of human error

7. Prepare for security events

  • Have an incident management process, and run incident response simulations
  • Use automated tools to speed up detection, investigation & recovery

Best Practices (5)

Five best practices:

  • Identity & Access Management
  • Detective Controls
  • Infrastructure Protection
  • Data Protection
  • Incident Response

Identity & Access Management

In AWS, privilege management is primarily supported by IAM. Only authorized & authenticated AWS account users / AWS Resources are able to access the resources.

  • Define principals (Use a role-based approach)

    • Users
    • Groups
    • Services
    • Roles
  • Build security policies

  • Implement strong credentials
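A hedged sketch of what a role-based, least-privilege policy document looks like; the bucket name is a hypothetical placeholder:

```python
import json

# Sketch of a least-privilege IAM policy document: read-only access to a
# single (hypothetical) S3 bucket. Anything broader should be a deliberate,
# reviewed exception rather than the default.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOnlyReports",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports",      # the bucket itself (for ListBucket)
                "arn:aws:s3:::example-reports/*",    # objects inside it (for GetObject)
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Attaching such a document to a role (rather than to individual users) keeps privilege management centralized.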


Credentials must not be shared between users / systems. Programmatic access should use temporary, limited-privilege credentials issued by AWS Security Token Service (STS).


Detective Controls

Log management is important to a well-architected design.

Use detective controls to identify potential security threats / incidents.

  • Conduct an inventory of assets & their detailed attributes
  • Internal auditing: Examination of controls related to information systems

Implement detective controls

  • Process logs & events
  • Monitor logs & events for auditing, alarming & automated analysis
  • Create a threat model to defend against emerging security threats
  • Define a data-retention lifecycle (Preserved, archived, deleted)

With AWS

  • CloudTrail logs AWS API calls; CloudWatch monitors metrics with alarming. Capture & analyze events from logs & metrics to gain visibility
  • AWS Config: Provides configuration history
  • GuardDuty: Threat detection
  • S3: Store service-level logs; use S3 access logging to record access requests
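To illustrate how these pieces connect, a CloudWatch Events / EventBridge rule can match GuardDuty findings by event pattern and route them to an automated response. The tiny matcher and sample events below are only an illustration of the routing idea, not the full matching semantics:

```python
# An event pattern that selects GuardDuty findings, in the form a
# CloudWatch Events / EventBridge rule uses.
pattern = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
}

def matches(event, pattern):
    # Minimal subset of pattern matching: every pattern key must appear in
    # the event with one of the allowed values.
    return all(event.get(key) in allowed for key, allowed in pattern.items())

# Made-up sample events standing in for the real event bus.
events = [
    {"source": "aws.guardduty", "detail-type": "GuardDuty Finding"},
    {"source": "aws.ec2", "detail-type": "EC2 Instance State-change Notification"},
]

alerts = [e for e in events if matches(e, pattern)]
print(len(alerts))  # only the GuardDuty finding matches
```

In a real setup, the matched events would trigger a target such as a Lambda function or an SNS notification.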

Infrastructure Protection

Encompasses control methodologies, such as defense in depth.

  • In AWS, implement stateful & stateless packet inspection
  • Use VPC to create a private, secured & scalable environment to define the network topology
    • Gateways
    • Routing tables
    • Public & private subnets

Compute resources include

  • EC2 instances
  • ECS / Beanstalk / Containers
  • Lambda functions
  • Database services
  • IoT devices

Multiple layers of defense are advisable

  • Enforce boundary protection
  • Monitor points of ingress & egress
  • Comprehensive logging, monitoring & alerting

AWS provides the option to customize the configs of EC2 / ECS, and persist the config to an immutable Amazon Machine Image (AMI). When triggered by Auto Scaling / manual launch, all new instances launched with this AMI receive the customized config.


Data Protection

Practices to facilitate data protection

  • Data classification
  • Encryption & regular key rotation
  • S3: storage for exceptional resiliency
  • Encrypt data in transit & at rest
    • Server-side encryption (SSE) for S3 stores data in encrypted form
    • Use ELB to handle HTTPS encryption & decryption (SSL Termination)
  • Versioning
  • Data placed in a Region remains in that Region, unless you explicitly move it to another Region
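For example, default encryption at rest for an S3 bucket can be declared with a configuration like the following, in the shape boto3's `put_bucket_encryption` expects (the KMS key alias is a hypothetical placeholder):

```python
# Sketch of a default-encryption configuration for an S3 bucket. With this
# applied, objects written without explicit encryption headers are still
# encrypted at rest with the given KMS key.
encryption_config = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/example-data-key",  # hypothetical key alias
            }
        }
    ]
}

rule = encryption_config["Rules"][0]["ApplyServerSideEncryptionByDefault"]
print(rule["SSEAlgorithm"])
```

Combined with regular key rotation in KMS, this keeps encryption a bucket-level default rather than a per-upload decision.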

Incident Response

Routinely practice incident response through game days.

  • Detailed logging
  • Automatically process events & trigger automated responses
  • Pre-provision tooling & a clean room with CloudFormation, to carry out forensics in a safe & isolated environment

Key AWS Services

Most important: IAM

Identity & Access Management

  • IAM controls access to AWS services & resources (use with MFA)
  • AWS Organizations centrally manages & enforces policies for multiple AWS accounts

Detective Controls

  • CloudTrail records AWS API calls
  • AWS Config provides a detailed inventory of AWS resources & configurations
  • GuardDuty is for threat detection: it monitors for malicious / unauthorized behavior
  • CloudWatch Events can be triggered to automate security responses

Infrastructure Protection

  • VPC enables users to launch AWS resources into a virtual network
  • CloudFront is a CDN, integrated with AWS Shield for DDoS mitigation
  • AWS WAF can be deployed on CloudFront / Application Load Balancer to protect against common web exploits

Data Protection

  • In transit & at rest: ELB, EBS, S3, RDS (encryption)
  • Amazon Macie auto discovers & classifies sensitive data
  • KMS (Key Management Service) helps create & control encryption keys

Incident Response

  • Use IAM to grant authorizations to incident response teams & response tools
  • Use CloudFormation to create a trusted environment for forensics
  • Use CloudWatch Events to create rules that trigger automated response with Lambda

Reliability

Systems should be designed to detect failures & automatically heal themselves.

Recover from infrastructure / service disruptions

Dynamically acquire computing resources to meet demand

Mitigate disruptions (misconfigs, transient network issues)


Principles (5)

1. Test recovery procedures

  • Test how the system fails & validate recovery procedures
  • Use automation to simulate failures

2. Automatically recover from failure

  • Monitor a system for Key Performance Indicators (KPI)
  • Trigger automation when a threshold is breached
  • Anticipate & remediate failures before they occur
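At the client level, automated recovery often starts with retries using exponential backoff plus jitter. A minimal sketch, where the flaky call stands in for any AWS API or service dependency:

```python
import random
import time

def retry(fn, attempts=5, base=0.01):
    """Call fn, retrying on failure with exponential backoff and full jitter."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random fraction of the exponential cap.
            time.sleep(random.uniform(0, base * 2 ** i))

# A stand-in dependency that fails twice, then succeeds.
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = retry(flaky_call)
print(result)  # succeeds on the third attempt
```

The jitter matters: without it, many clients retrying in lockstep can re-saturate a recovering service.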

3. Scale horizontally to increase aggregate system availability

  • Replace one large resource with multiple small resources to reduce the impact of a single failure

4. Stop guessing capacity

  • Monitor demand & system utilization to avoid resource saturation
  • Automate addition / removal of resources

5. Manage change in automation

  • Changes should be done with automation

Best Practices (3)

Three best practices:

  • Foundations
  • Change Management
  • Failure Management

Foundations

First check the foundational requirements that influence reliability, e.g. sufficient network bandwidth. These requirements must be incorporated during initial planning. With the cloud, it is AWS’s responsibility to provide sufficient network bandwidth & compute capacity.


Manage service limits

  • Default service limits exist to prevent provisioning of more resources than needed.
  • AWS Direct Connect has limits on the amount of data that can be transferred

Manage network topology

  • Intra & inter system connectivity
  • Public & private IP address management
  • Name resolution

Moving from on-premises to the cloud:

First use a hybrid model, then gradually transition to a complete cloud approach.


Change Management

A scalable system provides elasticity to automatically add / remove resources.


How to monitor resources

  • Configure workloads to monitor logs & metrics (automatic logging lets you audit & quickly identify actions)
  • Send notifications when thresholds are breached / significant events occur
  • Configure workloads to self-heal automatically

Failure Management


Rather than trying to fix a failed resource, it is better to replace it with a new one and analyze the failed resource out of band.


Key to managing failures

Frequently & automatically test systems by injecting failures, then observe how they recover.


Plan for Disaster Recovery (DR):

  • Back up data: Meet requirements for Mean Time To Recovery (MTTR) & Recovery Point Objectives (RPO)
  • Actively track KPIs, such as RPO & RTO (Recovery Time Objective), to assess system’s resiliency (avoid single point of failure)
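RPO compliance can be checked directly from backup timestamps: the largest gap between recovery points bounds the data you could lose. A sketch with made-up sample times:

```python
from datetime import datetime, timedelta

def worst_gap(backups, now):
    """Largest interval between consecutive recovery points (and since the last one)."""
    points = sorted(backups) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))

# Target: lose at most one hour of data.
rpo = timedelta(hours=1)

# Made-up backup history relative to a fixed "now".
now = datetime(2024, 1, 1, 12, 0)
backups = [now - timedelta(minutes=m) for m in (150, 80, 30)]

gap = worst_gap(backups, now)
print(gap > rpo)  # True: a 70-minute gap violates the 1-hour RPO
```

The same style of check works for RTO, using measured recovery durations from game days instead of backup timestamps.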

Key AWS Services

Most important: CloudWatch (monitor runtime metrics)

Foundations

  • IAM enables secure access
  • VPC provides a private & isolated virtual network
  • AWS Trusted Advisor provides visibility into service limits
  • AWS Shield protects against DDoS

Change Management

  • CloudTrail records API calls & delivers log files for auditing
  • AWS Config provides detailed inventory of AWS resources & configuration
  • Auto Scaling automatically manages demand for deployed workloads
  • Use CloudWatch to get alerts on (custom) metrics, and aggregate logs from resources

Failure Management

  • CloudFormation provides templates for resource creation
  • S3 for keeping backups, S3 Glacier for archives
  • KMS provides reliable key management


Performance Efficiency

Use computing resources efficiently to meet system requirements

Maintain this efficiency as demand changes


Principles (5)

1. Democratize advanced technologies

  • Buy & consume as a service, rather than building it yourself

2. Go global in minutes

  • Provide lower latency

3. Use serverless architectures

  • Storage service as static websites (S3)
  • Event services to host the app code (Lambda)

4. Experiment more often

  • Use virtual & automated resources to test with different types of instances / configs

5. Mechanical sympathy

  • Use technology that best aligns with what you’re trying to achieve

    e.g. Consider data access patterns when selecting database / storage options


Best Practices (4)

Four best practices:

  • Selection
  • Review
  • Monitoring
  • Tradeoffs (e.g. Compression / caching, relax consistency requirements)

Take a data-driven approach to select a high-performance architecture.


Selection

A data-driven approach yields the optimal solution; data obtained through benchmarking / load testing is required to optimize the architecture.

Different architectural approaches:

  • Event-driven
  • ETL
  • Pipeline

Four main resource types to consider:

  • Compute
  • Storage
  • Database
  • Network

I. Compute

The optimal compute solution is based on application design, usage patterns & configuration settings. In AWS, compute is available in three forms: instances (EC2), containers (ECS) & functions (Lambda).

  • Instances are virtual servers, offer HDDs, SSDs & GPUs
  • Containers provide OS-level virtualization, allowing users to run apps & dependencies in resource-isolated processes
  • Functions abstract the execution environment from the code. Lambda executes code without running an instance
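A Lambda-style function reduces to a handler that maps an event to a response; the event shape below is an arbitrary example, not a specific AWS trigger format:

```python
import json

# Minimal sketch of a Lambda-style handler: it receives an event dict and
# returns a response, with no server to manage. The "name" field is a
# made-up example payload.
def handler(event, context=None):
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello {name}"}),
    }

# Handlers can be exercised locally by calling them directly with a test event.
resp = handler({"name": "builder"})
print(resp["statusCode"], json.loads(resp["body"])["message"])
```

This direct-invocation style is also how such handlers are typically unit-tested before deployment.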

II. Storage

Select optimal storage solution based on:

  • Access method (block / file / object)
  • Access pattern (random / sequential)
  • Access frequency (online / offline / archival)
  • Update frequency (WORM / dynamic)
  • Required throughput
  • Availability & durability constraints
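These criteria can be encoded as a (hypothetical) first-cut chooser; real selection still needs benchmarking against actual access patterns:

```python
# A hypothetical helper that turns the selection criteria above into a
# first-cut service suggestion. It only encodes the coarse access-method
# and access-frequency distinctions; throughput and durability constraints
# would narrow the choice further.
def suggest_storage(access_method, access_frequency="online"):
    if access_frequency == "archival":
        return "S3 Glacier"
    return {
        "block": "EBS",
        "file": "EFS",
        "object": "S3",
    }.get(access_method, "unknown")

print(suggest_storage("block"))               # EBS
print(suggest_storage("object", "archival"))  # S3 Glacier
```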

III. Database

Optimal database solution is based on:

  • Availability
  • Consistency
  • Partition tolerance
  • Latency
  • Durability
  • Scalability
  • Query capability

Sometimes, non-database solutions solve the problem more efficiently (e.g. search engine / data warehouse)


AWS Services

  • RDS
  • DynamoDB
  • Redshift (data warehouse)

IV. Network

Optimal network solution is based on:

  • Latency (Need to consider location when selecting network solutions)
  • Throughput requirements

Physical constraints (users / on-premises resources) can be offset using edge techniques / resource placement.


AWS Services

Product Features:

  • Enhanced networking
  • EBS-optimized instances
  • S3 Transfer Acceleration
  • Dynamic Amazon CloudFront

Network Features (reduce network distance / latency):

  • Route 53 latency routing
  • VPC endpoints
  • AWS Direct Connect

Review

Understand where the architecture is performance-constrained.


Monitoring

Monitoring metrics should be used to raise alarms when thresholds are breached.

  • Degradation of system performance over time
  • Remediate OS / application load

Make sure there aren’t too many false positives, and that you aren’t overwhelmed with data.


AWS Services:

  • CloudWatch: monitor & send notification alarms
  • Use automation by triggering actions via Kinesis / SQS / Lambda

Tradeoffs

Trade consistency, durability & space for time / latency.

AWS Services:

  • Caching solution: ElastiCache (Redis / Memcached, in-memory data store)
  • Cache content closer to end users: CloudFront
  • Distributed caching tier: DAX (DynamoDB Accelerator, read-through / write-through)
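The caching tradeoff itself is easy to see in a toy read-through cache with a TTL: repeated reads skip the backing store, at the cost of possibly stale data. The loader below stands in for a slow database:

```python
import time

class TTLCache:
    """Toy read-through cache: serve from memory while entries are fresh."""

    def __init__(self, loader, ttl=60.0):
        self.loader = loader
        self.ttl = ttl
        self.store = {}  # key -> (value, timestamp)

    def get(self, key):
        hit = self.store.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]            # fresh cache hit
        value = self.loader(key)     # miss or stale: read through to the store
        self.store[key] = (value, time.monotonic())
        return value

# Track how often the "database" is actually hit.
loads = []
cache = TTLCache(lambda k: loads.append(k) or f"row:{k}", ttl=60.0)

cache.get("a")
cache.get("a")
print(len(loads))  # backing store was hit only once
```

This is the consistency-for-latency trade the pillar describes: ElastiCache and DAX apply the same idea at scale, with eviction and invalidation handled for you.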

Key AWS Services

Most important: CloudWatch

  • Monitor resources & systems
  • Provide visibility into overall performance & operational health

Selection

  • Compute: Auto Scaling
  • Storage:
    • EBS (SSD, PIOPS: provisioned input/output operations per second)
    • S3 (serverless content delivery, S3 transfer acceleration)
  • Database

Review

Monitoring

  • CloudWatch provides metrics, alarms & notifications

Tradeoffs

  • Improve performance: ElastiCache, CloudFront, Snowball
  • Scale read-heavy workloads: Use read replicas in RDS

Cost Optimization

Run at lowest cost


Principles (5)

  • Adopt a consumption model
  • Measure overall efficiency
  • Stop spending money on data center operations
  • Analyze & attribute expenditure
  • Use managed & application-level services to reduce cost of ownership

Best Practices (4)

Spend time benchmarking for the most cost-optimal workload over time.

Four best practices:

  • Expenditure awareness
  • Cost-effective resources
  • Matching supply & demand
  • Optimizing over time

Key AWS Services

Most Important: Cost Explorer

Expenditure awareness

  • Cost Explorer

Cost-effective resources

  • CloudWatch & Trusted Advisor for right-sizing resources

  • AWS Direct Connect & CloudFront for optimizing data transfer

Matching supply & demand

  • Auto Scaling

Optimizing over time

  • Trusted Advisor inspects the AWS environment & finds opportunities to eliminate idle / unused resources, or to commit to Reserved Instance capacity
  • Read AWS Blog

Review

Use the AWS Well-Architected Framework to continually review the architecture, rather than holding one-off formal review meetings. Reviews should be applied:

  • Early in the design phase
  • At key milestones in the product lifecycle
  • Before the go-live date

Q & A

Operational Excellence

Automation

Fully automate integration & deployment


How do you know you’re ready to support a workload?

  • Use runbooks to perform procedures
  • Use playbooks to identify issues
  • Ensure consistent review of operational readiness

Understand the health of workload / operations

  • Identify KPIs (key performance indicators)
  • Define & collect workload metrics
  • Establish a workload metrics baseline & learn expected patterns of activity (for benchmarking)
  • Alert when the workload is at risk & anomalies are detected
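Baselining can be sketched as learning mean and standard deviation from history, then flagging values outside mean ± 3σ (the latency samples are made up; real baselines would come from CloudWatch metrics):

```python
import statistics

def make_baseline(samples):
    """Learn a simple baseline and return an anomaly predicate (mean ± 3σ)."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return lambda x: abs(x - mu) > 3 * sigma

# Made-up historical latency samples (ms) for a healthy workload.
history = [102, 98, 101, 99, 100, 103, 97, 100, 99, 101]
is_anomaly = make_baseline(history)

print(is_anomaly(100), is_anomaly(180))
```

A normal reading falls inside the band while a 180 ms spike is flagged; in practice the same threshold logic is what a CloudWatch alarm on a metric encodes.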

Security

Manage credentials & authentication

  • Define IAM requirements
  • Secure AWS root user, create IAM users for access
  • Automate enforcement of access controls
  • Rotate credentials regularly

Control human access

  • Grant least privileges
  • Unique credentials for each individual (segregation & traceability)

Detect & investigate security events

  • Collect metrics & define baselines
  • All logs should be collected centrally & automatically

Reliability

Manage network topology

  • Use highly available connectivity between private addresses in public clouds & the on-premises environment
  • Enforce non-overlapping private IP address ranges where multiple private address spaces are connected

Data backup

  • Perform data backup automatically
  • Perform periodic recovery of data, to verify backup integrity & processes

Dealing with failures

  • Monitor all layers of the workload, send notifications upon failure detection
  • Implement loosely-coupled dependencies
  • Deploy workload to multiple locations
  • Automate self-healing on all layers

Test resilience

  • Use playbooks for unanticipated failures
  • Conduct RCA (Root Cause Analysis)
  • Inject failures to test resiliency
  • Game days

Plan for DR (Disaster Recovery)

  • Define recovery objectives for downtime & data loss (RTO, RPO)
  • Use defined recovery strategies to meet recovery objectives
  • Manage configuration drift on all changes
  • Automate recovery

Performance Efficiency

Select best-performing architecture

  • Use reference architectures / policies
  • Load test the workload

Select compute solution

  • Collect compute-related metrics
  • Re-evaluate compute needs based on metrics

Select storage solution

  • Know storage characteristics & requirements (S3, EBS, EFS, EC2 instance store)
  • Decide based on access patterns & metrics

Select networking solution

  • Use minimal network ACLs
  • Leverage encryption offloading & load-balancing
  • Optimize network configs based on metrics

Cost Optimization

Govern usage

  • Implement account structure
  • Implement groups & roles
  • Track project / product lifecycle
  • Analyze all components of a chosen workload

