Merikanto


AWS Well-Architected Framework

Today we will walk through the AWS Well-Architected Framework:

  • Operational Excellence (CloudFormation)
  • Security (IAM)
  • Reliability (CloudWatch)
  • Performance Efficiency (CloudWatch)
  • Cost Optimization (Cost Explorer)


Intro

Definitions

The Five Pillars of the AWS Well-Architected Framework:

  • Operational Excellence
    • Run & monitor systems
    • Continually improve supporting processes & procedures
  • Security
    • Protect information, systems, and assets via risk assessments & mitigation strategies
  • Reliability
    • Recover from infrastructure / service disruptions
    • Dynamically acquire computing resources to meet demand
    • Mitigate disruptions (misconfigs, transient network issues)
  • Performance Efficiency
    • Use computing resources efficiently to meet system requirements
    • Maintain this efficiency as demand changes
  • Cost Optimization
    • Run at lowest cost

Terms used:

  • Component: Code, configuration & AWS resources that together deliver a requirement; decoupled from other components

  • Workload: Collection of components

  • Technology Portfolio: Collection of workloads

  • Architecture: How components work together in a workload

  • Milestone: Key changes in the architecture, as it evolves in the product lifecycle

    Product Lifecycle:

    Design -> Test -> Go Live -> In Production


When architecting workloads, we make trade-offs between pillars based on business needs (e.g. reducing cost at the expense of reliability). Security & Operational Excellence are generally not traded off against the other pillars.


On Architecture

Enterprise architecture capability:

AWS distributes architectural capabilities across teams, rather than having a single centralized team manage everything.


Design Principles

1. Stop guessing capacity needs

  • Use & pay on-demand

2. Test systems at production scale

  • Create a production-scale test environment on demand (to simulate the real live environment)
  • Decommission the resource after complete testing

3. Automate to make architectural experimentation easier

  • Track changes
  • Audit impact
  • Revert to previous parameters

4. Allow for evolutionary architecture

  • Infrastructure as code
  • Constantly change architecture based on demand

5. Use data-driven architectures

  • Collect data on how architectural choices affect the workload behavior
  • Make fact-based decisions
  • Data-driven: Cloud infrastructure is code, so use that data to make informed decisions

6. Improve through game days

  • Game Days: Simulating events in production
  • E.g. load testing, test to trigger failures

Operational Excellence

Run & monitor systems

Continually improve supporting processes & procedures


Principles (5)

1. Operations as code

  • Define entire workload (application / infrastructure) as code
  • Automate operations executions by triggering them in response to events
  • Limit human error, enable consistent response

2. Annotate Documentation

  • Automate doc creation after each build

3. Make frequent, small, reversible changes

  • Rollbacks are more difficult in large changes

4. Refine operation procedures frequently

  • Set up regular game days to review & evaluate

5. Anticipate failures & learn from them

  • Test failure scenarios
  • Share lessons across teams

Best Practices (3)

Three Best Practices:

  • Prepare
  • Operate
  • Evolve

Prepare

Before transitioning a workload into production, test responses to operational events & failures. Practice responses in supported environments through failure injection & game day events.

Operations as code: Use CloudFormation to build templates for the entire infrastructure, and maintain versioning.
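As a minimal sketch of operations as code, a CloudFormation template can be built and version-controlled like any other source file. The bucket logical name and tag values below are hypothetical placeholders:

```python
import json

# A minimal CloudFormation template built as a Python dict, so it can be
# generated programmatically and kept under version control. The logical
# resource name and tags are hypothetical placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Logging bucket managed as code",
    "Resources": {
        "LogBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
                "Tags": [{"Key": "Environment", "Value": "production"}],
            },
        }
    },
}

# Serialize to the JSON CloudFormation consumes.
print(json.dumps(template, indent=2))
```

The resulting JSON can then be deployed through CloudFormation and diffed, reviewed & rolled back like any code change.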

Visibility into workloads at all layers: log collection & monitoring.

  • Data on resource usage, application performance, API calls & network flows: collected with CloudWatch, CloudTrail & VPC Flow Logs.

Design workloads so that you can understand their state:

  • Metrics
  • Logs
  • Traces

Operate

Define expected outcomes, determine how success will be measured, and identify the workload & operations metrics.

Operational health

  • Workload health
  • Health of the operations acting upon the workload (e.g. deployment & incident response)

Use established runbooks for well-understood events, and playbooks to aid in the resolution.

  • If an alert is triggered by an event, make sure there’s an associated process to be executed with a specifically identified owner

Communicate workloads’ operational status via dashboards & notifications tailored to the target audience (e.g. customers, business, developers, operations).


Determine the root cause of unplanned events & unexpected impacts, then use this information to mitigate future occurrences of such events.


In AWS, generate dashboard views of the metrics collected from workloads & AWS-native services. Gain workload insights via logging tools: X-Ray, CloudWatch, CloudTrail & VPC Flow Logs.


Routine operations & responses to unplanned events should be automated; avoid manual processes.

Align metrics to business needs, so that responses are effective at maintaining business continuity.


Understand the health of workload: Define, capture & analyze workload metrics.

Manage workload & operations events: Prepare & validate procedures for responding to events.


Evolve

Dedicate work cycles to making continuous incremental improvements.

Regularly evaluate & prioritize opportunities for improvement

  • Feature requests
  • Issue remediation
  • Compliance requirements

Include feedback loops within procedures to rapidly identify areas for improvement.


Successful evolution of operations is founded in:

  • Frequent & small improvements
  • A safe environment & time to experiment, develop & test
  • Learning from failures

Key AWS Services

Most important: CloudFormation

Prepare

  • AWS Config & AWS Config rules: Create standards for workloads
  • Determine if environments are compliant with those standards

Operate

  • CloudWatch: Monitor operational health

Evolve

  • Amazon Elasticsearch Service (Amazon ES): Analyze log data to gain quick insights

Security

Protect information, systems, and assets via risk assessments & mitigation strategies


Principles (7)

1. Implement a strong identity foundation

  • Principle of least privilege
  • Enforce separation of duties
  • Centralize privilege management

2. Enable traceability

  • Monitor, alert & audit actions / changes to the environment in real time
  • Integrate logs & metrics

3. Apply security at all layers

  • Defense-in-depth
  • All layers
    • Edge network
    • VPC, Subnet
    • Load balancer
    • Every instance, OS, and application

4. Automate security best practices

  • Implementation of controls that are defined & managed as code, in version-controlled templates

5. Protect data in transit & at rest

  • Classify data sensitivity levels
  • Use encryption, tokenization & access control

6. Keep people away from data

  • Reduce / eliminate the need for direct access & manual processing of data
  • Hence reduce risk of human error

7. Prepare for security events

  • Have an incident management process, and run incident response simulations
  • Use automated tools to speed up detection, investigation & recovery

Best Practices (5)

Five best practices:

  • Identity & Access Management
  • Detective Controls
  • Infrastructure Protection
  • Data Protection
  • Incident Response

Identity & Access Management

In AWS, privilege management is primarily supported by IAM. Only authorized & authenticated AWS account users / AWS Resources are able to access the resources.

  • Define principals (Use a role-based approach)

    • Users
    • Groups
    • Services
    • Roles
  • Build security policies

  • Implement strong credentials
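A hedged sketch of what a role-based, least-privilege policy document looks like; the bucket name is a hypothetical placeholder:

```python
import json

# Sketch of a least-privilege IAM policy document: read-only access to a
# single (hypothetical) S3 bucket. Anything broader should be a deliberate,
# reviewed exception rather than the default.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOnlyReports",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports",      # the bucket itself (for ListBucket)
                "arn:aws:s3:::example-reports/*",    # objects inside it (for GetObject)
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Attaching such a document to a role (rather than to individual users) keeps privilege management centralized.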


Credentials must not be shared between users / systems. Programmatic access should use temporary, limited-privilege credentials issued by AWS Security Token Service (STS).


Detective Controls

Log management is important to a well-architected design.

Use detective controls to identify potential security threats / incidents.

  • Conduct an inventory of assets & their detailed attributes
  • Internal auditing: Examination of controls related to information systems

Implement detective controls

  • Process logs & events
  • Monitor logs & events for auditing, alarming & automated analysis
  • Create a threat model to defend against emerging security threats
  • Define a data-retention lifecycle (Preserved, archived, deleted)

With AWS

  • CloudTrail logs AWS API calls; CloudWatch monitors metrics with alarming. Capture & analyze events from logs & metrics to gain visibility
  • AWS Config: Provides configuration history
  • GuardDuty: Threat detection
  • S3: Store service-level logs; use S3 access logging to record access requests
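To illustrate how these pieces connect, a CloudWatch Events / EventBridge rule can match GuardDuty findings by event pattern and route them to an automated response. The tiny matcher and sample events below are only an illustration of the routing idea, not the full matching semantics:

```python
# An event pattern that selects GuardDuty findings, in the form a
# CloudWatch Events / EventBridge rule uses.
pattern = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
}

def matches(event, pattern):
    # Minimal subset of pattern matching: every pattern key must appear in
    # the event with one of the allowed values.
    return all(event.get(key) in allowed for key, allowed in pattern.items())

# Made-up sample events standing in for the real event bus.
events = [
    {"source": "aws.guardduty", "detail-type": "GuardDuty Finding"},
    {"source": "aws.ec2", "detail-type": "EC2 Instance State-change Notification"},
]

alerts = [e for e in events if matches(e, pattern)]
print(len(alerts))  # only the GuardDuty finding matches
```

In a real setup, the matched events would trigger a target such as a Lambda function or an SNS notification.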

Infrastructure Protection

Encompasses control methodologies, such as defense in depth.

  • In AWS, implement stateful & stateless packet inspection
  • Use VPC to create a private, secured & scalable environment to define the network topology
    • Gateways
    • Routing tables
    • Public & private subnets

Compute resources include

  • EC2 instances
  • ECS / Beanstalk / Containers
  • Lambda functions
  • Database services
  • IoT devices

Multiple layers of defense are advisable

  • Enforce boundary protection
  • Monitor points of ingress & egress
  • Comprehensive logging, monitoring & alerting

AWS provides the option to customize the configs of EC2 / ECS, and persist the config to an immutable Amazon Machine Image (AMI). When triggered by Auto Scaling / manual launch, all new instances launched with this AMI receive the customized config.


Data Protection

Practices to facilitate data protection

  • Data classification
  • Encryption & regular key rotation
  • S3: storage for exceptional resiliency
  • Encrypt data in transit & at rest
    • Server-side encryption (SSE) for S3 stores data in encrypted form
    • Use ELB to handle HTTPS encryption & decryption (SSL Termination)
  • Versioning
  • Data placed in a Region remains in that Region, unless you explicitly move it to another Region
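For example, default encryption at rest for an S3 bucket can be declared with a configuration like the following, in the shape boto3's `put_bucket_encryption` expects (the KMS key alias is a hypothetical placeholder):

```python
# Sketch of a default-encryption configuration for an S3 bucket. With this
# applied, objects written without explicit encryption headers are still
# encrypted at rest with the given KMS key.
encryption_config = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/example-data-key",  # hypothetical key alias
            }
        }
    ]
}

rule = encryption_config["Rules"][0]["ApplyServerSideEncryptionByDefault"]
print(rule["SSEAlgorithm"])
```

Combined with regular key rotation in KMS, this keeps encryption a bucket-level default rather than a per-upload decision.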

Incident Response

Routinely practice incident response through game days.

  • Detailed logging
  • Automatically process events & trigger automated responses
  • Pre-provision tooling & a clean room with CloudFormation, to carry out forensics in a safe & isolated environment

Key AWS Services

Most important: IAM

Identity & Access Management

  • IAM controls access to AWS services & resources (use with MFA)
  • AWS Organizations centrally manages & enforces policies for multiple AWS accounts

Detective Controls

  • CloudTrail records AWS API calls
  • AWS Config provides a detailed inventory of AWS resources & configurations
  • GuardDuty is for threat detection: it monitors for malicious / unauthorized behavior
  • CloudWatch Events can be triggered to automate security responses

Infrastructure Protection

  • VPC enables users to launch AWS resources into a virtual network
  • CloudFront is a CDN, integrated with AWS Shield for DDoS mitigation
  • AWS WAF can be deployed on CloudFront / Application Load Balancer to protect against common web exploits

Data Protection

  • In transit & at rest: ELB, EBS, S3, RDS (encryption)
  • Amazon Macie auto discovers & classifies sensitive data
  • KMS (Key Management Service) helps create & control encryption keys

Incident Response

  • Use IAM to grant authorizations to incident response teams & response tools
  • Use CloudFormation to create a trusted environment for forensics
  • Use CloudWatch Events to create rules that trigger automated response with Lambda

Reliability

Systems should be designed to detect failures & automatically heal themselves.

Recover from infrastructure / service disruptions

Dynamically acquire computing resources to meet demand

Mitigate disruptions (misconfigs, transient network issues)


Principles (5)

1. Test recovery procedures

  • Test how the system fails & validate recovery procedures
  • Use automation to simulate failures

2. Automatically recover from failure

  • Monitor a system for Key Performance Indicators (KPI)
  • Trigger automation when a threshold is breached
  • Anticipate & remediate failures before they occur
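At the client level, automated recovery often starts with retries using exponential backoff plus jitter. A minimal sketch, where the flaky call stands in for any AWS API or service dependency:

```python
import random
import time

def retry(fn, attempts=5, base=0.01):
    """Call fn, retrying on failure with exponential backoff and full jitter."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random fraction of the exponential cap.
            time.sleep(random.uniform(0, base * 2 ** i))

# A stand-in dependency that fails twice, then succeeds.
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = retry(flaky_call)
print(result)  # succeeds on the third attempt
```

The jitter matters: without it, many clients retrying in lockstep can re-saturate a recovering service.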

3. Scale horizontally to increase aggregate system availability

  • Replace one large resource with multiple small resources to reduce the impact of a single failure

4. Stop guessing capacity

  • Monitor demand & system utilization to avoid resource saturation
  • Automate addition / removal of resources

5. Manage change in automation

  • Changes should be done with automation

Best Practices (3)

Three best practices:

  • Foundations
  • Change Management
  • Failure Management

Foundations

First check the foundational requirements that influence reliability, e.g. sufficient network bandwidth. These requirements must be incorporated during initial planning. With the cloud, it is AWS’s responsibility to provide sufficient network bandwidth & compute capacity.


Manage service limits

  • Default service limits exist to prevent provisioning of more resources than needed.
  • AWS Direct Connect has limits on the amount of data that can be transferred

Manage network topology

  • Intra & inter system connectivity
  • Public & private IP address management
  • Name resolution

Moving from on-premises to the cloud:

First use a hybrid model, then gradually transition to a complete cloud approach.


Change Management

A scalable system provides elasticity to automatically add / remove resources.


How to monitor resources

  • Configure workloads to monitor logs & metrics (automatic logging lets you audit & quickly identify actions)
  • Send notifications when thresholds are breached / significant events occur
  • Configure workloads to self-heal automatically

Failure Management


Rather than trying to fix a failed resource, it is better to replace it with a new one and analyze the failed resource out of band.


Key to managing failures

Frequently & automatically test systems by injecting failures, then observe how they recover.


Plan for Disaster Recovery (DR):

  • Back up data: Meet requirements for Mean Time To Recovery (MTTR) & Recovery Point Objectives (RPO)
  • Actively track KPIs, such as RPO & RTO (Recovery Time Objective), to assess system’s resiliency (avoid single point of failure)
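RPO compliance can be checked directly from backup timestamps: the largest gap between recovery points bounds the data you could lose. A sketch with made-up sample times:

```python
from datetime import datetime, timedelta

def worst_gap(backups, now):
    """Largest interval between consecutive recovery points (and since the last one)."""
    points = sorted(backups) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))

# Target: lose at most one hour of data.
rpo = timedelta(hours=1)

# Made-up backup history relative to a fixed "now".
now = datetime(2024, 1, 1, 12, 0)
backups = [now - timedelta(minutes=m) for m in (150, 80, 30)]

gap = worst_gap(backups, now)
print(gap > rpo)  # True: a 70-minute gap violates the 1-hour RPO
```

The same style of check works for RTO, using measured recovery durations from game days instead of backup timestamps.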

Key AWS Services

Most important: CloudWatch (monitor runtime metrics)

Foundations

  • IAM enables secure access
  • VPC provides a private & isolated virtual network
  • AWS Trusted Advisor provides visibility into service limits
  • AWS Shield protects against DDoS

Change Management

  • CloudTrail records API calls & delivers log files for auditing
  • AWS Config provides detailed inventory of AWS resources & configuration
  • Auto Scaling automatically manages demand for deployed workloads
  • Use CloudWatch to get alerts on (custom) metrics, and aggregate logs from resources

Failure Management

  • CloudFormation provides templates for resource creation
  • S3 for keeping backups, S3 Glacier for archives
  • KMS provides reliable key management


Performance Efficiency

Use computing resources efficiently to meet system requirements

Maintain this efficiency as demand changes


Principles (5)

1. Democratize advanced technologies

  • Buy & consume as a service, rather than building it yourself

2. Go global in minutes

  • Provide lower latency

3. Use serverless architectures

  • Storage service as static websites (S3)
  • Event services to host the app code (Lambda)

4. Experiment more often

  • Use virtual & automated resources to test with different types of instances / configs

5. Mechanical sympathy

  • Use technology that best aligns with what you’re trying to achieve

    e.g. Consider data access patterns when selecting database / storage options


Best Practices (4)

Four best practices:

  • Selection
  • Review
  • Monitoring
  • Tradeoffs (e.g. Compression / caching, relax consistency requirements)

Take a data-driven approach to select a high-performance architecture.


Selection

A data-driven approach yields the optimal solution; data obtained through benchmarking / load testing is required to optimize the architecture.

Different architectural approaches:

  • Event-driven
  • ETL
  • Pipeline

Four main resource types to consider:

  • Compute
  • Storage
  • Database
  • Network

I. Compute

The optimal compute solution is based on application design, usage patterns & configuration settings. In AWS, compute is available in three forms: instances (EC2), containers (ECS) & functions (Lambda).

  • Instances are virtual servers, offer HDDs, SSDs & GPUs
  • Containers provide OS-level virtualization, allowing users to run apps & dependencies in resource-isolated processes
  • Functions abstract the execution environment from the code. Lambda executes code without running an instance
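A Lambda-style function reduces to a handler that maps an event to a response; the event shape below is an arbitrary example, not a specific AWS trigger format:

```python
import json

# Minimal sketch of a Lambda-style handler: it receives an event dict and
# returns a response, with no server to manage. The "name" field is a
# made-up example payload.
def handler(event, context=None):
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello {name}"}),
    }

# Handlers can be exercised locally by calling them directly with a test event.
resp = handler({"name": "builder"})
print(resp["statusCode"], json.loads(resp["body"])["message"])
```

This direct-invocation style is also how such handlers are typically unit-tested before deployment.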

II. Storage

Select optimal storage solution based on:

  • Access method (block / file / object)
  • Access pattern (random / sequential)
  • Access frequency (online / offline / archival)
  • Update frequency (WORM / dynamic)
  • Required throughput
  • Availability & durability constraints
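These criteria can be encoded as a (hypothetical) first-cut chooser; real selection still needs benchmarking against actual access patterns:

```python
# A hypothetical helper that turns the selection criteria above into a
# first-cut service suggestion. It only encodes the coarse access-method
# and access-frequency distinctions; throughput and durability constraints
# would narrow the choice further.
def suggest_storage(access_method, access_frequency="online"):
    if access_frequency == "archival":
        return "S3 Glacier"
    return {
        "block": "EBS",
        "file": "EFS",
        "object": "S3",
    }.get(access_method, "unknown")

print(suggest_storage("block"))               # EBS
print(suggest_storage("object", "archival"))  # S3 Glacier
```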

III. Database

Optimal database solution is based on:

  • Availability
  • Consistency
  • Partition tolerance
  • Latency
  • Durability
  • Scalability
  • Query capability

Sometimes, non-database solutions solve the problem more efficiently (e.g. search engine / data warehouse)


AWS Services

  • RDS
  • DynamoDB
  • Redshift (data warehouse)

IV. Network

Optimal network solution is based on:

  • Latency (Need to consider location when selecting network solutions)
  • Throughput requirements

Physical constraints (users / on-premises resources) can be offset using edge techniques / resource placement.


AWS Services

Product Features:

  • Enhanced networking
  • EBS-optimized instances
  • S3 Transfer Acceleration
  • Dynamic Amazon CloudFront

Network Features (reduce network distance / latency):

  • Route 53 latency routing
  • VPC endpoints
  • AWS Direct Connect

Review

Understand where the architecture is performance-constrained.


Monitoring

Monitoring metrics should be used to raise alarms when thresholds are breached.

  • Degradation of system performance over time
  • Remediate OS / application load

Make sure there aren’t too many false positives, and that you aren’t overwhelmed with data.


AWS Services:

  • CloudWatch: monitor & send notification alarms
  • Use automation by triggering actions via Kinesis / SQS / Lambda

Tradeoffs

Trade consistency, durability & space for time / latency.

AWS Services:

  • Caching solution: ElastiCache (Redis / Memcached, in-memory data store)
  • Cache content closer to end users: CloudFront
  • Distributed caching tier: DAX (DynamoDB Accelerator, read-through / write-through)
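The caching tradeoff itself is easy to see in a toy read-through cache with a TTL: repeated reads skip the backing store, at the cost of possibly stale data. The loader below stands in for a slow database:

```python
import time

class TTLCache:
    """Toy read-through cache: serve from memory while entries are fresh."""

    def __init__(self, loader, ttl=60.0):
        self.loader = loader
        self.ttl = ttl
        self.store = {}  # key -> (value, timestamp)

    def get(self, key):
        hit = self.store.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]            # fresh cache hit
        value = self.loader(key)     # miss or stale: read through to the store
        self.store[key] = (value, time.monotonic())
        return value

# Track how often the "database" is actually hit.
loads = []
cache = TTLCache(lambda k: loads.append(k) or f"row:{k}", ttl=60.0)

cache.get("a")
cache.get("a")
print(len(loads))  # backing store was hit only once
```

This is the consistency-for-latency trade the pillar describes: ElastiCache and DAX apply the same idea at scale, with eviction and invalidation handled for you.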

Key AWS Services

Most important: CloudWatch

  • Monitor resources & systems
  • Provide visibility into overall performance & operational health

Selection

  • Compute: Auto Scaling
  • Storage:
    • EBS (SSD, PIOPS: provisioned input/output operations per second)
    • S3 (serverless content delivery, S3 transfer acceleration)
  • Database

Review

Monitoring

  • CloudWatch provides metrics, alarms & notifications

Tradeoffs

  • Improve performance: ElastiCache, CloudFront, Snowball
  • Scale read-heavy workloads: Use read replicas in RDS

Cost Optimization

Run at lowest cost


Principles (5)

  • Adopt a consumption model
  • Measure overall efficiency
  • Stop spending money on data center operations
  • Analyze & attribute expenditure
  • Use managed & application-level services to reduce cost of ownership

Best Practices (4)

Spend time benchmarking for the most cost-optimal workload over time.

Four best practices:

  • Expenditure awareness
  • Cost-effective resources
  • Matching supply & demand
  • Optimizing over time

Key AWS Services

Most Important: Cost Explorer

Expenditure awareness

  • Cost Explorer

Cost-effective resources

  • CloudWatch & Trusted Advisor for right-sizing resources

  • AWS Direct Connect & CloudFront for optimizing data transfer

Matching supply & demand

  • Auto Scaling

Optimizing over time

  • Trusted Advisor inspects the AWS environment & finds opportunities to eliminate idle / unused resources, or to commit to Reserved Instance capacity
  • Read AWS Blog

Review

Use the AWS Well-Architected Framework to continually review the architecture, rather than holding one-off formal review meetings. Reviews should be applied:

  • Early in the design phase
  • At key milestones in the product lifecycle
  • Before the go-live date

Q & A

Operational Excellence

Automation

Fully automate integration & deployment


How do you know you’re ready to support a workload?

  • Use runbooks to perform procedures
  • Use playbooks to identify issues
  • Ensure consistent review of operational readiness

Understand the health of workload / operations

  • Identify KPIs (key performance indicators)
  • Define & collect workload metrics
  • Establish a workload metrics baseline & learn expected patterns of activity (for benchmarking)
  • Alert when the workload is at risk & anomalies are detected
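Baselining can be sketched as learning mean and standard deviation from history, then flagging values outside mean ± 3σ (the latency samples are made up; real baselines would come from CloudWatch metrics):

```python
import statistics

def make_baseline(samples):
    """Learn a simple baseline and return an anomaly predicate (mean ± 3σ)."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return lambda x: abs(x - mu) > 3 * sigma

# Made-up historical latency samples (ms) for a healthy workload.
history = [102, 98, 101, 99, 100, 103, 97, 100, 99, 101]
is_anomaly = make_baseline(history)

print(is_anomaly(100), is_anomaly(180))
```

A normal reading falls inside the band while a 180 ms spike is flagged; in practice the same threshold logic is what a CloudWatch alarm on a metric encodes.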

Security

Manage credentials & authentication

  • Define IAM requirements
  • Secure AWS root user, create IAM users for access
  • Automate enforcement of access controls
  • Rotate credentials regularly

Control human access

  • Grant least privileges
  • Unique credentials for each individual (segregation & traceability)

Detect & investigate security events

  • Collect metrics & define baselines
  • All logs should be collected centrally & automatically

Reliability

Manage network topology

  • Use highly available connectivity between private addresses in public clouds & the on-premises environment
  • Enforce non-overlapping private IP address ranges where multiple private address spaces are connected

Data backup

  • Perform data backup automatically
  • Perform periodic recovery of data, to verify backup integrity & processes

Dealing with failures

  • Monitor all layers of the workload, send notifications upon failure detection
  • Implement loosely-coupled dependencies
  • Deploy workload to multiple locations
  • Automate self-healing on all layers

Test resilience

  • Use playbooks for unanticipated failures
  • Conduct RCA (Root Cause Analysis)
  • Inject failures to test resiliency
  • Game days

Plan for DR (Disaster Recovery)

  • Define recovery objectives for downtime & data loss (RTO, RPO)
  • Use defined recovery strategies to meet recovery objectives
  • Manage configuration drift on all changes
  • Automate recovery

Performance Efficiency

Select best-performing architecture

  • Use reference architectures / policies
  • Load test the workload

Select compute solution

  • Collect compute-related metrics
  • Re-evaluate compute needs based on metrics

Select storage solution

  • Know storage characteristics & requirements (S3, EBS, EFS, EC2 instance store)
  • Decide based on access patterns & metrics

Select networking solution

  • Use minimal network ACLs
  • Leverage encryption offloading & load-balancing
  • Optimize network configs based on metrics

Cost Optimization

Govern usage

  • Implement account structure
  • Implement groups & roles
  • Track project / product lifecycle
  • Analyze all components of a chosen workload

