Ace the AWS Well-Architected Framework: Learn, Measure
Introduction
The Well-Architected Framework is a framework to measure applications
running in the cloud against a set of strategies and best practices. The
framework has been compiled by AWS after working closely with thousands
of customers. The purpose of this document is to empower people to make
informed decisions about their cloud architectures and help them
understand the impact of their decisions. The program is a starting point in
the architectural process that should be used as a catalyst for further
thinking and conversations.
General Design Principles
There are three main components of the well-architected framework:
• General Design Principles
• The Five Pillars
• Review (or Questions)
Let’s go through these components one by one.
Cloud computing has opened up the technology space to a whole new way
of thinking, where constraints that we used to have in the traditional
environment no longer exist. When thinking about the design principles, it is
interesting to see how things work out in contrast to the traditional
environment.
1) Stop guessing your capacity needs
2) Test systems at production scale
3) Automate to make architectural experimentation easier
4) Allow for evolutionary architectures
5) Drive architectures using data:
With traditional environments, you probably used models and assumptions
to size your architecture rather than modelling based on larger data sets. When
your application infrastructure is based on code, you can collect data and
understand how changes to it affect the workload. This can help you design
the architecture using considerably larger datasets than you could with your
on-premises infrastructure.
6) Improve through game days:
Finally, in a traditional environment you would exercise your runbook only
when something bad had happened. In the cloud these constraints have
been removed. You can afford to simulate events or break things
intentionally and get your team to check your operational readiness and
resiliency to failures. This is also a good opportunity for all the
different teams to come together and understand dependencies between
the systems, especially in case of failures.
Security
Use a zero-trust model when thinking about security in the cloud. You
should treat application components and services as potentially malicious
entities. This means that we need to apply security at all levels in the
cloud. The following are the important domains involved in securing
systems with zero trust in the cloud:
IAM: Identity and access management services in AWS allow you to
create role-based access policies with granular permissions for each user
and service. When working with identity management you need to ensure
that you have strong sign-in mechanisms, use temporary credentials,
audit the roles and credentials periodically and store passwords
securely. You can leverage AWS services such as IAM, Secrets Manager
and AWS Single Sign-On to meet your requirements.
Network security: This means adding multiple layers of defence in your
environment to harden it against any threat. The most basic component is
the VPC, with which you can create your own virtual private cloud that is
isolated from other customers' resources. Within the VPC you
should segregate the placement of your resources depending on their
requirement to be internet facing or internal components. On top of
that you should make use of granular controls such as security groups,
network ACLs and route tables to prevent malicious users from gaining
access to your resources. In addition, your environment should be built to
withstand external attacks based on common risks such as the OWASP Top
10 using AWS WAF, as well as mitigate volumetric DDoS attacks using the
Shield services.
Data encryption and protection: As AWS customers, you are responsible for
protecting your data. This should start with data classification,
dividing data into categories based on criticality and access level. Based on
the classification you should design your architecture to use services which
offer the expected availability and durability. For example, the S3 service offers
99.999999999% durability for your objects. Similarly, in order to protect
your data, you need to have measures in place such as encryption at rest
(KMS) and encryption in transit (SSL/TLS, VPN).
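As a minimal sketch of encryption at rest, the snippet below uses boto3 to turn on default KMS encryption for an S3 bucket; the bucket name and KMS key ARN are placeholders for illustration, not values from this guide.

```python
import boto3

# Hypothetical bucket and KMS key, used purely for illustration.
BUCKET = "example-sensitive-data-bucket"
KMS_KEY_ID = "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555"

s3 = boto3.client("s3")

# Enforce encryption at rest: every new object is encrypted with the KMS key by default.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ID,
                }
            }
        ]
    },
)
```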
Performance efficiency
This pillar focuses on efficiency and scalability. With the cloud you can
scale to handle almost any amount of traffic, so you no longer need to size
your services for peak capacity up front. In the on-premises model of doing
things, servers are expensive and it may take a long time to deploy and
configure them. In that model, the servers used are mostly of the same
kind and there may be one server doing multiple functions. The better
way to do this in the cloud is to provision cheap and quick resources, which
also gives you the freedom to choose the server type that most closely matches
the workload.
Because every server is interchangeable and quick to deploy, we can
easily scale our capacity by adding more servers.
The key concepts for performance efficiency include:
I. Selection: Ability to choose services which match your
workloads. AWS has over 175 services to match your
workload. Achieving performance through selection means
being able to choose the right tool for your job.
Reliability
This pillar focuses on building services with resiliency to both services and
infrastructures disruptions. You should architect your services with
reliability in mind. We can think in terms of blast radius. That is the
maximum impact in event of a failure. To build reliable systems, you need
to minimize the blast radius.
One of the most common techniques is spreading your resources across
multiple availability zones. You should have automatic triggers in place in
order to mitigate impact to the application in case of certain failures.
While Auto Scaling is a service which helps you create a fault-tolerant
server environment at scale, you should also consider using microservices-
based architecture wherever possible. With microservices-based,
decoupled architectures, changes or failures in one API or component do
not break the functionality of your application entirely, and they also help
you recover quickly from failures.
Lastly, you should also have a DR strategy in place, which could be in
the form of a data backup or a backup environment in other regions, on-
premises or a multi-cloud environment.
Operational Excellence
This pillar focuses on continuous improvement. You need to think about
automation and eliminating human error. The more operations can be
automated, the lower the chance of an error. In addition to fewer errors, it
also helps you continuously improve your internal processes.
When you want to gain as much insight into your workload as possible,
you have to think about the right telemetry data. Using the most
common form of monitoring available, CloudWatch metrics, you
can keep an eye on your resource load. Further, you can get additional
logging by pushing your application logs to CloudWatch Logs. It does
not stop here. In addition to collecting the telemetry data, you need to set up
alerts with the right thresholds to make actual use of the metric and
logging information.
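Building on the alerting point above, here is a hedged sketch of creating a CloudWatch alarm with boto3; the ALB dimension value and SNS topic ARN are illustrative placeholders, not resources referenced in this guide.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical SNS topic that pages the on-call engineer.
ALARM_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-alerts"

cloudwatch.put_metric_alarm(
    AlarmName="alb-high-5xx-rate",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],
    Statistic="Sum",
    Period=60,                 # evaluate one-minute windows
    EvaluationPeriods=5,       # alarm only if 5 consecutive periods breach
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALARM_TOPIC_ARN],
)
```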
If you generate an alert for an event, you should also have a runbook to
handle it. As an important aspect of the operational
excellence pillar, you should constantly automate and improve your reaction
to these alerts. This ensures that your reaction is error free and you
do not have to wake up your on-call engineer at 2:00 AM
when one of your servers starts throwing HTTP 500 errors.
Cost optimization
This pillar helps you achieve business outcomes while minimizing costs.
Cost optimization in the cloud can be explained in terms of Opex instead of
Capex. In simpler terms, Opex is a pay-as-you-go model, whereas Capex is a
one-time up-front payment or huge yearly licensing fee. Instead of paying
a huge cost upfront, the cloud gives you the option to invest in innovation.
With AWS you should make use of tagging of your resources to check the
bill amount corresponding to each project and group of resources. This
helps you identify improvement areas in terms of workload distribution.
You can also make use of the findings of services such as Trusted Advisor,
which provide you insight into the utilization of resources and whether you
should downsize to an optimal value to reduce your costs. Further, you can
set up billing alerts to ensure there are no surprises at the end of the month
because one engineer decided to use a NAT Gateway to download
terabytes of data from S3.
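As one possible way to implement billing alerts, the sketch below creates an AWS Budget with an 80% notification using boto3; the account ID, budget amount and e-mail address are placeholders.

```python
import boto3

budgets = boto3.client("budgets")

# Hypothetical account ID and recipient address.
ACCOUNT_ID = "123456789012"

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "monthly-cost-guardrail",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,            # alert at 80% of the budget
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
    ],
)
```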
While you can set up your architecture with optimal cost, it is also important
to keep yourself up to date with the latest features and service releases, as
they can help you further reduce your costs by slightly modifying your
environments. Some examples include using gateway VPC endpoints
instead of NAT Gateways, or using shared VPC architectures for services
which are commonly used across your organization to remove duplicate
resources.
You can also try to evaluate each component or entity of your
workload separately against the best practices highlighted in the Well-
Architected Framework; this will give you a granular view of each distinct
function of your environment.
The review itself consists of the following steps:
1) Define your workload and stage.
2) Answer the questions against each pillar of the workload.
3) Get a summary of your workload review.
4) Have multiple reviews in place in the form of a dashboard.
5) Links to the best practices and videos as highlighted in the pillars.
6) Archive and audit changes to your workload review as you progress.
Further Reading
https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-
Architected_Framework.pdf
Ace the Security pillar
AWS Well-Architected Framework
Introduction
For a busy person who may not have the time to go through the hundreds of
pages of the Well-Architected Framework, this guide serves as an
abstract of the security pillar of the Well-Architected Framework.
Understanding the security pillar will ensure that you are equipped
with the knowledge of the security best practices which you should
implement on your workloads in the cloud.
The security pillar includes the ability to protect data, systems, and assets, and to
take advantage of cloud technologies to improve your security. As much as
one can argue, the security pillar is one of the most important pillars of the
Well-Architected Framework. There might be a close competition between
reliability and security: without security there is no trust with your
customer, and without reliability you are not serving your customer. So
ultimately it may be up to the organization's goals which one is to be
prioritized. For now, without further ado, let's go through the details of the
security pillar.
Components
Like every pillar of the Well-Architected Framework, the security pillar
covers the following two broad areas:
Design Principles
Definitions
Design Principles
Following the various design principles highlighted in this pillar can help you
with securing your applications:
1. Implement a strong identity foundation
Implement the principle of least privilege and enforce separation of duties, with
appropriate authorization for each interaction with your AWS resources.
Centralize identity management, and aim to eliminate reliance on long-term
static credentials.
2. Enable traceability
You should be able to monitor at every stack level in your environment. And
this should enable you to get information about any changes in your
environment in as much real time as possible. You also need to have
mechanisms in place by which you can automatically take action
corresponding to the changes. It could be either a notification action based on
a failed login attempt or a capacity adjustment based on spike in traffic to
your application.
3. Apply security at all layers
You also need to have security enabled at all stack levels. This can be at the
edge network layer with the help of a secure VPN or a dedicated fibre
connection. At the infrastructure level you should make use of the virtual
private cloud (VPC) and then implement controls using network ACLs for
subnet-level and security groups for instance-level security. In order to tighten
security for your application, you should ensure that the end user does not
directly access the instance/OS but goes through a load balancer. On top
of that, if the application supports it, you may make use of SSO for
application access to prevent the compromise of your environment.
4. Automate security best practices
Automated, software-based security mechanisms let you scale securely, rapidly
and cost-effectively, for example by defining controls as code in version-
controlled templates.
5. Protect data in transit and at rest
Depending on the data type, data should be classified into high, medium or
low sensitivity levels. Based on these sensitivity levels you need to have
mechanisms in place to encrypt the data both in transit (HTTPS/VPN etc.)
and at rest (using KMS). You also need to implement access control
mechanisms which determine who can read the data and who can
modify or copy it.
6. Keep people away from data
Another best practice is to eliminate the need to manually access data,
and where access is required, only provide read-only access. You can put these
controls in place by using IAM roles.
7. Prepare for security events
Definitions
AWS outlined five focus areas that encompass security in the cloud:
IAM
Detection
Infrastructure protection
Data protection
Incidence response
Let's go into the details of each of them to understand them better and how
we can use them to secure our cloud environment.
IAM
1. Rely on a centralized identity provider
Imagine you have an employee who has different credentials when accessing
different applications across your on-premises, AWS services and your
application environment. If the employee leaves the organization, it
becomes a tedious task to revoke access from all the systems. This is why,
for your workforce, you should have a centralized identity provider to
manage the user identities in one place. You can integrate external
identity providers such as Okta or ADFS via SAML 2.0 with AWS IAM and
enable an authenticated user to access various AWS services.
If you have multiple accounts under one AWS organization, you can also
make use of AWS Single Sign On (AWS SSO) and integrate your identity
provider with it to enable access to AWS services for your authenticated
employee.
2. Leverage user groups and attributes
For both working at scale as well as ease of management, you can make use of
user groups to apply a similar set of security restrictions. User groups are
available in AWS SSO as well as in IAM.
3. Use strong sign-in mechanisms
You should have a password policy which enforces complex passwords and
this should also be backed by Multi-factor authentication. MFA is supported
for both IAM users as well as through AWS SSO.
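As an illustrative sketch of enforcing strong sign-in, the following boto3 snippet sets an account password policy and then lists users without an MFA device; the specific policy values are assumptions for illustration, not recommendations from this guide.

```python
import boto3

iam = boto3.client("iam")

# Enforce a complex password policy for IAM users; the values are illustrative.
iam.update_account_password_policy(
    MinimumPasswordLength=14,
    RequireSymbols=True,
    RequireNumbers=True,
    RequireUppercaseCharacters=True,
    RequireLowercaseCharacters=True,
    MaxPasswordAge=90,            # force rotation every 90 days
    PasswordReusePrevention=24,   # disallow reusing the last 24 passwords
    AllowUsersToChangePassword=True,
)

# Audit which IAM users still have no MFA device configured.
for user in iam.list_users()["Users"]:
    mfa = iam.list_mfa_devices(UserName=user["UserName"])["MFADevices"]
    if not mfa:
        print(f"{user['UserName']} has no MFA device configured")
```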
4. Use temporary credentials
Depending on the entity accessing the services, you can use different
mechanisms to make them dynamically acquire temporary credentials. The use of
temporary credentials takes you away from the risk of compromised
passwords. For your employees you can make use of AWS SSO or use
federation with IAM; if it is a system such as an EC2 instance or a Lambda
function, then you can make use of IAM roles to provide it with
temporary credentials for accessing AWS services and accounts.
Depending on your application environment, you may also require your end
users to access your AWS resources (for example an S3 bucket for uploads).
You can make use of Cognito identity pools for such cases to assign
temporary tokens to the consumers of the application.
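To make the temporary-credentials idea concrete, here is a hedged boto3 sketch that assumes a role through STS and uses the short-lived credentials instead of long-term access keys; the role ARN is a placeholder.

```python
import boto3

sts = boto3.client("sts")

# Hypothetical role ARN; its trust policy must allow the caller to assume it.
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/read-only-audit",
    RoleSessionName="audit-session",
    DurationSeconds=3600,  # credentials expire after one hour
)

creds = response["Credentials"]

# Build a client from the short-lived credentials instead of stored access keys.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```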
5. Audit and rotate credentials periodically
Your password policy should have an expiration period which ensures that
your users are forced to change the password after a pre-determined duration.
You can also make use of AWS Config rules which enforce the IAM
password policy for rotation of credentials.
6. Store and use secrets securely
All non-IAM credentials, such as database passwords, should not be
stored as plain text or in environment variables; rather you should make use
of AWS Secrets Manager to store them. You should also configure
IAM restrictions so that only certain users are able to use the secrets.
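A minimal sketch of reading a database secret at runtime from AWS Secrets Manager with boto3; the secret name and its JSON layout are assumptions for illustration.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

# Hypothetical secret holding database credentials as a JSON document.
secret = secrets.get_secret_value(SecretId="prod/orders/db-credentials")
db_config = json.loads(secret["SecretString"])

# The credentials are fetched at runtime, never stored in code or environment variables.
print(db_config["username"])
```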
Permissions management:
1. Define permission guardrails for your organization
Both users and systems should be granted only the permissions required to do
a specific task. This can also be extended by setting up permissions
boundaries and attribute-based access controls.
3. Analyse public and cross account access
If your AWS resources need cross-account access, that should be granted only to
trusted accounts which are part of your AWS organization, and you should use
resource-based policies to restrict actions. A resource should be made public
only if absolutely required.
4. Share resources securely
For environments spread across multiple accounts, if you need to share your
AWS resources then you should make use of AWS Resource access manager
and share resources to only trusted principals from your AWS Organization.
5. Reduce permissions continuously
You should periodically use IAM Access Analyzer and CloudTrail to
review unused credentials (users, roles etc.) as well as restrict the permissions
attached in IAM policies for all users based on least privilege.
6. Establish emergency access process
Under extreme situations the access to your workloads may break. In such
cases, you should have alternate methods to access your environment by
using cross account roles.
Detection
Apart from collecting all the logs, you should have mechanisms in place to
analyse and identify meaningful information from them and to set up a
benchmark of good versus bad behaviour. AWS services such as GuardDuty
and Security Hub can aggregate, deduplicate and analyse logs received from
other services. These two services alone can help you make sense of the logs
received from VPC flow logs, the VPC DNS service, CloudTrail, Inspector,
Macie and Firewall Manager. This ensures that you get an overall view of
what is happening in your AWS environment and allows you to route,
escalate, and manage events or findings.
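For example, you could pull recent GuardDuty findings programmatically for triage, as in this hedged boto3 sketch; it assumes a detector already exists in the region.

```python
import boto3

guardduty = boto3.client("guardduty")

# There is one GuardDuty detector per account and region.
detector_id = guardduty.list_detectors()["DetectorIds"][0]

# Fetch a batch of findings and print a short summary for triage.
finding_ids = guardduty.list_findings(DetectorId=detector_id, MaxResults=20)["FindingIds"]

for finding in guardduty.get_findings(DetectorId=detector_id, FindingIds=finding_ids)["Findings"]:
    print(finding["Severity"], finding["Type"], finding["Title"])
```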
Investigate
1. Implement actionable security events
Your log configuration and analysis are only useful if you have an action plan
or runbook for each type of finding or event. You need to have
documentation about each type of finding and update it continuously with a
runbook for when a particular event occurs.
2. Automate response to events
Infrastructure protection
Infrastructure protection ensures that the services and systems used in your
workload are protected against unintended and unauthorized access, and
potential vulnerabilities. You can achieve this by protecting networks and
protecting compute.
Protecting Networks
1. Create network layers
You should have segregation of resources into public and private networks.
Only the resources such as internet-facing load balancers should be in public
subnets and rest of the resources like webservers, RDS or even managed
services like Lambda should use private subnet to prevent unauthorized
access attempts. For large environments with multiple accounts and VPCs,
you can use resources like Transit Gateway for inter-VPC and edge
networking. This ensures that your resources are not exposed to the internet.
2. Control traffic at all layers
Extending the previous point, depending on where you place your resources,
you should have control measures in place to allow only specific types of
traffic. You can achieve this using network ACLs (NACLs) at the subnet
level and security groups at the instance level. Further, you can remove
unnecessary public exposure of your resources by means of VPC
endpoints, which allow your instances to access public AWS services over a
secure private channel from the VPC.
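As a small illustration of the VPC endpoint point above, the sketch below creates a gateway endpoint for S3 with boto3; the VPC, route table and region values are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Route S3 traffic privately through the VPC instead of a NAT gateway or the internet.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```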
3. Implement inspection and protection
For the resources which need public access, you should inspect all type of
traffic which tries to connect to your resources and have rules in place to
protect them from common attacks. This can be done using AWS WAF and
Shield advanced which can be used in front of your workload resources.
4. Automate network protection
Threats do not wait for the specific time when you are online; they can pose a
risk to your resources 24/7. So, you should be prepared to
automatically update your security measures based on new sources and
threats as well. AWS WAF managed rules, along with the WAF security
automations solution, can help you dynamically block certain traffic based
on new and known threats.
Protecting Compute
1. Perform vulnerability management
You can reduce the attack surface by removing unused components in your
operating systems be in software packages or applications. The components
which you need should follow the best practises for hardening and security
guides.
3. Enable people to perform actions at a distance
In order to reduce the risk of human error, you should try to avoid direct
access to systems and resources as much as possible. This includes SSH,
RDP and AWS Management Console access. As an alternative you can use
AWS Systems Manager to run commands on your EC2 instances, or use
CloudFormation templates via pipelines to make changes to your
infrastructure environment.
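A hedged sketch of acting at a distance with Systems Manager Run Command via boto3; the instance ID and the commands themselves are illustrative, and the instance is assumed to run the SSM agent with an appropriate instance profile.

```python
import boto3

ssm = boto3.client("ssm")

# Patch and restart a hypothetical web tier without interactive SSH access.
response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["yum -y update", "systemctl restart httpd"]},
    Comment="Patch web tier via Run Command instead of SSH",
)
print(response["Command"]["CommandId"])
```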
4. Implement managed services
Instead of hosting your own resources for databases or containers, you can
make use of managed services that take care of the provisioning, availability
as well as security patching of the resources. By using services such as
Amazon RDS, ECS or Lambda, you can get AWS to look after the security
aspects of the infrastructure while you focus on your application.
5. Validate software integrity
As simple as it may sound, you should ensure that any external code or
packages used in your workloads come from trusted sources, and you
should check that they have not been tampered with by verifying checksums
and certificates from the original author.
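As a simple illustration of verifying integrity, this standard-library sketch compares a downloaded file's SHA-256 digest with the value published by the author; the file name and digest are placeholders.

```python
import hashlib

def verify_sha256(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 digest matches the published checksum."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex

# Example: reject a downloaded package whose checksum does not match.
if not verify_sha256("package.tar.gz", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"):
    raise RuntimeError("Checksum mismatch - do not install this package")
```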
6. Automate compute protection
Data protection
Data protection can be categorized into the following three areas, described
in the sections below: data classification, protecting data at rest, and
protecting data in transit.
Data classification
Once the data has been classified, you can separate them by tagging them
based on classification and use different accounts for sensitive data. The
access to such data can also be controlled based on tag condition on the
resources in the IAM policies.
3. Define data lifecycle management
You can audit access to your data at rest by using CloudTrail as well as
service-level logs such as S3 access logs or VPC flow logs. By analysing the
access details you can put measures in place to prevent unnecessary access as
well as reduce public access as much as possible.
4. Audit the use of encryption keys
If you use KMS to store your encryption keys, then you can analyse the API
calls in CloudTrail to review the use of the keys over time and determine whether
the usage follows what you intended to implement for your access control.
5. Use mechanisms to keep people away from data
Instead of giving your users direct access to data, you should make use of
services such as Systems Manager to access the EC2 instances and the data on
them. Moreover, instead of handing out raw data directly to users, you
should share reports with them for sensitive information.
6. Automate data at rest protection
Protecting data in transit
1. Implement secure key and certificate management
For all kinds of HTTP-based access, you need to enforce encrypted data
transfer by using HTTPS, and for that you can make use of AWS Certificate
Manager with supported services such as ALBs, CloudFront and API
Gateway.
2. Enforce encryption in transit
You should have rules in place which redirect user requests from
HTTP to HTTPS to encrypt the user session with your application. On
top of that, you can also have rules and log analysis in place which
check for and block insecure protocols like HTTP. For data transfer
between on-prem and AWS, you can configure a VPN to encrypt the
data in transit.
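One way to enforce encryption in transit for S3 specifically is a bucket policy that denies any request made without TLS; the following boto3 sketch applies such a policy to a placeholder bucket.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-sensitive-data-bucket"  # hypothetical bucket name

# Deny every request that does not arrive over an encrypted (TLS) connection.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```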
3. Authenticate network communications
By using TLS with public endpoints and IPSec with AWS VPN
service you can ensure that there is a mutual authentication in place
between the two parties that intend to establish network
communication.
4. Automate detection of unintended data access
Services such as GuardDuty can analyse log data received from VPC
flow logs, the VPC DNS service etc. and can help determine malicious
traffic in your environment. Similarly, you can analyse S3 access logs to
determine whether there is any unintended data access and update your
bucket policies accordingly.
Incident Response
Despite all the security measures in place, everything breaks all the time.
Your organization should have processes in place which help you respond to
and mitigate the potential impact of security incidents. There are a number of
different approaches you can use when addressing incident response.
Educate
Educating your workforce about your cloud infrastructure and security
measures in place can help them in preparing for incidents in advance. By
investing in people and getting them to develop automation and process
development skills, your organization can benefit in a long run.
Prepare
1. Identify key personnel and external resources
One of the first steps in preparing for incident response situations is
identification of the service owners internally, as well as external resources
such as third-party vendors, AWS Support etc. Your team should have the
details of all these resources well in advance to reduce the response time.
2. Develop incident management plans
Any tools which are needed for investigation purposes should be pre-
deployed in your infrastructure. This reduces the time to begin an
investigation during an event. One example is a package like
tcpdump: if there is a security event and your responder needs to take a
packet capture, it is useful to have the tcpdump tool installed well in advance
on your instances.
5. Prepare forensic capabilities
It is a good idea to have a dedicated system with tools which enable you to
analyse disk images, file systems and any other artifact that may be involved
in a security event.
Simulate
Run Game Days: Using game days, you gather all your teams together and
dedicate them to working on simulated real-life scenarios to test your incident
management plans. When you break things in a controlled environment and
some plans do not work, it is a good learning opportunity for teams to iterate
on their playbooks and prepare for real incidents.
Iterate
Automate containment and recovery capability: With every use of your
playbook, you can codify the repeated tasks to further improve response time.
Once your code logic seems to work, you can integrate it directly with the
various event sources so that human interaction is minimal during incident
response.
Further Reading
https://d1.awsstatic.com/whitepapers/architecture/AWS-Security-Pillar.pdf
Ace the Reliability pillar
AWS Well-Architected Framework
Introduction
For a busy person who may not have the time to go through the hundreds of
pages of the Well-Architected Framework, this guide serves as an
abstract of the reliability pillar of the Well-Architected Framework.
Understanding the reliability pillar will ensure that you are equipped
with the knowledge of the best practices needed to make your
workload perform its intended function correctly and consistently when
it is expected to.
As AWS CTO Werner Vogels says, "Everything fails, all the time." Your
workload will also go through all sorts of tests, be it a spike in load or the
failure of more than one component at a time. Because of this it becomes
important to keep reliability in mind when designing your architectures and
evolving them over time. This paper provides in-depth best practice guidance
for implementing reliable workloads on AWS.
Like every pillar of the Well-Architected Framework, the reliability pillar
covers the following two broad areas:
Design Principles
Definitions
Design Principles
Following the various design principles highlighted in this pillar can help you
achieve reliable workloads in the cloud:
1. Automatically recover from failure
Definitions
AWS outlined four focus areas that encompass reliability in the cloud:
Foundations
Workload Architecture
Change Management
Failure Management
Let’s go into the details of each of them to understand the best practices in
these areas to achieve reliability in your cloud-based workloads.
Foundations
To achieve reliability, you must start with the foundations—an environment
where service quotas and network topology accommodate the workload.
These foundational requirements extend beyond a single workload or project
and influence the overall reliability of the system.
Various service limits, such as EC2, EBS and EIP limits and VPC
quotas, exist in the AWS cloud and you should be aware of such
limits by referring to Service Quotas. Using this service, you can
manage up to 100 service limits from a single location.
Additionally, Trusted Advisor checks include your service quota
usage so you can take necessary action.
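As a hedged example of keeping an eye on limits, this boto3 sketch lists applied EC2 quotas through the Service Quotas API and prints the ones related to On-Demand instances; the filter keyword is only illustrative.

```python
import boto3

quotas = boto3.client("service-quotas")

# Walk the applied EC2 quotas and flag the ones you care about tracking.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "On-Demand" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["Value"])
```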
Manage quotas across accounts and regions:
With environments spread across multiple accounts, you need to
keep track of service quotas across all the accounts since service
limits are set on per account and in most cases per region basis as
well. Using AWS Organizations, you can automate updating the
service limits of newly created accounts with the help of pre-
defined template to maintain a uniform structure across the
organization.
Accommodate fixed service quotas and constraints through
architecture
Your failed resources also count against your quota until
they are terminated. With this in mind, you should plan for at least
an AZ-level failure of resources when calculating the required headroom.
While architecting for highly reliable environments you should plan for intra-
system and inter-system connectivity between various networks, public and
private IP management and DNS resolution.
Use highly available network connectivity for your workload
public endpoints
Your VPC CIDR range and subnet size cannot be changed once
created, so you need to plan in advance to allocate enough IP
block to accommodate your workload requirements. Additionally,
you need to leave some room for future expansion by keeping
unused space in each subnet.
Prefer hub-and-spoke topologies over many-to-many mesh
If you have only two different networks then you can simply
connect them using 1:1 channel such as VPC peering, DX or AWS
VPN, but as the number of networks grows, the complexity of
meshed connections between all of them becomes untenable.
AWS Transit Gateway can help you maintain a hub and spoke
model, allowing you to route traffic across your multiple
networks.
You cannot natively connect two VPCs or your VPC and on-prem
network if they have overlapping IP address ranges. In order to
avoid such IP conflicts, you should use IP address management
(IPAM) systems to manage allocating of IP address ranges for
various resources.
Workload Architecture
In case of failures, your client should not just constantly retry, but
delay the requests with progressively longer intervals. This is
known as exponential back-off which increase the interval
between subsequent retries.
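A minimal, library-agnostic sketch of exponential back-off with jitter in Python; the retried operation and the delay parameters are assumptions for illustration.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call with exponential back-off and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Sleep a random duration up to base * 2^attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Example usage with a hypothetical downstream call (requires the requests library):
# call_with_backoff(lambda: requests.get("https://internal-api.example.com/orders"))
```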
Fail fast and limit queues
Your services should not store state in memory or on the local disk;
this enables client requests to be sent to any compute server
without dependency on an existing one in case of failure. For
example, you may have a web service that serves traffic behind a load
balancer, and your initial client session is established with
webserver-1, which also maintains the session state. In this case, if
webserver-1 goes down, webserver-2 will not have any
information about the client state. In such situations, the session
state should be offloaded and maintained in a service such as
ElastiCache. For serverless architectures, this can be done using
DynamoDB.
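To illustrate offloading session state, here is a hedged sketch that stores and loads sessions in a DynamoDB table; the table name, key schema and TTL attribute are assumptions.

```python
import time
from typing import Optional

import boto3

# Hypothetical table with partition key "session_id" and TTL enabled on "expires_at".
table = boto3.resource("dynamodb").Table("user-sessions")

def save_session(session_id: str, data: dict, ttl_seconds: int = 3600) -> None:
    """Persist session state outside the web server so any instance can serve the user."""
    table.put_item(Item={
        "session_id": session_id,
        "data": data,
        "expires_at": int(time.time()) + ttl_seconds,
    })

def load_session(session_id: str) -> Optional[dict]:
    """Fetch session state for a request that may land on a different server."""
    item = table.get_item(Key={"session_id": session_id}).get("Item")
    return item["data"] if item else None
```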
Implement emergency levers
Change Management
Your workload should be designed to adapt to changes, which can be either
due to spike in demand or feature deployments. Here the best practices for
change management:
1. Monitor Workload Resources
Monitoring your workload enables it to recognize when low-
performance thresholds are crossed or failures occur, so that it can
recover automatically in response. There are four distinct phases in
monitoring:
Generation
Apart from defining custom metrics from your log data, you
should also analyse the log files for broader trends and insights
into your workload. Amazon CloudWatch Logs
Insights comes with both pre-defined queries and the ability to
create custom queries to determine trends in your workload.
All the above phases should be reviewed periodically to implement
changes as the business priorities change. You can use additional
auditing mechanisms using CloudTrail and AWS config to identify
when and who invoked an AWS API or made an infrastructure
change.
2. Design your Workload to Adapt to Changes in Demand
For your EC2 resources you can use Auto Scaling to scale out or
in according to demand and trigger these actions by monitoring
specific workload metrics. With Amazon managed
services such as S3 or Lambda, the scaling activities are usually
taken care of by the service itself, which automatically scales to
meet the demand.
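As an example of scaling on demand, the sketch below attaches a target tracking policy to a hypothetical Auto Scaling group so it keeps average CPU around 50%; the group name and target value are placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU near 50% by adding or removing instances automatically.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",
    PolicyName="target-50-percent-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```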
Configure and use Amazon CloudFront or a trusted
content delivery network
Failure Management
Your workload should be able to withstand failures at any level. This can be
done by using the following strategies:
1. Back up Data
Identify and back up all data that needs to be backed up, or
reproduce the data from sources
You should identify all sources of data which, if lost, could affect
your workload outcomes. This data in turn should be backed up
depending on the resource. Amazon S3 can be one of the most
versatile backup destinations, and services such as EBS, RDS and
DynamoDB also come with built-in backup capabilities.
Secure and encrypt backups
2. Use Fault Isolation to Protect Your Workload
Bulkheads in a ship are the partitions that divide the ship into different
compartments so that even if one part is damaged, the rest
remains intact. A similar practice can be used by implementing
data partitions or using cells for your services. Your customer
requests should be routed to different cells based on shuffle
sharding, which ensures that only a limited number of customers
are impacted in case of an event.
3. Design your Workload to Withstand Component Failures
Your workloads should be architected for resiliency and they should be
able to continue running in case of component failures. Some of the ways
in which you can achieve this are:
Based on your RTO and RPO, you can use one of the following
strategies which have different complexities and order of
RTO/RPO:
a) Backup and restore (RPO in hours, RTO in 24 hours or less): You can back
up your data and applications using point-in-time backups in a DR region and
restore them to recover from a disaster.
b) Pilot light (RPO in minutes, RTO in hours): You can actively replicate your
data and workload architecture to a DR region. In this case, while your data is
most up to date, the resources will be switched off and only started when the
DR failover is invoked.
c) Warm standby (RPO in seconds, RTO in minutes): You can have a fully
functional version of your workload running in a DR region, which includes
both active components as well as data. However, the components will be
running as a scaled-down version to minimize the cost of running the
workload.
d) Multi-region active/active (RPO near zero and RTO potentially zero): In
this case you have identical copies of your workload running in two different
regions, just as you would across different availability zones within a region.
Since the workload is up to date across the different regions, you have
minimal downtime in case of failure of a particular region.
Further Reading
https://docs.aws.amazon.com/wellarchitected/latest/reliability-
pillar/welcome.html
Ace the Performance Efficiency pillar
Components
Like every pillar of the Well-Architected Framework, the performance
efficiency pillar covers the following two broad areas:
Design Principles
Definitions
Design Principles
Following the various design principles highlighted in this pillar can help you
achieve and maintain efficient workloads in the cloud.
1. Democratize advanced technologies
You should make it easier for your teams to adopt and implement new
and complex technologies rather than presenting them with a huge
learning cliff. This can be done by consuming the complex
technologies as a service in the cloud. For example, using Amazon
SQS as your message queuing service for your workload may take off
the heavy lifting of setting up of your own infrastructure for
messaging system and the teams can focus on product development.
2. Go global in minutes
Your architecture should be designed in a way that allows you to deploy your
workload across multiple AWS regions, which helps you reach a
wider audience and benefits end users with lower latency.
3. Use serverless architectures
Instead of using traditional servers, you can make use of serverless
architectures to run your code. This not only removes the burden of
provisioning and maintaining servers, but also ensures that the
managed services scale according to your workload.
4. Experiment more often
Definitions
AWS outlined four focus areas that encompass performance efficiency in the
cloud:
Selection
Review
Monitoring
Trade-offs
Let’s go into the details of each of them to understand them better and how
we can use them to create an efficient and sustainable workload in the cloud.
Selection
Load test your workload: Once you have deployed resources, you
should load test the environment to see how the workload
performs under stress conditions. By using CloudWatch metrics
you can see the performance of various components and make
changes to meet the desired requirements.
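As a hedged illustration of reviewing load test results, this boto3 sketch fetches p95 response-time datapoints for a placeholder ALB from CloudWatch over the last hour.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Pull p95 latency for a hypothetical ALB over the last hour of a load test.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=60,
    ExtendedStatistics=["p95"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["ExtendedStatistics"]["p95"])
```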
When you select your compute option you should also consider the
inputs such as GPUs, I/Os, memory versus compute intensive and
elasticity.
3. Storage architecture selection
You may also choose more than one storage type depending on your
workloads, for example S3 for the image storage accesses by your users
and Amazon EBS for storing the WordPress files taking care of your
dynamic website. Let’s review the four storage offerings by AWS.
Block Storage: Amazon EBS and EC2 instance store volumes
can be attached to your EC2 instances. They are accessible
from a single EC2 instance and ideal for latency sensitive
applications when the data is mostly accessed from the EC2
instances.
File Storage: Amazon EFS and FSx offer file storage systems
over industry-standard protocols such as NFS and SMB.
They can be accessed by multiple EC2 instances at the same
time and are suitable when a group of servers, such as a High-
Performance Compute (HPC) cluster, needs to access a shared
file system.
There are a few things which you need to keep in mind before making
an optimal network selection.
a) Understand how networking impacts performance: Depending on the
workloads and their consumers, factors such as latency and throughput can
have a negative or positive impact on performance. For High-Performance
Compute (HPC) workloads, you need to keep the resources in the cluster as
close together as possible, and you should place them in a placement group in
the VPC while taking advantage of enhanced networking and Elastic Network
Adaptors (ENA) on the EC2 instances. From the user perspective, you should
consider using Global Accelerator or CloudFront to minimize latency and
improve the delivery of content.
b) Extend connectivity to the on-prem network: If your workloads have a
dependency on the on-premises network, then depending on the latency and
throughput requirements, you should consider either a dedicated Direct
Connect connection or site-to-site VPN connectivity.
c) Evaluate other networking features: There are cloud-specific networking
features which can help you reduce costs as well as improve the overall data
transfer performance. By making use of gateway and interface VPC endpoints,
you can get reliable and private connectivity to public AWS services such as
S3. This also reduces your overall NAT Gateway data processing costs. For
global networks you can leverage the latency-based routing feature in Route 53
to route requests to the endpoint closest to the user's location.
d) Choose location based on network requirements: There are applications
sitting in the on-premises network which need to benefit from the cloud
offerings but cannot afford the latency or have data residency requirements.
Under such conditions you can evaluate Local Zones, Wavelength or Outposts,
which take the AWS cloud closer to your on-prem workloads and offer a
unique hybrid experience.
Review
AWS continuously innovates to meet customer needs and you should take
advantage of that to evolve your workload.
Stay up-to-date on new resources and services: As new
services, features and design patterns are released, you should
identify ways to implement them.
Monitoring
You must monitor your architecture’s performance in order to remediate any
issues before they impact your customers. While monitoring consists of the 5
distinct phases of Generation, Aggregation, Real-time processing and
alarming, Storage and Analytics, all these solutions fall into two different
categories:
Active Monitoring: In this type of monitoring, you can simulate
the user experience across various components in your workload
by running certain scripts. These can be as simple as simulating
packet loss in your network by sending ping requests from point
A to point B or creating synthetic HTTP GET/POST requests to
add test entries into your databases.
Trade-offs
You cannot always get the best of all worlds, and you need to think of trade-
offs to ensure an optimal approach. For example, a key-value data store can
provide single-digit millisecond latency for queries, but you will have to design
your application to use NoSQL-based queries rather than the traditional SQL-
based data access pattern. You can use trade-offs to improve performance by
following certain best practices:
Understand the areas where performance is most critical:
You should identify the areas where improving the performance of your
workload will have the biggest impact on customer experience. For example, a
website serving a large number of customers spread globally would
improve user experience by using CloudFront for content
delivery.
Learn about design patterns and services: There are many
reference architectures which may be different from your current
workload environment, but by adopting them you can make your workloads
more efficient. The Amazon Builders' Library contains various
tried and tested reference architectures and methods which you can
use to improve your resources with certain trade-offs but achieve
eventual efficiency.
Further Reading
https://d1.awsstatic.com/whitepapers/architecture/AWS-Performance-
Efficiency-Pillar.pdf
Ace the Operational Excellence pillar
AWS Well-Architected Framework
Introduction
For a busy person who may not have the time to go through the hundreds of pages
of the Well-Architected Framework, this guide serves as an abstract of the
operational excellence pillar of the Well-Architected Framework.
Understanding the operational excellence pillar will ensure that you are
equipped with the knowledge of the operational best practices which you can
apply to your workloads in the cloud.
Design Principles
There are several design principles highlighted in the operational
excellence pillar of the Well-Architected Framework that you need to
consider following if you wish to achieve operational excellence for your
applications hosted on AWS:
8. Perform operations as code
Since your cloud infrastructure is all software and virtual components,
it can all be automated. You can start by using Infrastructure as Code
services to deploy the environments and then introduce automation in logging,
monitoring and change management.
9. Make frequent, small, reversible changes
Your environment will need changes over time and you should determine the
frequency of these changes depending on their impact, ensuring they
are frequent but at the same time small changes which do not impact
customers. You should make use of snapshots wherever possible to reverse
those changes if required.
10. Refine operations procedures frequently
11. Anticipate failure
You need to prepare for failure by proactively testing various scenarios and
having automation in place to remove or mitigate the failures.
12. Learn from all operational failures
Any kind of operational failure should be taken as a lesson to not repeat the
same mistake again. This starts with a post mortem and includes corrective
actions and ends with sharing information at all organizational levels.
Definitions
AWS outlined five focus areas that encompass Operational excellence in the
cloud:
Organization
Prepare
Operate
Evolve
Let's go into the details of each of them to understand them better and how
we can use them to achieve operational excellence in our cloud environment.
Organization
Organization Priorities
Understanding the organizational priorities and reviewing them timely
ensures that you reap the benefits of all the efforts that have been put in place
and achieve your business goals.
1. Evaluate external customer needs
2. Evaluate internal customer needs
Your internal customers, which include your workforce and other teams in the
organization, are equally important as your external customers. The key
stakeholders should focus on their needs as well.
3. Evaluate governance requirements
When you decide your priorities, you should be aware of the overall management
approach decided by the senior executives to control and direct your
organization. This ensures that strategies, directives and instructions from
management are carried out systematically and effectively.
4. Evaluate external compliance requirements
Similar to governance requirements, by being aware of the compliance
requirements, your organization can demonstrate that it has conformed to
specific requirements in laws, regulations, contracts, strategies and policies.
5. Evaluate threat landscape
6. Evaluate trade-offs
In case of multiple alternatives and objectives, you need to consider the trade-
off of choosing one over the other, and whether losing something means
forgoing a benefit or opportunity against your business outcomes.
7. Manage benefits and risks
You need to be able to take calculated risks and take decisions which can be
reversed. Considering a benefit of a particular change may result in being
exposed to a risk, however if you can manage that or easily revert the change
then that is worth trying.
Operating Model
A well-defined organizational operating model, gives a clear understanding
of responsibilities will reduce the frequency of conflicting and redundant
efforts. This further helps achieve business outcomes due to a strong
alignment and relationships between business, development, and operations
teams.
There are two main aspects of Operating model:
1) Operating Model 2 by 2 Representations: With the help of various
illustrations, you can understand the relationship between teams in your
environment. You can use one or more of the operating models depending on
your organizational strategy or stage of development.
Fully Separated Operating Model
In this model, while the application and infrastructure teams still have
separate responsibilities, with the decentralized governance the
application team has fewer constraints. They are free to engineer and
operate new platform capabilities in support of their workload.
2) Relationships and Ownership
You may choose any type of operating model; however, you need to
have a clear understanding of the ownership of various resources and
processes. The team members should be well aware of their
responsibilities and identified owners need to have a set performance
target to continuously improve the business outcomes.
Organizational Culture
Often org culture may be very ingrained in the org. You have to put this in
your culture that you are supporting your team members effectively. So that
the team members can support operations and help you realize the desired
business outcome.
1. Executive Sponsorship
Your Senior Leadership should be the sponsor, advocate, and driver for the
adoption of best practices and evolution of the organization.
2. Team members are empowered to take action when outcomes are at risk
The operations team members should have enough resources and escalation
mechanisms to respond to events which may impact business outcomes.
3. Escalation is encouraged
Continuing on the empowerment point, team members should not hesitate to
escalate to the highest authorities to move things along, and the practice of
escalating in time should be followed.
4. Communications are timely, clear, and actionable
You should seek diverse perspective about your approaches through cross
team collaboration, this reduces the risk of confirmation bias and leads to
multiple idea generation.
Prepare
Prepare is all about setting things up for success through telemetry,
development tool chain and making informed decisions. There are four main
areas of engagement:
Design Telemetry
Telemetry allows you to gather important data from your resources and keep
you in control of your workload.
1. Implement application telemetry
2. Implement and configure workload telemetry
Similar to your application, your workloads should also publish data related to
their status. Almost all the services have relevant CloudWatch metrics which
you can monitor, and for custom workloads you can publish custom metrics to
CloudWatch to monitor values such as HTTP status codes, API latency etc.
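For instance, a custom application metric could be published to CloudWatch as in this hedged boto3 sketch; the namespace, metric name and value are illustrative.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a hypothetical application-level metric (checkout latency in milliseconds).
cloudwatch.put_metric_data(
    Namespace="MyApp/Orders",
    MetricData=[
        {
            "MetricName": "CheckoutLatency",
            "Dimensions": [{"Name": "Environment", "Value": "production"}],
            "Unit": "Milliseconds",
            "Value": 182.0,
        }
    ],
)
```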
3. Implement user activity telemetry
4. Implement dependency telemetry
Apart from the critical components, you should also configure your
workloads to emit telemetry data for resources on which your workloads
depend. These could be vendor systems, internet weather or external
databases.
5. Implement transaction traceability
Design for Operations
1. Use version control
Version control for your workloads enables you to track changes and releases,
which can further help you introduce small incremental changes as well as
help with rollback whenever necessary. Services such as AWS CodeCommit
and CloudFormation can help you manage version control of your
infrastructure as well as code.
2. Test and validate changes
With cloud you don’t have to worry about the cost and time to set up
new infrastructure. For any changes you can follow A/B testing or
blue-green deployments with a parallel infrastructure and test your
changes.
3. Use configuration management systems
4. Use build and deployment management systems
Similar to configuration management, you can also manage your build and
deployment systems in the form of CI/CD pipelines using AWS developer
tools such as CodeCommit, CodeDeploy and CodePipeline.
5. Perform patch management
Mitigate Deployment Risks
1. Plan for unsuccessful changes
With the help of timed snapshots or backups and a rollback plan in place, you
can ensure that even if there is a failed deployment your production
environment can continue to run as desired.
2. Test and validate changes
Any changes in your lifecycle stages can be tested and validated by creating
parallel systems. You can also make use of AWS CloudFormation to deploy
changes, which allows you to see the effect of a drift and also easily roll the
changes back.
3. Use deployment management systems
Instead of full-scale changes you can test using deployment canary testing or
one-box deployments to confirm the desired outcome of your changes.
5. Deploy using parallel environments
With the help of blue green deployments, you can deploy changes in a new
environment and route traffic to the new ones. A simple example would be
creating a new load balancer and ec2 instances for the new environment, once
you are ready to move the changes to production, you can simply change the
DNS record in Route53 to point to the new ALB.
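As a sketch of that final DNS cut-over, the snippet below upserts an alias record in Route 53 pointing at the new (green) ALB via boto3; the hosted zone ID, record name and ALB details are placeholders.

```python
import boto3

route53 = boto3.client("route53")

# Shift traffic to the green environment by repointing the alias record.
route53.change_resource_record_sets(
    HostedZoneId="Z1234567890ABC",
    ChangeBatch={
        "Comment": "Cut over traffic to the green environment",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": "Z35SXDOTRQ7X7K",  # the ALB's canonical hosted zone ID
                        "DNSName": "green-alb-1234567890.us-east-1.elb.amazonaws.com.",
                        "EvaluateTargetHealth": True,
                    },
                },
            }
        ],
    },
)
```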
6. Deploy frequent, small, reversible changes
With the help of canaries and various test benches, testing of your changes
should be automated with mechanisms in place to automatically roll them
back for a minimal production impact.
Understand Operational Readiness
1. Ensure personnel capability
With the help of AWS training resources, you can ensure that your workforce
from various domains is equipped with the knowledge to run operations
successfully.
2. Ensure consistent review of operational readiness
You should analyse the benefit and risk of your deployments before making
changes and evaluate against your workforce capabilities and governance
requirements.
Operate
You need to identify the key performance indicators of your business and
customer outcomes which determine whether your workload is working
efficiently towards desired results. These can be in terms of orders, revenue,
customer satisfaction score etc.
2. Define workload metrics
Based on your KPIs you should measure the performance of your workload.
Relating the business performance with the workload metrics can help you
benchmark your operational efforts. With the help of CloudWatch log agent
on your server, you can determine the custom metric which you need to
define and monitor.
3. Collect and analyze workload metrics
An appropriate baseline should be set up for each metric, which helps you
identify whether your workload is delivering the expected results. If a metric
exceeds its threshold, it should trigger an investigation.
5. Learn expected patterns of activity for workload
If you are using metrics to monitor your workload performance, you should
have alarms in place to alert you if they exceed a certain threshold.
7. Alert when workload anomalies are detected
You need to identify the key performance indicators of your business and
customer outcomes which determine whether your operations are working
towards the desired results. These can be in terms of new feature
releases, customer cases or uptime of your services.
2. Define operations metrics
You should have operational metrics by which you can measure their
effectiveness in achieving the KPIs. For example, Time to resolve (TTR) an
incident or new deployment success rate can be considered as metrics to
measure your business KPIs of uptime or new feature release.
3. Collect and analyze operations metrics
Once the metrics are in place, you can baseline them according to your
business outcomes and put in efforts to improve them towards a
benchmarking criterion.
5. Learn expected patterns of activity for operations
If your metrics go beyond a threshold which impact your workloads, you can
configure alerts with the help of CloudWatch alarms to be notified in time.
7. Alert when operations anomalies are detected
You may not always have a set threshold for your metrics, and in these cases
CloudWatch Anomaly Detection can help you by identifying expected values
through pattern recognition and machine learning.
Your business and leadership team should review your operational KPIs and
metrics to see if they align with the business goals and provide
recommendations if necessary.
Respond to Events
1. Use processes for event, incident, and problem management
You should have a playbook for every alert, and it should be updated regularly
to avoid surprises in case of an incident. These processes should have a
well-defined owner, and emphasis should be placed on automating them
as much as possible.
3. Prioritize operational events based on business impact
With the help of push notifications in form of email or SMS, you can keep
your users aware of any service impact and progress to the investigation.
6. Communicate status through dashboards
You can have internal and external dashboards to communicate the status of
your services and appropriate metrics. With the help of CloudWatch
dashboard and Amazon QuickSight various stakeholders can be made aware
of the latest status and use the data to relate to other dependent services.
7. Automate responses to events
With every iteration you should try to automate the most common scenarios
to reduce errors and the time taken to remediate problems. With CloudWatch
alarms you can define actions specific to EC2 or use SNS to trigger Lambda
functions for custom logic.
Evolve
You should have processes in place by which you can review the root
causes of customer or business impacting events. Rather than putting
a blame on anyone, efforts should be put to ensure that the same
mistakes or errors do not happen again. A simple exercise would be
assembling all the stake holders and asking the 5 WHYs.
3. Implement feedback loops
With the help of feedback loops in your workload and processes, you
can be informed about issues and improvement areas on a recurring
basis. This helps your environment evolve over time.
4. Perform Knowledge Management
Your workforce should be equipped with the knowledge they need to do their
job effectively. This means that the contents should be refreshed regularly
and old information archived.
5. Define drivers for improvement
Keep shared repository of your documentation for all your teams which can
be used by them as a reference to avoid repeating same mistakes again.
9. Allocate time to make improvements
You can set up dedicated time for improvement of your operations and gather
cross-team members to participate in the activities. Setting up parallel
environments and manually breaking them can help you test your processes
and tools and come up with improvement plans.
Further Reading
https://d1.awsstatic.com/whitepapers/architecture/AWS-Operational-
Excellence-Pillar.pdf
Ace the Cost Optimization pillar
AWS Well-Architected Framework
Introduction
For a busy person who may not have the time to go through the hundreds of
pages of the Well-Architected Framework, this guide serves as an
abstract of the cost optimization pillar of the Well-Architected
Framework. Understanding the cost optimization pillar will ensure that
you are equipped with the knowledge of the best practices for
achieving cost efficiency which you should implement on your
workloads in the cloud.
One of the things which has been least well understood as customers
transition into the cloud is cost optimization. One of the challenges in the
past was that the people building the systems, be they programmers or
architects, rarely had access to the cost of the components that they were
using to build those systems. They were using servers and databases but were
never exposed to the cost of these components. With the cloud this is
changing, and more and more engineers are becoming aware of the cost of the
components rather than just the finance team. The cost optimization pillar
seeks to empower you to maximize value from your investments, improve
forecasting accuracy and cost predictability, create a culture of ownership and
cost transparency, and continuously measure your optimization status. Let's
get started.
Components
Like every pillar of the Well-Architected Framework, the cost
optimization pillar covers the following two broad areas:
Design Principles
Definitions
Design Principles
Following the various design principles highlighted in this pillar can help you
optimize the cost of your workloads:
1. Implement cloud financial management
You need to measure the impact of business outcomes like revenue, customer
gains etc against the input cost of running your workloads. This helps you
understand the impact of increase in cost for your business goals.
4. Stop spending money on undifferentiated heavy lifting
5. Analyze and attribute expenditure
By properly tagging your resources, you can identify the cost per project,
workload or department. This helps establish the ROI of your business
efforts and accordingly creates room for optimization.
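To illustrate attributing cost by tag, this hedged boto3 sketch queries Cost Explorer for last month's unblended cost grouped by a hypothetical "project" cost-allocation tag; the dates and tag key are placeholders.

```python
import boto3

ce = boto3.client("ce")

# Break down a month's unblended cost by the "project" cost-allocation tag.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "project"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]          # e.g. "project$checkout"
    amount = group["Metrics"]["UnblendedCost"]["Amount"]
    print(tag_value, amount)
```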
Definitions
AWS outlined five focus areas that encompass cost optimization in the cloud:
Practice Cloud Financial Management
Expenditure and usage awareness
Cost-effective resources
Manage demand and supplying resources
Optimize over time
Let’s go into the details of each of them to understand them better and how
we can use them to optimize costs in our Cloud environment.
Practice Cloud Financial Management
Similarly, the product and technology leads should understand the budgets
and service level agreements. These financial requirements should be kept in
mind while designing cloud-based workloads for your business applications.
This partnership helps both the teams have real-time visibility into costs and
also establish a standard operating procedure to handle variance in cloud
spending.
Additionally, business unit owners and third parties should understand the
cloud business model so that they are aligned with the financial goals and
work towards optimal return of investments (ROI)
Cloud Budgets and Forecasts
The efficiency, speed and agility offered by the cloud mean there can be
highly variable cost and usage. You can use AWS Cost Explorer to
forecast daily or monthly cloud costs based on your historical cost trends.
Your existing budgeting and forecasting processes should be modified to take
inputs from AWS Cost Explorer to identify trends and business drivers.
Cost-Aware Processes
5. Implement cost awareness in your organizational processes
By using AWS Cost Explorer and AWS Budgets you can regularly
report cost and usage optimization within your organization. This
should not be limited to the management or financial teams but should
be extended to all the stakeholders, including technology teams. You
can further customize reports with the Cost and Usage Report (CUR)
data in Amazon QuickSight, which can help create reports
according to target audiences.
7. Monitor cost and usage proactively
3. Decommissioning Resources
Cost-effective resources
With cloud you might be doing 10 times what you used to do and achieving
the outcomes accordingly. It becomes necessary to choose appropriate
resources, services and configuration for your workload to achieve cost
savings. The following aspects should be considered:
1. Evaluate cost when selecting services
Since the cloud offers you a pay-as-you-go model, you can eliminate the need
for costly and wasteful overprovisioning. With on-demand provisioning of
resources you can ensure that you have resources running only when you
need them and scale up or down when required. You can use the following
approaches to manage demand and supply resources:
1. Analyse the workload
3. Dynamic Supply
Optimize over time
As new services and features are released, your review process should
consider implementing them after analysing the business impact of
making the changes. There are various AWS blogs and channels
which your teams can subscribe to in order to stay up to date with the
latest offerings.
The answers to these questions during the review help you identify any gaps
in your existing workloads and implement the best practices in your AWS
environment.
Further Reading
https://d0.awsstatic.com/whitepapers/architecture/AWS-Cost-Optimization-
Pillar.pdf