
Index

Ace the AWS Well-Architected Framework
Ace the Security pillar
Ace the Reliability pillar
Ace the Performance Efficiency pillar
Ace the Operational Excellence pillar
Ace the Cost Optimization pillar


Ace the AWS Well-Architected Framework
When you are developing and deploying applications in the cloud, it's easy
to get too focused on the end result because of the demands for speed and
agility. However, it is important to keep in mind that creating a technology
solution is a lot like constructing a physical building: if the foundation
isn't solid, it may cause structural problems and undermine the integrity
and function of the building. From the initial design phase through the
recurring review of your workload, it is important to ensure that your
architecture is in accordance with best practices. This ensures that your
workload environment continues to function and deliver results the way it
is supposed to, in the short as well as the long run.
The AWS Well-Architected Framework is a collection of such best practices
which you can benchmark your systems against to ensure that there are no
surprises at any stage of your journey in the cloud. The document itself is
very detailed, and at some point you may feel like you are drinking from a
firehose. You could spend a weekend going through it, but for now, here is
a short summary of it to spark your interest in following the best
practices in the cloud.

Introduction
The Well-Architected Framework is a framework for measuring applications
running in the cloud against a set of strategies and best practices. The
framework has been compiled by AWS after working closely with thousands
of customers. The purpose of this document is to empower people to make
informed decisions about their cloud architectures and help them
understand the impact of their decisions. The program is a starting point in
the architectural process that should be used as a catalyst for further
thinking and conversations.
There are three main components of the Well-Architected Framework:

• General Design Principles
• The Five Pillars
• The Review Process (or Questions)

Let's go through these components one by one.

General Design Principles

Cloud computing has opened up the technology space to a whole new way
of thinking, where constraints that we used to have in the traditional
environment no longer exist. When thinking about the design principles, it
is interesting to see how things work out in contrast to the traditional
environment.

1) Stop guessing your capacity needs:

In the traditional environment you had to guess how much infrastructure
you needed, and that was often based on high-level business requirements
and demand. This exercise was usually done before even a line of code was
written. With cloud computing, you no longer have to do the guesswork.
You can start with the bare minimum capacity needed and scale up on
demand to whatever you need, whenever you need it.

2) Test systems at production scale:

In your on-premises environment, you could not afford to test at scale;
it was simply not economically feasible. So you went to production and saw
a whole new class of issues arise at high scale. In the cloud, you can
create test scenarios at production scale, calibrate your resources, and
terminate them after you are done. This ensures that there are no
surprises when there is a huge event planned and the numbers given by your
marketing team don't match up with the actual traffic on your application.
3) Automate to make architectural experimentation easier:

With cloud-native workloads you can easily replicate your workload through
automation. This allows you to run experiments on your architectures
without the manual effort and downtime expected in traditional workloads.

4) Allow for evolutionary architectures:

Any proof of concept or architectural experimentation in traditional
environments was done by hand and generally only at the start of a project.
You usually ended up with a static architecture that was confined, and it
was difficult to even think about making a change. With the cloud you can
automate your changes and test them. This gives the business more scope to
innovate at a faster pace.

5) Drive architectures using data:

With traditional environments, you probably used models and assumptions
to size your architecture, rather than modelling based on larger data sets.
When your application infrastructure is defined as code, you can collect
data and understand how changes to it affect the workload. This helps you
design the architecture using considerably larger datasets than you could
with your on-premises infrastructure.

6) Improve through game days:

Finally, in a traditional environment you would exercise your runbook only
when something bad had happened. In the cloud these constraints have been
removed: you can afford to simulate events or break things intentionally
and have your team check your operational readiness and resiliency to
failures. Game days are also a good platform for all the different teams to
come together and understand the dependencies between systems, especially
in case of failures.

The Five Pillars

The Five Pillars of the Well-Architected Framework, i.e. Operational
Excellence, Security, Reliability, Performance Efficiency, and Cost
Optimization, help provide a stable and consistent base which you can use
to initially design your infrastructure and keep referring back to as the
infrastructure evolves. If you go through the details of each pillar
individually, you will also learn about its focus areas with respect to
design principles, definitions, and best practices. Here is a summarized
version of the five pillars.

Security
Use a zero-trust model when thinking about security in the cloud: you
should treat application components and services as potentially malicious
entities. This means that we need to apply security at all levels in the
cloud. The following are the important domains involved in securing
systems with zero trust in the cloud:
IAM: The identity and access management service in AWS allows you to
create role-based access policies with granular permissions for each user
and service. When working with identity management you need to ensure
that you have strong sign-in mechanisms, use temporary credentials, audit
the roles and credentials periodically, and store passwords securely. You
can leverage AWS services such as IAM, Secrets Manager, and AWS Single
Sign-On to meet your requirements.
Network security: This means adding multiple layers of defence to your
environment to harden it against any threat. The most basic building block
is the VPC, with which you can create your own virtual private cloud that
is isolated from other customers' resources. Within the VPC you should
segregate the placement of your resources depending on whether they need
to be internet facing or internal components. On top of that, you should
make use of granular controls such as security groups, network ACLs, and
route tables to prevent malicious users from gaining access to your
resources. Your environment should also be built to withstand external
attacks based on common risks such as the OWASP Top Ten using AWS WAF, as
well as mitigate volumetric DDoS attacks using AWS Shield.
Data encryption and protection: As AWS customers, you are responsible for
protecting your data. This starts with data classification: dividing data
into categories based on criticality and access level. Based on the
classification, you should design your architecture to use services which
offer the expected availability and durability. For example, S3 offers
99.999999999% durability for your objects. Similarly, in order to protect
your data, you need to have measures in place such as encryption at rest
(KMS) and encryption in transit (SSL/TLS, VPN).

Performance efficiency
This pillar focuses on efficiency and scalability. With the cloud you can
handle almost any amount of traffic, so you don't need to lock in your
capacity up front. In the on-premises model of doing things, servers are
expensive and may take a long time to deploy and configure. In that model,
the servers used are mostly of the same kind, and there may be one server
doing multiple functions. The better way to do this in the cloud is to
provision a cheap and quick solution, with the freedom to choose the
server type that most closely matches the workload.
Because every server is interchangeable and quick to deploy, we can easily
scale our capacity by adding more servers.
The two concepts for performance efficiency are:
I. Selection: the ability to choose services which match your
workload. AWS has over 175 services to match your workload.
Achieving performance through selection means being able to
choose the right tool for your job.
II. Scaling: choosing how the service scales is important for
continued performance. There are two types of scaling:
a) vertical scaling (increase the size) and b) horizontal
scaling (increase the number).

Consider the example of your web application servers running on EC2
instances in an Auto Scaling group. As your traffic increases, you can have
scaling policies in place which automatically add new instances as load
balancer targets when a metric crosses a pre-set threshold. Similarly, for
SQL-based databases such as Aurora, you can configure vertical scaling to
increase your database capacity. It is important to continuously monitor
your application workload and measure it against specific metrics to set a
trigger point for scaling.
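
As a minimal sketch of such a policy (assuming an existing Auto Scaling
group, here called "web-asg"), a target tracking scaling policy could be
attached with boto3 along the following lines; the group name and target
value are illustrative assumptions:

```python
import boto3

# Hypothetical group name; replace with your own Auto Scaling group.
ASG_NAME = "web-asg"

autoscaling = boto3.client("autoscaling")

# Keep average CPU utilization of the group around 60% by adding or
# removing instances automatically.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="keep-cpu-at-60-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```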

Reliability
This pillar focuses on building services that are resilient to both service
and infrastructure disruptions. You should architect your services with
reliability in mind. We can think in terms of blast radius, that is, the
maximum impact in the event of a failure. To build reliable systems, you
need to minimize the blast radius.
One of the most common techniques is spreading your resources across
multiple Availability Zones. You should have automatic triggers in place to
mitigate the impact on the application in case of certain failures. While
Auto Scaling is a service which helps you create a fault-tolerant server
environment at scale, you should also consider using a microservices-based
architecture wherever possible. With microservices-based, decoupled
architectures, changes or failures in one API or component do not break
the functionality of your application entirely, and they also help you
recover quickly from failures.
Lastly, you should also have a DR strategy in place, which could be in the
form of data backups or a standby environment in another region, on
premises, or in a multi-cloud setup.

Operational Excellence
This pillar focuses on continuous improvement. You need to think about
automation and eliminating human error: the more operations can be
automated, the lower the chance of an error. In addition to fewer errors,
automation also helps you continuously improve your internal processes.
When you want to gain as much insight into your workload as possible, you
have to think about the right telemetry data. Using the most common form
of monitoring available, CloudWatch metrics, you can keep an eye on your
resource load. You can get additional logging by pushing your application
logs to CloudWatch Logs. It does not stop there: in addition to collecting
the telemetry data, you need to set up alerts with the right thresholds to
make actual use of the metric and log information.
If you have an event for which you are generating an alert, you should
also have a runbook to handle it. As an important aspect of the
operational excellence pillar, you should constantly automate and improve
your reaction to alerts. This ensures that your reaction is error free and
you do not have to wake up your on-call engineer at 2:00 AM when one of
your servers starts throwing HTTP 500 errors.

Cost optimization
This pillar helps you achieve business outcomes while minimizing costs.
Cost optimization in the cloud can be explained in terms of OpEx instead of
CapEx. In simpler terms, OpEx is a pay-as-you-go model, whereas CapEx is a
one-time up-front payment or large yearly licensing fees. Instead of paying
a huge cost upfront, the cloud gives you the option to invest in innovation.
With AWS you should make use of tagging on your resources to check the
bill amount corresponding to each project and group of resources. This
helps you identify improvement areas in terms of workload distribution. You
can also make use of the findings of services such as Trusted Advisor,
which provide insight into the utilization of resources and whether you
should downsize to an optimal value to reduce your costs. Further, you can
set up billing alerts to ensure there are no surprises at the end of the
month because one engineer decided to use a NAT Gateway to download
terabytes of data from S3.
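
As a minimal sketch of such a billing alert (assuming billing alerts are
enabled in your account settings and that an SNS topic, here hypothetically
named "billing-alerts", already exists), a CloudWatch alarm on the
estimated charges metric could look like this:

```python
import boto3

# Hypothetical SNS topic ARN; replace with your own notification target.
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:billing-alerts"

# Billing metrics are only published in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Notify the SNS topic when estimated monthly charges exceed 500 USD.
cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-500-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,          # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SNS_TOPIC_ARN],
)
```
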
While you can set up your architecture with optimal cost, it is also
important to keep yourself up to date with the latest features and service
releases, as they can help you further reduce your costs by slightly
modifying your environments. Some examples include using Gateway VPC
endpoints instead of NAT Gateways, or using shared VPC architectures for
services which are commonly used across your organization and removing
duplicate resources.

The Review Process


This is the stage where you put things into action. The review process is
where you get your team and resources in one place and answer a set of
questions to see if your architecture meets the best practices highlighted
in the five pillars. The answers to the questions during the process of
going through the Well-Architected Framework aren't right or wrong, or yes
or no. They are based on the business realities and choices that are made
for a given system at a given time. So, when working on a Well-Architected
engagement, don't just go through the questions, but also use your
knowledge and expertise to build off the framework and do what's right for
your given situation. Here are some sample questions from each of the
pillars.

SEC 3: How do you manage permissions for people and machines?


Manage permissions to control access to people and machine identities
that require access to AWS and your workload. Permissions control who
can access what, and under what conditions.

OPS 6: How do you mitigate deployment risks?


Adopt approaches that provide fast feedback on quality and enable rapid
recovery from changes that do not have desired outcomes. Using these
practices mitigates the impact of issues introduced through the deployment
of changes.

PERF 5: How do you configure your networking solution?


The optimal network solution for a workload varies based on latency,
throughput requirements, jitter, and bandwidth. Physical constraints,
such as user or on-premises resources, determine location options.
These constraints can be offset with edge locations or resource
placement.
REL 7: How do you design your workload to adapt to changes in demand?
A scalable workload provides elasticity to add or remove resources
automatically so that they closely match the current demand at any
given point in time.
COST 4: How do you decommission resources?
Implement change control and resource management from project
inception to end-of-life. This ensures you shut down or terminate unused
resources to reduce waste.

You can also try to evaluate each component or entity of your workload
separately against the best practices highlighted in the Well-Architected
Framework; this will give you a granular view of each distinct function of
your environment.

Well-Architected Tool


Usually customers are assisted by AWS Solutions Architects (SAs) and
Technical Account Managers (TAMs). However, not all customers get SAs and
TAMs, so they can either make use of third-party vendors to do the review
process or use the Well-Architected Tool from the AWS Console. This tool
allows you to do the following:

1) Define your workload and stage.
2) Answer the questions against each pillar for the workload.
3) Get a summary of your workload review.
4) Keep multiple reviews in one place in the form of a dashboard.
5) Get links to the best practices and videos highlighted in the pillars.
6) Archive and audit changes to your workload review as you progress.

Depending on your organizational culture you can do a recurring review
every few months; however, it is important to do the review before the
start of the design phase of any project. This helps you create a stable
foundation for your environment, so all you need to worry about is the
functional working of your application.

Further Reading
https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-
Architected_Framework.pdf
Ace the Security pillar
AWS Well-Architected Framework
Introduction
For a busy person who may not have the time to go through the hundreds of
pages of the Well-Architected Framework, this guide serves as an abstract
of the security pillar of the Well-Architected Framework. Understanding
the security pillar will ensure that you are equipped with knowledge of
the security best practices which you should implement for your workloads
in the cloud.

The Security Pillar

The Security pillar includes the ability to protect data, systems, and
assets to take advantage of cloud technologies to improve your security.
As much as one can argue, the security pillar is one of the most important
pillars of the Well-Architected Framework. There might be close
competition between reliability and security: without security there is no
trust with your customers, and without reliability you are not serving
your customers. Ultimately it comes down to the organization's goals which
one is prioritized. For now, without further ado, let's go through the
details of the security pillar.

Components
As with any pillar of the Well-Architected Framework, the security pillar
covers the following two broad areas:
Design Principles
Definitions

Design Principles

Following the various design principles highlighted in this pillar can help
you secure your applications:
1. Implement a strong identity foundation

Implement the principle of least privilege and enforce separation of
duties, with appropriate authorization for each interaction with your AWS
resources. Centralize identity management and aim to eliminate reliance on
long-term static credentials.
2. Enable traceability

You should be able to monitor at every stack level in your environment,
and this should give you information about any changes in your environment
in as close to real time as possible. You also need to have mechanisms in
place by which you can automatically take action corresponding to the
changes. It could be a notification action based on a failed login attempt
or a capacity adjustment based on a spike in traffic to your application.
3. Apply security at all layers

You also need to have security enabled at all stack levels. This can be at
the edge network layer with the help of a secure VPN or a dedicated fibre
connection. At the infrastructure level you should make use of a virtual
private cloud (VPC) and then implement controls with network ACLs for
subnet-level security and security groups for instance-level security. In
order to tighten security for your application, you should ensure that the
end user does not access the instance/OS directly but goes through a load
balancer. On top of that, if the application supports it, you may make use
of SSO for application access to reduce the risk of compromising your
environment.
4. Automate security best practices

Instead of using manual processes, your controls should be designed and
managed as code or scripts. For example, you can use CloudWatch metrics to
generate an alarm based on your pre-defined thresholds and then use the
alarm to notify services such as Lambda to take a custom pre-defined action
according to your security best practices.
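
A minimal sketch of the alarm side of such a pipeline (assuming a
hypothetical SNS topic named "security-automation" that your remediation
Lambda function is subscribed to, and a custom metric published by an
existing metric filter) could look like this:

```python
import boto3

# Hypothetical SNS topic ARN that a remediation Lambda is subscribed to.
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:security-automation"

cloudwatch = boto3.client("cloudwatch")

# Alarm on a custom metric that counts failed console sign-ins
# (the metric filter publishing this metric is assumed to exist already).
cloudwatch.put_metric_alarm(
    AlarmName="too-many-failed-sign-ins",
    Namespace="Custom/Security",
    MetricName="FailedConsoleSignIns",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[SNS_TOPIC_ARN],
)
```
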
5. Protect data in transit and at rest

Depending on the data type, data should be classified into high, medium,
or low sensitivity levels. Based on these sensitivity levels you need to
have mechanisms in place to encrypt the data both in transit (HTTPS, VPN,
etc.) and at rest (using KMS). You also need to implement access control
mechanisms which determine who can read the data and who can modify or
copy it.
6. Keep people away from data

Another best practice is to eliminate the need to access data manually,
and where access is required, to provide read-only access. You can put
these controls in place using IAM roles.
7. Prepare for security events

Conducting security incident simulations is a valuable exercise for
organizations and a useful tool to improve how an organization handles
security events. These simulations can be tabletop sessions, individualized
labs, or full team exercises conducted in a simulated environment. Based on
the learnings and findings from these exercises, you can further strengthen
controls according to your organizational requirements.

Definitions
AWS outlined five focus areas that encompass security in the cloud:
IAM
Detection
Infrastructure protection
Data protection
Incident response

Let's go into the details of each of them to understand them better and
see how we can use them to secure our cloud environment.

Identity and Access Management

All infrastructures need measures in place which ensure that you know who
your users are, you can identify them, and you can then set an appropriate
level of authorization. There are a variety of AWS services you can use to
enforce this, and these capabilities basically fall into two main areas:
identity management and permissions management.
Identity management:
As the name says, this identifies the entity which is trying to interact
with your environment. The entity can be a human, in the form of your
administrator, developer, third-party vendor, or the end user who is the
consumer of your application. It can also be a machine, in the form of your
infrastructure and applications, such as an EC2 instance, a Lambda
function, or an external server sitting outside of AWS. You should follow
the best practices below for robust identity management.
1. Rely on a centralized identity provider

Imagine you have an employee who has different credentials when accessing
different applications across your on-premises environment, AWS services,
and your application environment. If the employee leaves the organization,
it becomes a tedious task to revoke access from all the systems. This is
why, for your workforce, you should have a centralized identity provider to
manage user identities in one place. You can integrate external identity
providers such as Okta or ADFS via SAML 2.0 with AWS IAM and enable an
authenticated user to access various AWS services.
If you have multiple accounts under one AWS Organization, you can also make
use of AWS Single Sign-On (AWS SSO) and integrate your identity provider
with it to enable access to AWS services for your authenticated employees.
2. Leverage user groups and attributes

For both working at scale and ease of management, you can make use of user
groups to apply a similar set of security restrictions. User groups are
available in AWS SSO as well as in IAM.
3. Use strong sign-in mechanisms

You should have a password policy which enforces complex passwords, and
this should be backed by multi-factor authentication (MFA). MFA is
supported both for IAM users and through AWS SSO.
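
As a minimal sketch, the account-level IAM password policy can be
tightened with a single API call; the exact values below are illustrative
assumptions, not recommendations from the framework itself:

```python
import boto3

iam = boto3.client("iam")

# Enforce complex passwords and periodic rotation for IAM users.
iam.update_account_password_policy(
    MinimumPasswordLength=14,
    RequireSymbols=True,
    RequireNumbers=True,
    RequireUppercaseCharacters=True,
    RequireLowercaseCharacters=True,
    MaxPasswordAge=90,            # force rotation every 90 days
    PasswordReusePrevention=24,   # disallow reusing the last 24 passwords
    AllowUsersToChangePassword=True,
)
```
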
4. Use temporary credentials

Depending on the entity accessing the services, you can use different
mechanisms to make it dynamically acquire temporary credentials. Using
temporary credentials takes you away from the risk of compromised
passwords. For your employees you can make use of AWS SSO or federation
with IAM; if it is a system such as an EC2 instance or a Lambda function,
then you can make use of IAM roles to provide it with temporary
credentials for accessing AWS services and accounts.
Depending on your application environment, you may also require your end
users to access your AWS resources (for example, an S3 bucket for uploads).
You can make use of Cognito identity pools for such cases to assign
temporary tokens to the consumers of the application.
5. Audit and rotate credentials periodically

Your password policy should have an expiration period which ensures that
users are forced to change their password after a pre-determined duration.
You can also make use of AWS Config rules which enforce the IAM password
policy for rotation of credentials.
6. Store and use secrets securely

All non-IAM credentials, such as database passwords, should not be stored
as plain text or environment variables; rather, you should make use of AWS
Secrets Manager to store them. You should also configure IAM restrictions
so that only certain users are able to use the secrets.
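
A minimal sketch of how an application might fetch such a secret at
runtime instead of reading it from an environment variable (the secret
name "prod/db/password" is a hypothetical example):

```python
import json
import boto3

secretsmanager = boto3.client("secretsmanager")

def get_db_credentials(secret_id: str = "prod/db/password") -> dict:
    """Fetch a JSON secret from AWS Secrets Manager at runtime."""
    response = secretsmanager.get_secret_value(SecretId=secret_id)
    # Secrets created as key/value pairs are returned as a JSON string.
    return json.loads(response["SecretString"])

# Example usage: pass the credentials to your database driver.
# creds = get_db_credentials()
# connect(user=creds["username"], password=creds["password"])
```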

Permissions management:
1. Define permission guardrails for your organization

As your workloads grow, you should create multiple accounts to manage them
and then use AWS Organizations and service control policies to restrict
access based on accounts, regions, services, and other conditions.
2. Grant least privilege access

Both users and systems should be granted only the permissions required to
do a specific task. This can be extended further by setting up permissions
boundaries and attribute-based access control.
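
As a minimal sketch of least privilege, the following creates a customer
managed policy that only allows reading objects from a single,
hypothetical bucket ("example-reports-bucket") rather than granting broad
S3 access:

```python
import json
import boto3

iam = boto3.client("iam")

# Allow read-only access to one specific bucket and nothing else.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports-bucket",
                "arn:aws:s3:::example-reports-bucket/*",
            ],
        }
    ],
}

iam.create_policy(
    PolicyName="reports-read-only",
    PolicyDocument=json.dumps(policy_document),
)
```
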
3. Analyse public and cross-account access

If your AWS resources need cross-account access, that should only be
granted to trusted accounts which are part of your AWS Organization, and
you should use resource-based policies to restrict the allowed actions. A
resource should be made public only if absolutely required.
4. Share resources securely

For environments spread across multiple accounts, if you need to share
your AWS resources, you should make use of AWS Resource Access Manager and
share resources only with trusted principals from your AWS Organization.
5. Reduce permissions continuously

You should periodically use IAM Access Analyzer and CloudTrail to review
unused credentials (users, roles, etc.) and tighten the permissions
attached to IAM policies for all users based on least privilege.
6. Establish emergency access process

In extreme situations the normal access to your workloads may break. For
such cases, you should have alternate methods to access your environment,
for example by using cross-account roles.

Detection

Detection is a critical part of your security readiness as well as your
action plan. You need both proactive and reactive strategies for security
incident detection. Your approach to detection mechanisms should include
both configuration and investigation. Let's look at the details of each of
these approaches.
Configure
1. Configure service and application logging
You can use services such as AWS CloudTrail, AWS Config, and Amazon
GuardDuty to record event history and resource configuration changes and
to detect malicious activity across your accounts. For your application
traffic you can use VPC Flow Logs to analyse patterns and anomalies in the
network traffic flowing within as well as in and out of your VPC. For OS-
or application-specific logs you can install the CloudWatch agent on your
instances and export logs to CloudWatch Logs.
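
A minimal sketch of turning on VPC Flow Logs for an existing VPC; the VPC
ID, log group name, and IAM role ARN below are placeholder assumptions:

```python
import boto3

ec2 = boto3.client("ec2")

# Publish flow logs for all traffic in the VPC to a CloudWatch Logs group.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],     # placeholder VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="vpc-flow-logs",
    # Role that allows the flow logs service to write to CloudWatch Logs.
    DeliverLogsPermissionArn="arn:aws:iam::123456789012:role/flow-logs-role",
)
```
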
2. Analyse logs, findings, and metrics centrally

Apart from collecting all the logs, you should have mechanisms in place to
analyse them and extract meaningful information, so that you can establish
a benchmark of good versus bad behaviour. AWS services such as GuardDuty
and Security Hub can aggregate, deduplicate, and analyse logs received from
other services. These two services alone can help you make sense of the
data received from VPC Flow Logs, the VPC DNS service, CloudTrail,
Inspector, Macie, and Firewall Manager. This ensures that you get an
overall view of what is happening in your AWS environment and allows you
to route, escalate, and manage events or findings.

Investigate
1. Implement actionable security events

Your log configuration and analysis are only useful if you have an action
plan or runbook for each type of finding or event. You need documentation
for each type of finding, and you should continuously update the
corresponding runbook as events occur.
2. Automate response to events

Based on pre-defined or custom event patterns from various sources, you can
use EventBridge to trigger automated workflows, which can target either AWS
services or third-party integrations. Similarly, AWS Config rules can
automatically help you enforce your compliance policy by detecting and
evaluating changes in the system.
Infrastructure Protection

Infrastructure protection ensures that the services and systems used in
your workload are protected against unintended and unauthorized access and
potential vulnerabilities. You can achieve this by protecting networks and
protecting compute.
Protecting Networks
1. Create network layers

You should segregate resources into public and private networks. Only
resources such as internet-facing load balancers should be in public
subnets; the rest of the resources, like web servers, RDS, or even managed
services like Lambda, should use private subnets to prevent unauthorized
access attempts. For large environments with multiple accounts and VPCs,
you can use services like Transit Gateway for inter-VPC and edge
networking. This ensures that your resources are not exposed to the
internet.
2. Control traffic at all layers

Extending the previous point, depending on where you place your resources,
you should have control measures in place to allow only specific types of
traffic. You can achieve this using network ACLs (NACLs) at the subnet
level and security groups at the instance level. Further, you can remove
unnecessary public exposure from your resources by means of VPC endpoints,
which allow your instances to access public AWS services over a secure,
private channel from within the VPC.
3. Implement inspection and protection

For the resources which need public access, you should inspect all traffic
that tries to connect to them and have rules in place to protect them from
common attacks. This can be done using AWS WAF and Shield Advanced, which
can be placed in front of your workload resources.
4. Automate network protection

Threats do not wait for a specific time when you are online; they may pose
a risk to your resources 24/7. So you should be prepared to automatically
update your security measures based on new sources and threats. AWS WAF
managed rules, along with the WAF security automations solution, can help
you dynamically block certain traffic based on new and known threats.
Protecting Compute
1. Perform vulnerability management

Security vulnerability assessments should be part of your development and
deployment pipelines and of your static production systems. This ensures
your code and infrastructure are protected against the latest threats.
Amazon Inspector can automatically check for common and recent CVEs to
which your resources may be exposed. Additionally, you can make use of
other methods such as fuzzing to inject malformed data into your
application and test for bugs.

2. Reduce attack surface

You can reduce the attack surface by removing unused components from your
operating systems, be it software packages or applications. The components
you do need should be configured following hardening best practices and
security guides.
3. Enable people to perform actions at a distance
In order to reduce the risk of human error, you should avoid direct access
to systems and resources as much as possible. This includes SSH, RDP, and
AWS Management Console access. As an alternative, you can use AWS Systems
Manager to run commands on your EC2 instances, or use CloudFormation
templates via pipelines to make changes to your infrastructure environment.
4. Implement managed services

Instead of hosting your own resources for databases or containers, you can
make use of managed services, which take care of provisioning,
availability, and security patching of the resources. By using services
such as Amazon RDS, ECS, or Lambda, you can let AWS look after the security
aspects of the infrastructure and focus on better things.
5. Validate software integrity

As simple as it may sound, you should ensure that any external code or
packages used in your workloads come from trusted sources, and you should
check that they have not been tampered with by verifying checksums and
certificates from the original author.
6. Automate compute protection

All the compute protection practices mentioned in this section should be
automated, both to save time and to reduce human error.

Data protection
Data protection can be categorized into the following three areas,
described in the sections below.
Data classification

1. Identify the data within your workload

By determining the scope and sensitivity of the data processed by your
workloads, you can classify the data based on sensitivity levels. All
security measures for data protection should then be put in place based on
this classification.
2. Define data protection controls

Once the data has been classified, you can separate it by tagging it based
on classification and by using different accounts for sensitive data.
Access to such data can also be controlled using tag conditions on the
resources in IAM policies.
3. Define data lifecycle management

Based on legal and business requirements, you should determine the duration
for which you need to retain the data. Similarly, the data destruction and
access policies should keep the sensitivity of the data in mind.
4. Automate identification and classification

As you constantly add consumable data, you need to have mechanisms in place
to automatically identify and classify it. Amazon Macie, which uses machine
learning, can automatically discover, classify, and protect sensitive data
in AWS.
Protecting data at rest
Data protection at rest aims to secure inactive data stored on your block
devices, object storage, or databases. Some of the best practices for
robust data protection at rest include:
1. Implement secure key management

Encrypting your data at rest makes it unreadable without access to the
secret keys. However, it is equally important to store, rotate, and manage
access control for these secret keys. With the help of services like KMS
and CloudHSM you can store and manage encryption keys and further integrate
them with different AWS services for data access.

2. Enforce encryption at rest

You can configure settings on your resources, such as S3 or EC2, which
ensure that the data store is encrypted by default. You can also prevent
the upload of unencrypted objects to S3 by using custom bucket policies.
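
A minimal sketch of enforcing default encryption on a hypothetical bucket
("example-data-bucket"), here using SSE-KMS with a placeholder key alias
that you would substitute with your own:

```python
import boto3

s3 = boto3.client("s3")

# Encrypt every new object in the bucket by default using a KMS key.
s3.put_bucket_encryption(
    Bucket="example-data-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/example-data-key",  # placeholder
                }
            }
        ]
    },
)
```
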
3. Enforce access control

You can audit access to your data at rest by using CloudTrail as well as
service-level logs such as S3 access logs or VPC Flow Logs. By analysing
the access details, you can put measures in place to prevent unnecessary
access and reduce public access as much as possible.
4. Audit the use of encryption keys

If you use KMS to store your encryption keys, you can analyse the API calls
in CloudTrail to review the use of the keys over time and determine whether
the usage follows what you intended to implement in your access controls.
5. Use mechanisms to keep people away from data

Instead of giving your users direct access to data, you should make use of
services such as Systems Manager to access EC2 instances and the data on
them. Moreover, instead of handing out raw data directly to users, you
should share reports with them for sensitive information.
6. Automate data at rest protection

By enabling AWS Config compliance rules for encryption of data on resources
such as EBS, you can ensure automatic data-at-rest protection across all
your resources.
Protecting data in transit
Data in transit is any data that is sent from one system to another, and in
order to ensure the confidentiality and integrity of that data, it is
important to protect it in transit.
1. Implement secure key and certificate management

For all kinds of HTTP-based access, you need to enforce encrypted data
transfer by using HTTPS, and for that you can make use of AWS Certificate
Manager with supported services such as ALBs, CloudFront, and API Gateway.
2. Enforce encryption in transit

You should have rules in place which redirect user requests from HTTP to
HTTPS to encrypt the user session with your application. On top of that,
you can also have rules and log analysis in place which check for and block
insecure protocols like plain HTTP. For data transfer between on-premises
and AWS, you can configure a VPN to encrypt the data in transit.
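
A minimal sketch of such a redirect on an Application Load Balancer (the
load balancer ARN below is a placeholder): an HTTP listener on port 80
whose only action is to permanently redirect every request to HTTPS.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN; replace with your Application Load Balancer's ARN.
ALB_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "loadbalancer/app/example-alb/1234567890abcdef"
)

# Listen on port 80 and permanently redirect every request to HTTPS on 443.
elbv2.create_listener(
    LoadBalancerArn=ALB_ARN,
    Protocol="HTTP",
    Port=80,
    DefaultActions=[
        {
            "Type": "redirect",
            "RedirectConfig": {
                "Protocol": "HTTPS",
                "Port": "443",
                "StatusCode": "HTTP_301",
            },
        }
    ],
)
```
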
3. Authenticate network communications
By using TLS with public endpoints and IPsec with the AWS VPN service, you
can ensure that there is mutual authentication between the two parties that
intend to establish network communication.
4. Automate detection of unintended data access

Services such as GuardDuty can analyse log data received from VPC Flow
Logs, the VPC DNS service, etc., and can help determine malicious traffic
in your environment. Similarly, you can analyse S3 access logs to determine
whether there is any unintended data access and update your bucket policies
accordingly.

Incident Response

Despite all the security measures in place, everything breaks all the time.
Your organization should have processes in place which help you respond to
and mitigate the potential impact of security incidents. There are a number
of different approaches you can use when addressing incident response.
Educate
Educating your workforce about your cloud infrastructure and the security
measures in place helps them prepare for incidents in advance. By investing
in people and getting them to develop automation and process development
skills, your organization can benefit in the long run.
Prepare
1. Identify key personnel and external resources

One of the first steps in preparing for incident response situations is the
identification of internal service owners as well as external resources
such as third-party vendors, AWS Support, etc. Your team should have the
details of all these resources well in advance to reduce response time.
2. Develop incident management plans

Your incident management plans should be created in the form of playbooks
with action plans for the most likely scenarios for your workload and
organization. You can test these action plans with simulations and then
further iterate on them so that they meet production requirements.
3. Pre-provision access

Your incident responders, who may be part of your on-call teams or
operations centre, should have sufficient access to be able to investigate
quickly and reduce the response time during a security event. These
credentials should be restricted to specific personnel and should not be
used for day-to-day activities.
4. Pre-deploy tools

Any tools which are needed for investigation purposes should be pre-
deployed in your infrastructure. This reduces the time to begin an
investigation during an event. One example is a package like tcpdump: if
there is a security event and your responder needs to take a packet
capture, it is useful to have the tcpdump tool installed well in advance on
your instances.
5. Prepare forensic capabilities

It is a good idea to have a dedicated system with tools which enable you to
analyse disk images, file systems, and any other artifact that may be
involved in a security event.
Simulate
Run game days: During game days, you gather all your teams together and
dedicate them to working on simulated real-life scenarios to test your
incident management plans. When you break things in a controlled
environment and some plans do not work, it is a good learning opportunity
for teams to iterate on their playbooks and prepare for real incidents.
Iterate
Automate containment and recovery capability: With every use of your
playbooks, you can codify the repeated tasks to further improve response
time. Once your code logic works, you can integrate it directly with the
various event sources so that human interaction in incident response is
kept to a minimum.

Review based on Security Pillar


Some of the questions which you will be going through when you review
your workload against the security pillar of the Well-Architected Framework
are:
SEC 1: How do you securely operate your workload?
SEC 2: How do you manage identities for people and machines?
SEC 3: How do you manage permissions for people and machines?
SEC 4: How do you detect and investigate security events?
SEC 5: How do you protect your network resources?
SEC 6: How do you protect your compute resources?
SEC 7: How do you classify your data?
SEC 8: How do you protect your data at rest?
SEC 9: How do you protect your data in transit?
SEC 10: How do you anticipate, respond to, and recover from
incidents?
The answers to these questions during the review help you identify any gaps
in your existing workloads and implement the best practices in your AWS
environment.

Further Reading
https://d1.awsstatic.com/whitepapers/architecture/AWS-Security-Pillar.pdf
Ace the Reliability pillar
AWS Well-Architected Framework
Introduction
For a busy person who may not have the time to go through the hundreds of
pages of the Well-Architected Framework, this guide serves as an abstract
of the reliability pillar of the Well-Architected Framework. Understanding
the reliability pillar will ensure that you are equipped with knowledge of
the best practices needed to make your workload perform its intended
function correctly and consistently, whenever it is expected to.

The Reliability Pillar

As AWS CTO Werner Vogels says, "Everything fails, all the time." Your
workload will also go through all sorts of tests, be it a spike in load or
the failure of more than one component at a time. This is why it is
important to keep reliability in mind when designing your architectures and
evolving them over time. This paper provides in-depth best practice
guidance for implementing reliable workloads on AWS.
As with any pillar of the Well-Architected Framework, the reliability
pillar also covers the following two broad areas:
Design Principles
Definitions

Design Principles

Following the various design principles highlighted in this pillar can help
you achieve reliable workloads in the cloud:
1. Automatically recover from failure

You should identify key performance indicators (KPIs) as a measure of
business value and monitor your workload against them, so that an automatic
action can be triggered when a threshold is breached.
2. Test recovery procedures

Apart from testing your workloads for normal functionality, you should also
test how they recover from failures. In the cloud this can be made easy by
using automation to simulate various failing components and making sure the
recovery paths kick in to maintain consistent outputs.
3. Scale horizontally to increase aggregate workload availability

Heavily scaled-up resources can create too much dependency on a single
component and are often a single point of failure. By scaling horizontally,
your requests are distributed across multiple smaller resources, making the
architecture more resilient.
4. Stop guessing capacity

In the on-premises world, as demand increases the workload may not be able
to sustain it, leading to failure, because there is very little scope for
agile provisioning. With cloud-based workloads you can easily monitor
various metrics and set up scaling policies to add or remove capacity on
demand, without having to worry about turbulent spikes in traffic.
5. Manage change in automation

By making changes to your infrastructure through automation, you can track
and review them. This also establishes operational procedures to automate
infrastructure changes in times of need.

Definitions

AWS outlined four focus areas that encompass reliability in the cloud:
Foundations
Workload Architecture
Change Management
Failure Management

Let’s go into the details of each of them to understand the best practices in
these areas to achieve reliability in your cloud-based workloads.

Foundations
To achieve reliability, you must start with the foundations—an environment
where service quotas and network topology accommodate the workload.
These foundational requirements extend beyond a single workload or project
and influence the overall reliability of the system.

1. Manage Service Quotas and Constraints:

Various service limits or quotas exist in the cloud to prevent you from
accidentally provisioning more resources than you need and to limit the
request rate on API operations so as to protect services from abuse. There
are also physical resource constraints, such as the maximum available
bandwidth on a fibre link or the storage space on a physical hard drive.
Being aware of such limits is therefore a crucial foundational requirement.
Be aware of service quotas and constraints:

Various service limits, such as EC2, EBS, and EIP limits and VPC
quotas, exist in the AWS cloud, and you should be aware of them by
referring to Service Quotas. Using this service, you can manage up to
100 service limits from a single location. Additionally, Trusted
Advisor checks include your service quota usage so that you can take
the necessary action.
Manage quotas across accounts and regions:
With environments spread across multiple accounts, you need to keep
track of service quotas in all of the accounts, since service limits
are set per account and, in most cases, per region as well. Using AWS
Organizations, you can automate updating the service limits of newly
created accounts with the help of pre-defined templates to maintain a
uniform structure across the organization.
Accommodate fixed service quotas and constraints through
architecture:

Some limits, such as the network bandwidth of your DX connection or
the AWS Lambda payload size, are fixed. Your architecture should be
designed with these unchangeable service quotas and physical resource
limits in mind.
Monitor and manage quotas:

You can monitor your utilization and keep track of threshold breaches
by setting up appropriate CloudWatch alarms based on the service quota
metrics for the corresponding services.
Automate quota management:

You can also integrate your change management or ticketing system with
Service Quotas to generate limit increase requests automatically based
on threshold breaches.
Ensure that a sufficient gap exists between the current quotas and the
maximum usage to accommodate failover:

Failed resources are still counted against your quota until they are
terminated. Considering this, you should account for at least an
AZ-level failure of resources when calculating the gap.

2. Plan your network topology:

While architecting highly reliable environments, you should plan for intra-
system and inter-system connectivity between the various networks, public
and private IP management, and DNS resolution.
Use highly available network connectivity for your workload's
public endpoints:

For a workload which is publicly accessible, you need highly available
routing. Services such as Route 53 for DNS, Amazon CloudFront as a
CDN, AWS Global Accelerator for anycast routing, and ELB for HTTP or
TCP termination can give you a managed and highly available endpoint
for your end-user connectivity.
You should also plan for the mitigation of external attacks, as public
endpoints are at more risk of exposure than any other resource. AWS
Shield can provide automatic protection against these kinds of attacks
and allow only legitimate traffic to pass through.
Provision redundant connectivity between private networks in the
cloud and on-premises environments:

For private connectivity to your on-premises network, which relies on
services such as Direct Connect over physical fibre or AWS VPN over a
secure tunnel across the internet, you need to plan for failure by
provisioning appropriate redundancy. For Direct Connect connections,
you should plan for router, fibre, provider, and POP-level failures by
selecting redundant paths. If cost is a constraint, you can plan for a
secondary VPN connection as a backup for your primary Direct Connect
connectivity; however, you should be aware of the difference in
performance between the two channels.
Ensure IP subnet allocation accounts for expansion and
availability:

Your VPC CIDR range and subnet sizes cannot be changed once created,
so you need to plan in advance and allocate a large enough IP block to
accommodate your workload requirements. Additionally, you need to
leave room for future expansion by keeping unused space in each
subnet.
Prefer hub-and-spoke topologies over many-to-many mesh:

If you have only two networks, you can simply connect them using a 1:1
channel such as VPC peering, DX, or AWS VPN, but as the number of
networks grows, the complexity of meshed connections between all of
them becomes untenable. AWS Transit Gateway helps you maintain a
hub-and-spoke model, allowing you to route traffic across your
multiple networks.

Enforce non-overlapping private IP address ranges in all private
address spaces where they are connected:

You cannot natively connect two VPCs, or a VPC and an on-premises
network, if they have overlapping IP address ranges. In order to avoid
such IP conflicts, you should use IP address management (IPAM) systems
to manage the allocation of IP address ranges to your various
resources.

Workload Architecture

You need to make upfront design decisions to create a reliable workload.
This includes both software and infrastructure components. The following
patterns can be referred to for reliability:
1. Design Your Workload Service Architecture

Build highly scalable and reliable workloads using a service-oriented
architecture (SOA) or a microservices architecture. Service-oriented
architecture is the practice of making software components reusable via
service interfaces, while a microservices architecture goes further to make
the components smaller and simpler.
Choose how to segment your workload:

You should avoid monolithic architectures wherever possible. Instead,
you should choose between a service-oriented architecture (SOA) and
microservices. With their shared release pipelines, rigid scaling, and
high impact of changes, monolithic architectures make it hard to adopt
new technologies. With the smaller segmentation offered by SOA or
microservices, you get greater agility, organizational flexibility,
and scalability.
Build services focused on specific business domains and
functionality:

Services built using SOA or microservices should each have a very
specific task to do. This helps you understand the reliability
requirements of the different components. A Domain-Driven Design (DDD)
model can be used with microservices to model business problems using
entities. For example, in a dating app, user sign-up, verification,
login, profile management, matches, and payments can be different
entities. You can then identify the entities which share common
features and group them using a bounded context; in this case sign-up,
verification, and login would form one bounded context. Using these
contexts, you can then identify the boundaries of the resulting
microservices architecture.
Provide service contracts per API:
Each service created with microservices can be exposed through an API
that has a very specific task to accomplish. Various teams in your
organization can make use of this API and agree upon a set of
expectations such as rate limits, performance, and API definitions.
With such a contract in place, the owning team can configure their API
to meet these requirements so that the overall workload outcomes
remain consistent.
2. Design Interactions in a Distributed System to Prevent Failures

Distributed systems rely on communications networks to interconnect
components, such as servers or services. Your workload must operate
reliably despite data loss or latency in these networks. Components of the
distributed system must operate in a way that does not negatively impact
other components or the workload.
Identify which kind of distributed system is required:

There are two main kinds of distributed systems:

1. Hard real-time distributed systems, which require responses to be
provided as soon as the request is received. Even though the requests
can be unpredictable, your system should be able to respond to them
rapidly.

2. Soft real-time distributed systems, which have a more generous time
window of minutes or more for a response.

Based on the business requirements, your workload should choose one of
these kinds of systems and set its reliability metrics accordingly.
Implement loosely coupled dependencies:

Loosely coupled architectures help isolate the behaviour of a
component from the other components that depend on it, increasing
resiliency and agility. In tightly coupled architectures, on the other
hand, changes to one component force the components that rely on it to
change as well. For example, if your client communicates directly with
an EC2 instance, then a failure of that EC2 instance directly impacts
the client, because it behaves as a tightly coupled, synchronous
environment. By using a load balancer to send requests to various EC2
instances, a failure of a single EC2 instance will not affect your
client. Similarly, asynchronous workflows with the help of Amazon SQS
enable independent worker nodes to keep processing messages even if
one of them fails.
Make all responses idempotent:

Each request received by your service should be served exactly once,
so if you receive multiple requests that are meant to have the same
effect, it is possible that they are erroneously processed multiple
times. By using idempotency tokens in your APIs, you can guarantee
that even if your service receives a request more than once, it will
not create duplicate records or errors.
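
A minimal sketch of the idea, using a hypothetical DynamoDB table named
"orders" whose partition key is the client-supplied idempotency token: a
conditional write succeeds only the first time, so a retried request is
recognized as a duplicate instead of creating a second record.

```python
import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "orders"  # hypothetical table keyed on the idempotency token

def create_order(token: str, payload: str) -> str:
    """Process a request at most once per idempotency token."""
    try:
        # The condition makes the write fail if the token was seen before.
        dynamodb.put_item(
            TableName=TABLE,
            Item={"token": {"S": token}, "payload": {"S": payload}},
            ConditionExpression="attribute_not_exists(#t)",
            ExpressionAttributeNames={"#t": "token"},
        )
        return "created"
    except dynamodb.exceptions.ConditionalCheckFailedException:
        # Duplicate delivery of the same request: no second record is made.
        return "already processed"
```
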
Do constant work:

Sudden large or rapid changes in your systems can lead to your service
getting overwhelmed; instead, the system should be designed to
interact in the same way as it would during a large failure. This is
especially applicable to health check systems: they should constantly
work with the same payload, whether no server is failing or all of
them are.
3. Design Interactions in a Distributed System to Mitigate or
Withstand Failures

While the advantages of a distributed system far outweigh its challenges,
your reliable systems should be designed with the failures which can occur
in distributed systems in mind. One of the most common is the communication
between different systems. Your workload should continue to operate
reliably despite data loss or latency in the network. A few best practices
to mitigate such failures are:
Implement graceful degradation to transform applicable hard
dependencies into soft dependencies

When multiple systems depend on each other and one of them fails, instead of cascading the failure across all components, the system immediately connected to the failed one should be able to provide a fixed response in order to avoid passing the failure on to other dependent components.
Throttle requests

Your systems should be designed with knowledge of the rate of requests they can serve, and a higher rate of requests should be throttled by rejecting the excess and sending a response indicating that throttling has occurred. This signals the client to back off and retry at a slower rate instead of constantly pushing your system to process requests above its capacity. Services such as API Gateway have methods to throttle requests, and Amazon SQS and Kinesis can buffer requests as well.
Control and limit retry calls

In case of failures, your client should not just constantly retry, but should delay its requests with progressively longer intervals. This is known as exponential back-off, which increases the interval between subsequent retries.
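A minimal sketch of exponential back-off with jitter in plain Python; call_service stands in for whatever request your client makes, and the retry limits are illustrative:

    import random
    import time

    def call_with_backoff(call_service, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Retry a callable with exponentially growing, jittered delays."""
        for attempt in range(max_attempts):
            try:
                return call_service()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Exponential back-off capped at max_delay, with full jitter so that
                # many clients do not retry in lock-step.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))

Note that the AWS SDKs already implement retries with back-off for AWS API calls; a sketch like this is mainly useful for your own service-to-service calls.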
Fail fast and limit queues

On the server side, if your workload is unable to respond quickly, then it should fail fast instead of keeping resources reserved. Similarly, a high number of requests can be buffered, but the queue should not grow so long that it ends up serving stale requests for which the client has already timed out.

Set client timeouts

Your client-side timeouts should be determined after verifying the end-to-end connection path, and you should not rely on default values, which in some cases can be too high. This avoids unnecessary retries and thereby avoids overloading the system.
Make services stateless where possible

Your services should not store state in memory or on local disk; this enables client requests to be sent to any compute node without a dependency on an existing one in case of failure. For example, you may have a web service that serves traffic behind a load balancer, and your initial client session is established with webserver-1, which also maintains the session state. In this case, if webserver-1 goes down, webserver-2 will not have any information about the client state. In such situations, the session state should be offloaded and maintained in services such as ElastiCache. For serverless architectures, this can be done using DynamoDB.
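As a rough sketch of offloading session state, assuming an ElastiCache for Redis endpoint and the redis client library (the endpoint, key names and TTL are illustrative):

    import json
    import redis

    # Any web server behind the load balancer can read or write the session,
    # so a request can be served by whichever instance is healthy.
    cache = redis.Redis(host="my-sessions.xxxxxx.use1.cache.amazonaws.com", port=6379)  # hypothetical endpoint

    def save_session(session_id, data, ttl_seconds=1800):
        cache.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

    def load_session(session_id):
        raw = cache.get(f"session:{session_id}")
        return json.loads(raw) if raw else None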
Implement emergency levers

Emergency levers can be used to mitigate the availability impact on your workload. This can be as simple as serving a static response or blocking a specific type of traffic entirely. It ensures that the availability of the entire workload is not affected while you focus on determining the root cause of the issue.

Change Management
Your workload should be designed to adapt to changes, which can be either
due to spike in demand or feature deployments. Here the best practices for
change management:
1. Monitor Workload Resources
Monitoring your workload enables you to recognize when low-performance thresholds are crossed or failures occur, so that it can recover automatically in response. There are four distinct phases in monitoring:
Generation

You should monitor all the components of your workload, which may be AWS services or custom services. While AWS services can be monitored using CloudWatch metrics, your custom workloads can be monitored using third-party tools that either integrate with CloudWatch or use their own solution.
Aggregation

Similar to metrics, logs from all the components should be stored either in CloudWatch or S3, and you can then define metrics based on specific filters applied to the logs. For example, in your web server logs you can apply a filter to count HTTP 5XX errors, which may not be available as a pre-defined metric (a sketch of this follows at the end of this section).
Real-time processing and alarming

Your metrics should be processed in real time, first by notifying appropriate subscribers using Simple Notification Service (SNS). These subscribers can be your operations team or other AWS services that take appropriate action. The actions taken when a metric threshold is breached should be automated, for example by triggering auto scaling events or implementing custom logic in Lambda functions.
Storage and Analytics

Apart from defining custom metrics from your log data, you should also analyse the log files for broader trends and insights into your workload. Amazon CloudWatch Logs Insights comes with both pre-defined queries and the ability to create custom queries to determine trends in your workload.
All the above phases should be reviewed periodically to implement changes as business priorities change. You can add auditing mechanisms using CloudTrail and AWS Config to identify when and by whom an AWS API was invoked or an infrastructure change was made.
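As a sketch of the Aggregation and Real-time processing phases described above, the snippet below turns HTTP 5XX lines in a web server log group into a custom metric and raises an alarm to an SNS topic when that metric spikes; the log group name, filter pattern, threshold and topic ARN are illustrative assumptions:

    import boto3

    logs = boto3.client("logs")
    cloudwatch = boto3.client("cloudwatch")

    # Aggregation: derive a metric from filtered web server access logs.
    logs.put_metric_filter(
        logGroupName="/my-app/webserver",                 # hypothetical log group
        filterName="http-5xx",
        filterPattern="[ip, id, user, timestamp, request, status_code=5*, size]",
        metricTransformations=[{
            "metricName": "Http5xxCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",
        }],
    )

    # Real-time processing and alarming: notify subscribers when the metric breaches.
    cloudwatch.put_metric_alarm(
        AlarmName="my-app-5xx-spike",
        Namespace="MyApp",
        MetricName="Http5xxCount",
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=3,
        Threshold=50,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
    )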
2. Design your Workload to Adapt to Changes in Demand

Your workload should provide elasticity so that it can scale to meet the demand at any given point in time.
Use automation when obtaining or scaling resources

For your EC2 resources, you can use Auto Scaling to scale out or in according to demand, and trigger these actions by monitoring specific workload metrics. With Amazon managed services such as S3 or Lambda, scaling is usually taken care of by the service itself, which automatically scales to meet the demand.
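A minimal sketch of metric-driven scaling for an existing Auto Scaling group; the group name and target value are assumptions:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Keep the group's average CPU around 50%; the group scales out and in automatically.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="my-web-asg",        # hypothetical group
        PolicyName="cpu-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
            "TargetValue": 50.0,
        },
    )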
Configure and use Amazon CloudFront or a trusted
content delivery network

A content delivery network service such as CloudFront can serve static content to your end users, which not only reduces the load on your compute resources but also provides an optimal experience to the clients.
Obtain resources upon detection of impairment to a
workload

Instead of just monitoring your resource metrics, you should monitor overall workload metrics to detect failures by using health checks. Upon detecting impairment, resources should be scaled reactively using automatic or manual mechanisms.
Obtain resources upon detection that more resources are
needed for a workload

You should configure the scaling criteria based on workload-specific parameters such as request rate, time to process each request, and request/response size. Your scaling criteria should be decided based on these parameters and modified if the parameters change.
Load test your workload

You should implement a load testing strategy to determine if the scaling activities can keep up with the demand. With the cloud, it is easy to create new resources and terminate them after the testing is done, so the cost of testing is a fraction of what you would usually pay in an on-premises environment.
3. Implement Change

Use runbooks for standard activities such as deployment

By using runbooks for standard changes, you know the next step to take in case something goes wrong. These runbooks can be created for both manual and automatic changes and reduce ambiguity in your change management process.
Integrate functional testing as part of your deployment

Your functional tests should be among the most important tests performed as part of deployment. They should be run in pre-production as well as production environments, and any failure should result in a rollback of the changes.
Integrate resiliency testing as part of your deployment

Your deployments should include resiliency tests in order to ensure that your workload continues to function in case of failure of a component.
Deploy using immutable infrastructure

Your code changes or updates should not be deployed directly to the production systems. Instead, you should spin up new resources to implement these changes and move the traffic to the new resources. The shift of traffic can be done either using canary deployment, where a small number of customers are directed to the new version, or using blue/green deployment, where a full fleet of the application is deployed in parallel and customers are then directed to the new group of resources.
Deploy changes with automation

You should make use of automation by implementing CI/CD pipelines with services such as AWS CodeDeploy, which automatically deploys application code to EC2, Lambda, ECS and on-premises servers. This ensures a consistent approach every time you want to deploy changes to your workload.

Failure Management
Your workload should be able to withstand failures at any level. This can be
done by using the following strategies:
1. Back up Data
Identify and back up all data that needs to be backed up, or reproduce the data from sources
You should identify all sources of data which, if lost, can affect your workload outcomes. This data should in turn be backed up depending on the resource. Amazon S3 is one of the most versatile backup destinations, and there are built-in backup capabilities in services such as EBS, RDS or DynamoDB.
Secure and encrypt backup

Access to the backup of your data should be secured in the same way as access to the original data. You can use IAM to restrict access to the backup data and also encrypt it using AWS KMS.
Perform data backup automatically

You should determine a backup schedule and run the backup


process automatically. AWS Backup service can provide you a
centralized view of all your backup schedules across multiple
AWS services.
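A rough sketch of scheduling backups centrally with AWS Backup, assuming a backup vault already exists (the plan name, schedule and retention are illustrative):

    import boto3

    backup = boto3.client("backup")

    # Daily backups at 05:00 UTC, retained for 35 days, stored in an existing vault.
    plan = backup.create_backup_plan(BackupPlan={
        "BackupPlanName": "daily-backups",            # hypothetical plan name
        "Rules": [{
            "RuleName": "daily-0500-utc",
            "TargetBackupVaultName": "Default",
            "ScheduleExpression": "cron(0 5 * * ? *)",
            "Lifecycle": {"DeleteAfterDays": 35},
        }],
    })
    print(plan["BackupPlanId"])

You would then attach a backup selection (the resources or tags to protect) to the plan.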
Perform periodic recovery of the data to verify backup
integrity and processes

Your backups should meet the recovery time objective (RTO) and recovery point objective (RPO), and this should be verified by performing recovery tests. You can set up a test environment in which you restore backup data and check its integrity and content against your goals.
2. Use Fault Isolation to Protect Your Workload
By setting up fault-isolated boundaries, you can limit the effect of a failure to a limited number of components within your workload. A few best practices around this are:
Deploy the workload to multiple locations

Your workload should be deployed across multiple locations, starting with spreading its components across different Availability Zones within an AWS Region. This ensures that even if there are component-related or AZ-related failures in a specific Availability Zone, you can continue to serve the traffic using the other AZs. If you have extreme availability requirements or other business goals, then you can also follow a multi-region architecture. Additional locations where you can deploy your workload are AWS Local Zones, edge locations such as CloudFront or Global Accelerator, and even the on-premises network.
Automate recovery for components constrained to a single
location

If there are workloads which are restricted to running in a single Availability Zone or in the on-premises network, then you should implement automatic recovery either by provisioning redundant components or by using backup solutions. This is specifically applicable to High Performance Computing (HPC) workloads or to services such as Amazon EMR or Redshift.
Use bulkhead architectures to limit scope of impact

Bulkheads in a ship are the partitions that divide it into different compartments, so that even if one part is damaged, the rest remains intact. A similar practice can be used by implementing data partitioning or using cells for your services. Your customer requests should be routed to different cells based on shuffle sharding, which ensures that only a limited number of customers are impacted in case of an event.
3. Design your Workload to Withstand Component Failures
Your workloads should be architected for resiliency and should be able to continue running in case of component failures. Some of the ways in which you can achieve this are:

Monitor all components of the workload to detect failures

You should monitor both technical failures of workload components and the business metrics of the running workload, which give you an overall view of how your workload is helping to achieve the business outcomes.

Failover to healthy resources

By using the health check mechanisms in Elastic Load Balancing or Route 53, you can monitor the health of the workload in a specific location and fail over to healthy resources by means of load balancing changes or DNS record updates.
Automate healing on all layers

The ability to restart is an important recovery mechanism, and it is applicable at all layers of your workload. For simple component failures, by making the services stateless you can replace an entire component, such as an EC2 instance, as part of a restart. For large-scale events such as Availability Zone failures, it is generally better to mark an alternate Availability Zone as the preferred one until stability is achieved, instead of provisioning multiple resources at once in the impacted AZ.
Send notifications when events impact availability

Even if an issue caused by an event is automatically resolved, you should configure notifications to identify underlying problems that need to be investigated to avoid recurrence.
4. Test Reliability
Use playbooks to investigate failures
Perform post-incident analysis

You should review any customer-impacting events to identify the contributing factors and make changes to take preventive actions for the future. Based on the findings from the analysis, you can also add tests to catch such failures before they occur again.
Test functional requirements

By running synthetic testing such as canary testing, you can simulate the customer experience and understand whether the workload meets its functional requirements. CloudWatch Synthetics can help you create canaries to monitor your customer-facing endpoints and APIs.
Test scaling and performance requirements

In the cloud you can provision and terminate production-scale test environments and run load tests to ensure that your workload scales to meet the performance outcomes under demand. Before load testing, you need to make sure that your scaling settings, service quotas and base resources are configured to run as expected under load.
Test resiliency using chaos engineering

Chaos engineering is a way to inject failures into your pre-production and production environments. There are many open-source tools, such as Netflix Chaos Monkey, which can help you inject failures and understand how your workload reacts to different failures.
Conduct game days regularly

Using game days, you can exercise your procedures for responding to events and failures in conditions as close to production as possible. These will help you understand where improvements can be made and give your organization experience in dealing with production-impacting events.
5. Plan for Disaster Recovery (DR)
Based on your business needs, you should implement a disaster recovery strategy based on the locations and function of your workload and data. Some of the key things to keep in mind are:
Define recovery objectives for downtime and data loss

Your workload should have a recovery time objective (RTO) and a recovery point objective (RPO). The RTO sets the maximum acceptable delay between the interruption of service and the restoration of service. The RPO defines the maximum acceptable amount of time since the last data recovery point, which determines how much data loss is acceptable.
Use defined recovery strategies to meet the recovery
objectives

Based on your RTO and RPO, you can use one of the following strategies, which have different complexities and orders of RTO/RPO:
a) Backup and restore (RPO in hours, RTO in 24 hours or less): You can back up your data and applications using point-in-time backups into a DR Region and restore them to recover from a disaster.
b) Pilot light (RPO in minutes, RTO in hours): You can actively replicate your data and workload architecture to a DR Region. In this case, while your data is kept up to date, the resources are switched off and only started when DR failover is invoked.
c) Warm standby (RPO in seconds, RTO in minutes): You can have a fully functional version of your workload running in a DR Region, which includes both active components and data. However, the components run as a scaled-down version to minimize the cost of running the workload.
d) Multi-region active/active (RPO near zero, RTO potentially zero): In this case you have identical copies of your workload running in two different Regions, similar to what you would do across different Availability Zones within a Region. Since the workload is up to date across the Regions, you have minimum downtime in case of failure of a particular Region.

Test disaster recovery implementation to validate the implementation
Your DR strategy should be tested to ensure it continues to achieve the business outcomes through the different recovery paths. These recovery paths can be used to test the scaling, the availability of the latest data, and the overall architectural functionality in the DR site.
Manage configuration drift at the DR site or region

You should ensure that the infrastructure, data and configuration are as needed in the DR site or Region. This includes having the latest copies of AMIs and the same service quotas as you would have in the active Region.
Automate recovery

Based on health checks, business metric monitoring and third-party tools, you can automate system recovery and route traffic to the DR site or Region.

Review based on Reliability Pillar

Some of the questions which you will be going through when you review your workload against the Reliability pillar of the Well-Architected Framework are:
REL 1: How do you manage service quotas and constraints?
REL 2: How do you plan your network topology?
REL 3: How do you design your workload service architecture?
REL 4: How do you design interactions in a distributed system to
prevent failures?
REL 5: How do you design interactions in a distributed system to
mitigate or withstand failures?
REL 6: How do you monitor workload resources?
REL 7: How do you design your workload to adapt to changes in
demand?
REL 8: How do you implement change?
REL 9: How do you back up data?
REL 10: How do you use fault isolation to protect your workload?
REL 11: How do you design your workload to withstand component
failures?
REL 12: How do you test reliability?
REL 13: How do you plan for disaster recovery (DR)?
The answers to these questions during the review help you identify any gaps
in your existing workloads and implement the best practices in your AWS
environment.

Further Reading
https://docs.aws.amazon.com/wellarchitected/latest/reliability-
pillar/welcome.html
Ace the Performance Efficiency pillar

AWS Well-Architected Framework


Introduction
For a busy person who may not have the time to go through the hundreds of pages of the Well-Architected Framework, this guide serves as an abstract of the Performance Efficiency pillar. Understanding this pillar will ensure that you are equipped with the knowledge of the best practices needed to make your workload run efficiently in the cloud.

The Performance Efficiency Pillar

The Performance Efficiency pillar includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve. The key topics include selecting the right resource types and sizes based on workload requirements, monitoring performance, and making informed decisions to maintain efficiency as business needs grow.

Components
As with any pillar of the Well-Architected Framework, the Performance Efficiency pillar also covers the following two broad areas:
Design Principles
Definitions

Design Principles

Following the various design principles highlighted in this pillar can help you achieve and maintain efficient workloads in the cloud.
1. Democratize advanced technologies

You should make it easier for your teams to adopt and implement new and complex technologies rather than presenting them with a steep learning curve. This can be done by consuming complex technologies as a service in the cloud. For example, using Amazon SQS as the message queuing service for your workload takes away the heavy lifting of setting up your own messaging infrastructure, so the teams can focus on product development.
2. Go global in minutes

Your architecture should be designed in a way that you can deploy your workload across multiple AWS Regions, which helps you reach a wider audience and benefits end users with lower latency.
3. Use serverless architectures
Instead of using traditional servers, you can make use of serverless architectures to run your code. This not only removes the burden of provisioning and maintaining servers, but also ensures that the managed services scale according to your workload.
4. Experiment more often

With benefits such as infrastructure as code and on-demand provisioning of resources, you can test more often to evaluate different types of resources and implement changes to your workload more frequently than with traditional systems.
5. Consider mechanical sympathy

You need to understand how a particular system operates best according to your business goals or workload outcomes. There is no one-size-fits-all option in the cloud, and you have the option to select the resource that best matches your data access patterns and customer needs.

Definitions

AWS outlined four focus areas that encompass performance efficiency in the
cloud:
Selection
Review
Monitoring
Trade-offs
Let’s go into the details of each of them to understand them better and how
we can use them to create an efficient and sustainable workload in the cloud.

Selection

It is important to select the right resource types for your infrastructure, such as instances, containers and functions, along with the right approach to elasticity. This exercise should start during the design phase itself, as it sets the tone for how your resources are able to sustain production workload requirements. The following set of considerations should be kept in mind.
1. Performance architecture selection

Understand the available services and resources: Rather than just trying to replicate what your on-premises workload uses, you should learn about and understand all the available services and resources in the cloud.

Define a process for architectural choices: Your architectural decisions should be based on a process which considers best practices, reference architectures, past lessons learnt and your workload-specific requirements.
Factor cost requirements into decisions: You should factor in cost-effectiveness when making decisions about your architecture. Options such as AWS managed services can help reduce operational overhead and turn out to be cost effective in the long run.

Use policies or reference architectures: You can refer to your internal policies for running your workload and do analysis to improve your architecture for optimal performance.

Use guidance from your cloud provider or an appropriate partner:


AWS solutions architects and Partners have tons of experience
from various customers to help you optimize your environment.
You can seek their expertise to help you unlock potential for your
workloads.

Benchmark existing workloads: You can benchmark your existing workload in the on-premises network and use the data collected to drive architectural decisions.

Load test your workload: Once you have deployed resources, you
should load test the environment to see how the workload
performs under stress conditions. By using CloudWatch metrics
you can see the performance of various components and make
changes to meet the desired requirements.

2. Compute architecture selection

AWS provides three different forms of compute: instances, containers and functions.
Instances: Amazon EC2 instances are virtual servers in the cloud. Depending on their families and sizes, they offer different compute, memory and storage capabilities. Make decisions driven by data from your workload to select the right instance type, and you can further tune the operating system to meet your requirements.
Containers: Containers are a form of operating system virtualization which can be used to run microservices or software processes isolated from each other. With AWS you get an option to select either the EC2 or the Fargate launch type for container clusters. With the Amazon EC2 launch type, you control the EC2 instances by provisioning them in your VPC, whereas Fargate takes away the burden of launching and provisioning compute resources in a serverless fashion. Once the compute resources are decided, you need to choose your orchestration platform. While Amazon ECS can be used to run Docker containers, Amazon EKS is used to consume Kubernetes as a service. Both of them have their own use cases, and it may not just be a binary decision as they can work together seamlessly. You should also consider your application environment and your team's experience before choosing the final orchestration tool.

Functions: AWS Lambda functions can be used to run your code in an abstracted environment. This is an excellent way to run microservices by uploading your code package and selecting the programming language, memory requirements and permissions. You can use Lambda with Amazon API Gateway to receive end-user requests by creating REST and HTTP APIs (a short sketch follows below).

When you select your compute option, you should also consider inputs such as GPU requirements, I/O, memory-intensive versus compute-intensive needs, and elasticity.
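As a rough sketch of the Lambda-plus-API-Gateway pattern mentioned above, assuming a proxy integration (the payload shape and message are illustrative):

    import json

    def lambda_handler(event, context):
        """Handle a request proxied by API Gateway and return an HTTP-style response."""
        name = (event.get("queryStringParameters") or {}).get("name", "world")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"message": f"Hello, {name}!"}),
        }

Memory, timeout and IAM permissions are configured on the function itself, so the code stays focused on the business logic.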
3. Storage architecture selection

An optimal storage solution selection depends on the following factors


Kind of access method: Block, File or Object
Access pattern: Sequential or random
Throughput required
Frequency of access: Online, offline archival
Frequency of update: WORM, Dynamic
Availability
Durability

You may also choose more than one storage type depending on your workload, for example S3 for the image storage accessed by your users and Amazon EBS for storing the WordPress files powering your dynamic website. Let's review the four storage offerings by AWS.
Block Storage: Amazon EBS and EC2 instance store volumes
can be attached to your EC2 instances. They are accessible
from a single EC2 instance and ideal for latency sensitive
applications when the data is mostly accessed from the EC2
instances.

File Storage: Amazon EFS and FSx offer file storage over industry-standard protocols such as NFS and SMB. These file systems can be accessed by multiple EC2 instances at the same time and are suitable when a group of servers, such as a High-Performance Computing (HPC) cluster, needs to access a shared file system.

Object storage: Amazon S3 provides a large-scale object storage option with the highest levels of availability and durability. Often called the storage of the internet, it can be used to store images, documents and media files for your end users. You can further accelerate the delivery of objects stored in S3 by using Amazon CloudFront as a content delivery network.

Archival: Amazon S3 Infrequent Access and Glacier offer options to store data which is not frequently accessed or is used for archival purposes. They offer storage at a very low cost but come with a higher retrieval time. If you need a long-term backup solution, then you can consider Glacier as your preferred choice of storage.
4. Database architecture selection

An optimal database solution selection depends on the requirements of availability, consistency, partition tolerance, latency, durability, scalability, and query capability. There is no single database which will meet all the conditions; however, there are databases designed for specific as well as the most common types of use cases. The following database types and services are offered by AWS:
Relational: Amazon Aurora, RDS and Redshift can be used as relational databases if your data has a well-defined structure and there are identified relationships within it. If your use case requires ACID compliance and strong data consistency, then you should choose a relational database for your workload.

Key-value: Key-value databases store data as a single collection without any structure or relationships. Amazon DynamoDB offers a managed key-value database solution which is typically used for high-traffic web applications, ecommerce sites and gaming applications.

In-memory: For read-intensive applications such as product queries, you can place an in-memory data store such as ElastiCache in front of your Amazon RDS database to cache data in memory and deliver microsecond latency.

Document: For storing content, catalogs and user profiles in a JSON-like structure, you can use Amazon DocumentDB, which is a managed document database.

Wide column: Amazon Keyspaces (for Apache Cassandra) is a managed wide-column database service. It is a type of NoSQL database in which the names and format of the columns can vary from row to row in the same table. Typical use cases include high-scale industrial apps for equipment management, fleet management and route optimization.
Graph: If your use case is identifying relationships between
highly connected graph datasets at scale for example social
networking or fraud detection, then you can evaluate Amazon
Neptune as a graph database for your workload.

Ledger: If you are looking to maintain systems of record, supply chain records, registrations or banking transactions, then Amazon Quantum Ledger Database (QLDB) offers transparent, immutable and cryptographically verifiable transaction logs owned by a central trusted authority.

Timeseries: For IoT, DevOps and industrial telemetry data which calls for a time-series database, Amazon Timestream is a database offering which you should evaluate.

It is recommended to use a separate database for each microservice and to record real-time database performance metrics so that you are proactively notified of issues such as slow queries or system-induced latency.
5. Network architecture selection

There are a few things which you need to keep in mind before arriving at an optimal network selection:
a) Understand how networking impacts performance: Depending on the workloads and their consumers, factors such as latency and throughput can have a negative or positive impact on performance. For High-Performance Computing (HPC) workloads, you need to keep the resources in the cluster as close together as possible; you should place them in a placement group in the VPC while taking advantage of enhanced networking and Elastic Network Adapters (ENA) on the EC2 instances. From the user's perspective, you should consider using Global Accelerator or CloudFront to minimize latency and improve the delivery of content.
b) Extending connectivity to the on-premises network: If your workloads have dependencies on the on-premises network, then depending on the latency and throughput requirements you should consider either a dedicated Direct Connect connection or Site-to-Site VPN connectivity.
c) Evaluate other networking features: There are cloud-specific networking features which can help you reduce costs as well as improve overall data transfer performance. By making use of gateway and interface VPC endpoints, you can get reliable and private connectivity to public AWS services such as S3, which also reduces your overall NAT Gateway data processing costs. For global networks you can leverage the latency-based routing feature in Route 53 to route requests to the endpoint closest to the user's location based on latency.
d) Choose location based on network requirements: There are applications sitting in the on-premises network which need to benefit from cloud offerings but cannot afford the latency or have data residency requirements. Under such conditions you can evaluate Local Zones, Wavelength or Outposts, which take the AWS cloud closer to your on-premises workloads and offer a unique hybrid experience.

Review

1. Review your workload

In order to adopt a data-driven approach to architecture, you should implement a performance review process that considers the following:
Infrastructure as code: Define your infrastructure as code (IaC) using approaches such as CloudFormation templates or third-party solutions like Terraform. The use of IaC allows you to uniformly apply best practices to your application code and infrastructure, and also helps you innovate at a rapid pace due to the ease of deployment.
Deployment pipeline: Continuous Integration (CI) allows you
to continuously integrate code into a single shared and easy to
access repository. Continuous Delivery (CD) allows you to
take the code stored in the repository and continuously deliver
it to production. Together the CI/CD pipeline creates a fast
and effective process of deploying your infrastructure and
releasing your new features.

Well-defined metrics: You need to define the key performance indicators (KPIs) of your workload, which can be captured by setting up metrics and monitoring them. The considerations for defining KPIs should include technical aspects such as time to first byte or latency, and business aspects such as cost per user transaction or engagement time for an end user. When monitoring them from a performance perspective, you should consider trends and business goals to decide whether to monitor the average or the max/min values of the metrics to achieve the desired state.

Performance tests automatically: Your deployment pipeline should include a test environment which spins up infrastructure similar to the production environment. You should have built-in scripts which generate load on the test infrastructure to monitor the performance and the effect of changes introduced by new builds.

Load generation: Make use of canaries or synthetic monitoring, which enable you to consistently observe the performance of your workload against an end-user script. By understanding how your customers experience your application, you get a more realistic understanding of your workload outcomes (a minimal load-generation sketch follows after this list).

Performance visibility: The key metrics which define your business outcomes or workload performance should be made visible to your team. They should be able to see the metric changes corresponding to each build to understand positive and negative trends over time. This also enables faster resolution in case of operational events.

Visualization: A well-defined visualization dashboard enables you to quickly identify issues and gives you an overall picture of your workload performance. CloudWatch dashboards can help you create multiple graphs which enable you to see your infrastructure performance along with the workload outputs at the same time.
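A minimal load-generation sketch using only the Python standard library; in practice you would use a dedicated load-testing tool, and the URL, concurrency and request count are assumptions:

    import statistics
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "https://test.example.com/health"      # hypothetical test endpoint

    def timed_request(_):
        start = time.perf_counter()
        with urllib.request.urlopen(URL, timeout=5) as resp:
            resp.read()
        return time.perf_counter() - start

    # Fire 500 requests with 20 concurrent workers and summarize the latency.
    with ThreadPoolExecutor(max_workers=20) as pool:
        latencies = list(pool.map(timed_request, range(500)))

    print(f"p50={statistics.median(latencies):.3f}s  max={max(latencies):.3f}s")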

2. Evolve your workload

AWS continuously innovates to meet customer needs and you should take
advantage of that to evolve your workload.
Stay up-to-date on new resources and services: As new services, features and design patterns are released, you should identify ways to implement them.

Define a process to improve workload performance: You need to have a process to test the performance of the workload against new offerings. This can be done using a separate test environment or using synthetic monitoring.

Evolve workload performance over time: Based on the data gathered from your workload metrics and testing against new features and services, you should evaluate which components would benefit most from evolving with the changes.

Monitoring
You must monitor your architecture’s performance in order to remediate any
issues before they impact your customers. While monitoring consists of the 5
distinct phases of Generation, Aggregation, Real-time processing and
alarming, Storage and Analytics, all these solutions fall into two different
categories:
Active Monitoring: In this type of monitoring, you can simulate
the user experience across various components in your workload
by running certain scripts. These can be as simple as simulating
packet loss in your network by sending ping requests from point
A to point B or creating synthetic HTTP GET/POST requests to
add test entries into your databases.

Passive Monitoring: Usually for web-based workloads, you can use passive monitoring across your users, geographies, browsers and device types. This can be done using native CloudWatch logs as well as vended CloudWatch logs from your application. Further, you can trigger actions from your logs and metrics to automate responses if the performance drops below a certain baseline.

A few things to keep in mind in your monitoring strategy are:


1. Record performance related metrics: You should identify the
metrics that matter for your workload and record them. These can
be infrastructure related and also based on user experience.

2. Analyze metrics when events or incidents occur: If and when an outage occurs, you should review diagnostic logs and metrics to understand its impact on performance. This provides you with lessons to adopt additional measures to avoid incidents in the future.

3. Establish KPIs to measure workload performance: You need to set up the baseline performance and target performance of your workload, backed by customer requirements. These help you consistently aim for improvement even if there is no incident or event.

4. Use monitoring to generate alarm-based notifications: Your monitoring system should be integrated with alarms and automated actions to alert your operations team to take corrective actions.

5. Review metrics at regular intervals: Various stakeholders, including but not limited to the operations, product and management teams, should review the metrics at regular intervals. Based on workload changes you should adopt new metrics to help identify and prevent new issues.

Trade-offs
You cannot always get the best of all worlds and you need to think of trade-
offs to ensure an optimal approach. For example, a key-value data store can
provide a single millisecond latency for queries but you will have to design
your application to use NoSQL based queries rather than the traditional SQL
based data access pattern. You can use trade-offs to improve performance by
following certain best practices:
Understand the areas where performance is most critical: You should identify the areas where improving the performance of your workload will have the biggest impact on customer experience. For example, a website serving a large number of customers spread globally would improve user experience by using CloudFront for content delivery.
Learn about design patterns and services: There are many reference architectures which may differ from your current workload environment, but by adopting them you can make your workload more efficient. The Amazon Builders' Library contains various tried and tested architectures and methods which you can use to improve your resources, accepting certain trade-offs to achieve overall efficiency.

Identify how trade-offs impact customers and efficiency: While adopting solutions such as key-value databases may improve transaction times, they also use eventual consistency, which may affect some other workloads. You need to identify what changes are required to make efficient use of the trade-offs, and make sure their adoption improves customer experience and workload efficiency.

Measure the impact of performance improvements: Once you implement changes, you need to measure their impact using metrics to identify whether the trade-offs result in any negative impact. If a certain strategy has some unintended side effects, then you can use a combination of performance-related strategies to have a well-architected system.

Use various performance-related strategies: By implementing multiple strategies to improve the performance of your workload, you can improve the end-user experience. For example, by using caching solutions you can reduce the load on your network and database, and by using read replicas you can improve database queries.

Review based on Performance Efficiency Pillar
Overall, you should take a data-driven approach to achieve and maintain the performance efficiency of your environment. Some of the questions which you will be going through when you review your workload against the Performance Efficiency pillar of the Well-Architected Framework are:
PERF 1: How do you select the best performing architecture?
PERF 2: How do you select your compute solution?
PERF 3: How do you select your storage solution?
PERF 4: How do you select your database solution?
PERF 5: How do you configure your networking solution?
PERF 6: How do you evolve your workload to take advantage of new
releases?
PERF 7: How do you monitor your resources to ensure they are
performing?
PERF 8: How do you use trade-offs to improve performance?
The answers to these questions during the review help you identify any gaps
in your existing workloads and implement the best practices in your AWS
environment.

Further Reading
https://d1.awsstatic.com/whitepapers/architecture/AWS-Performance-
Efficiency-Pillar.pdf
Ace the Operational Excellence
pillar
AWS Well-Architected Framework
Introduction
For a busy person who may not have the time to go through the hundreds of pages of the Well-Architected Framework, this guide serves as an abstract of the Operational Excellence pillar. Understanding the Operational Excellence pillar will ensure that you are equipped with the knowledge of the operational best practices which you can apply to your workloads in the cloud.

The Operational Excellence Pillar

The Operational Excellence pillar was added to the Well-Architected Framework later, as a realization of the fact that customers in the cloud may have expertise in building workloads but not necessarily in operating them. Organizations often ignore operational outcomes or think about them later. However, we should consider operational excellence as one leg of a three-legged stool, where the other two legs are business and development. If you choose to stand on the stool without operational excellence, then it is a very dangerous balancing act that you are opting for.
Let’s see how we can apply this as a foundation of your well-architected
solutions.
Components
As with any pillar of the Well-Architected Framework, the Operational Excellence pillar also covers the following two broad areas:
Design Principles
Definitions

Design Principles

There are several design principles highlighted in the Operational Excellence pillar of the Well-Architected Framework that you should consider following if you wish to achieve operational excellence for your applications hosted on AWS:
1. Perform operations as code

Since your cloud infrastructure is all software and has virtual components, it can all be automated. You can start by using infrastructure as code services to deploy your environments and then introduce automation in logging, monitoring and change management.
2. Make frequent, small, reversible changes

Your environment will need changes over time, and you should determine the frequency of these changes depending on their impact, making them frequent but small so that they do not impact customers. You should make use of snapshots wherever possible to reverse those changes if required.
3. Refine operations procedures frequently

As new services and features are introduced, your operational procedures should incorporate them to become more efficient. Your operations team should review these procedures frequently and make sure they are tested and documented in a timely manner.
4. Anticipate failure

You need to prepare for failure by proactively testing various scenarios and
having automations in place to remove or mitigate the failures.
5. Learn from all operational failures

Any kind of operational failure should be taken as a lesson to not repeat the
same mistake again. This starts with a post mortem and includes corrective
actions and ends with sharing information at all organizational levels.

Definitions

AWS outlined four focus areas that encompass operational excellence in the cloud:
Organization
Prepare
Operate
Evolve

Let's go into the details of each of them to understand them better and how we can use them to achieve operational excellence in our cloud environment.

Organization
Organization Priorities
Understanding the organizational priorities and reviewing them in a timely manner ensures that you reap the benefits of all the efforts that have been put in place and achieve your business goals.
1. Evaluate external customer needs

As a collective practice, all the important stakeholders from various business verticals should evaluate what the external customer needs are and how you can best serve them.
2. Evaluate internal customer needs

Your internal customers, which include your workforce and other teams in the organization, are equally important as your external customers. The key stakeholders should focus on their needs as well.

3. Evaluate governance requirements

While deciding your priorities, you should be aware of the overall management approach decided by the senior executives to control and direct your organization. This ensures that strategies, directives and instructions from management are carried out systematically and effectively.
4. Evaluate external compliance requirements
Similar to governance requirements, by being aware of the compliance requirements, your organization can demonstrate that it has conformed to specific requirements in laws, regulations, contracts, strategies and policies.
5. Evaluate threat landscape

You should consider all the threats to your business, be they related to competition, liability, operations or information security. The list of these threats should be constantly updated to determine your priorities.
6. Evaluate trade-offs

In case of multiple alternatives and objectives, you need to consider the trade-off of choosing one over the other and whether it means forgoing a benefit or opportunity with respect to your business outcomes.
7. Manage benefits and risks

You need to be able to take calculated risks and make decisions which can be reversed. Pursuing the benefit of a particular change may expose you to a risk; however, if you can manage that risk or easily revert the change, then it is worth trying.
Operating Model
A well-defined organizational operating model gives a clear understanding of responsibilities and reduces the frequency of conflicting and redundant efforts. This further helps achieve business outcomes through strong alignment and relationships between the business, development and operations teams.
There are two main aspects of the operating model:
1) Operating Model 2 by 2 Representations: With the help of various illustrations, you can understand the relationship between teams in your environment. You can use one or more of the following operating models depending on your organizational strategy or stage of development.
Fully Separated Operating Model

The activities in each quadrant are performed by a separate team. Work is passed between teams through mechanisms such as work requests, work queues, tickets, or by using an IT service management (ITSM) system. While there are clear sets of responsibilities, such a model carries the risk of teams becoming narrowly specialized, physically isolated, or logically isolated, hindering communication and collaboration.
Separated Application Engineering and Operations (AEO) and Infrastructure Engineering and Operations (IEO) with Centralized Governance

This model follows a "you build it, you run it" methodology. Your application engineers and developers perform both the engineering and the operation of their workloads. Similarly, your infrastructure engineers perform both the engineering and operation of the platforms they use to support application teams.
Separated AEO and IEO with Centralized Governance and
a Service Provider

This is similar to the previous model; however, by adding a service provider such as AWS Managed Services, you can benefit from their expertise in setting up cloud infrastructure and meeting security and compliance requirements.
Separated AEO and IEO with Decentralized Governance

In this model, while the application and infrastructure teams still have separate responsibilities, with decentralized governance the application team has fewer constraints. They are free to engineer and operate new platform capabilities in support of their workload.
2) Relationships and Ownership: You may choose any type of operating model; however, you need to have a clear understanding of the ownership of the various resources and processes. The team members should be well aware of their responsibilities, and the identified owners need to have set performance targets to continuously improve the business outcomes.

Organizational Culture
Organizational culture is often deeply ingrained. You have to build into your culture that you support your team members effectively, so that they can in turn support operations and help you realize the desired business outcomes.
1. Executive Sponsorship
Your Senior Leadership should be the sponsor, advocate, and driver for the
adoption of best practices and evolution of the organization.

2. Team members are empowered to take action when outcomes are at risk

The operations team members should have enough resources and escalation mechanisms to respond to events which may impact business outcomes.
3. Escalation is encouraged

Continuing on the theme of empowerment, team members should not hesitate to escalate to the highest authorities to move things along, and the practice of escalating in time should be followed.
4. Communications are timely, clear, and actionable

Be it planned maintenance activities, sales events or a change freeze window, all events should be communicated along with their context, details and timing. This allows the team to prepare for such events and not be caught by surprise when an issue occurs. Services such as AWS Systems Manager can be used to maintain the operational change calendar and notify stakeholders.
5. Experimentation is encouraged

You cannot invent without experimenting, and it should happen constantly. A lot of experiments may fail, and your team members should not be chastised for a failure. Instead, you should support your workforce in experimenting in a well-organized manner.
6. Team members are enabled and encouraged to maintain and grow
their skill sets

In order to enable team members to work on new technologies and take on additional responsibilities, they should be encouraged to develop their skills. Moving to the cloud itself can be a big learning curve, and with the frequent introduction of new services and features, a learning culture plays a significant role in keeping up to speed. Various AWS resources like blogs, Online Tech Talks, certifications, events and labs can be used to support your team members' upskilling.
7. Resource teams appropriately

Your team members should have well-defined roles and responsibilities. In order for them to support your workloads effectively, they should be provided with sufficient resources and tools.
8. Diverse opinions are encouraged and sought within and across
teams

You should seek diverse perspectives about your approaches through cross-team collaboration; this reduces the risk of confirmation bias and leads to the generation of multiple ideas.

Prepare
Prepare is all about setting things up for success through telemetry,
development tool chain and making informed decisions. There are four main
areas of engagement:
Design Telemetry
Telemetry allows you to gather important data from your resources and keep
you in control of your workload.
1. Implement application telemetry

Your applications should be designed to publish metrics related to performance, status and desired results. By using the CloudWatch Logs agent on your EC2 instances, you can export logs as well as publish custom metrics to the Amazon CloudWatch endpoint (a short sketch of publishing a custom metric follows at the end of this section).
2. Implement and configure workload telemetry

Similar to application, your workloads should also publish data related to its
status. Almost all the services have relevant CloudWatch metrics which you
can monitor and for the custom workloads, you can publish custom metrics to
CloudWatch to monitor metrics such as HTTP status codes, API latency etc.
3. Implement user activity telemetry

By gathering advanced analytics related to user activity, such as transactions, click patterns and engagement, you can make informed business decisions and do custom targeting for your end users.
4. Implement dependency telemetry

Apart from the critical components, you should also configure your workloads to emit telemetry data for the resources on which they depend. These could be vendor systems, internet weather or external databases.
5. Implement transaction traceability

In the world of complex applications, countless transactions happen between components, and having the workloads emit information about the transactional flow can help you identify issues faster. AWS X-Ray helps you analyse and debug production, distributed applications seamlessly.
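As a short sketch of the application telemetry practice above, the snippet below publishes a custom business-level metric to CloudWatch; the namespace, metric name and dimensions are illustrative:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a business-level metric alongside the infrastructure metrics
    # that AWS services emit on their own.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Checkout",               # hypothetical namespace
        MetricData=[{
            "MetricName": "OrdersPlaced",
            "Dimensions": [{"Name": "Environment", "Value": "production"}],
            "Value": 1,
            "Unit": "Count",
        }],
    )

Metrics published this way can then be graphed, alarmed on and correlated with workload telemetry like any other CloudWatch metric.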
Improve Flow
1. Use version control

Version control for your workloads enable you to track changes and releases
which can further help you to introduce small incremental changes as well as
help in rollback whenever necessary. Services such as AWS CodeCommit
and CloudFormation can help you in managing version control of your
infrastructure as well as code.
2. Test and validate changes

With cloud you don’t have to worry about the cost and time to set up
new infrastructure. For any changes you can follow A/B testing or
blue-green deployments with a parallel infrastructure and test your
changes.
3. Use configuration management systems

Instead of manual methods, you can use configuration management systems in the form of AWS native services such as Systems Manager, or third-party solutions like Chef, Puppet, etc. This reduces the effort as well as the errors involved in making changes to the environment.
4. Use build and deployment management systems

Similar to configuration management, you can also use build and deployment management systems in the form of CI/CD pipelines using AWS developer tools such as CodeCommit, CodeDeploy and CodePipeline.
5. Perform patch management

Performing system patches in a timely and organized manner not only reduces security risks but also helps you take advantage of the latest features available with new versions. Most AWS managed services take away the burden of applying the latest patches, and for services like EC2 you can use Systems Manager to apply operating system patches after testing.
6. Share design standards

Having shared design standards helps minimize duplicated development effort across teams working in isolation. Your teams can share these designs through AWS Lambda, CodeCommit and S3, and also make use of SNS to keep everyone updated with the latest changes.
7. Implement practices to improve code quality

By using test-driven development, code reviews and guidelines, you can improve the code quality, which minimizes defects as well as development effort.
8. Use multiple environments

By having separate development, test and production environments, you can ensure your workload operates as intended in the final deployment. This can be done through different stages in API Gateway or by using a different VPC for each environment.
9. Make frequent, small, reversible changes

Backed by configuration, build and deployment management solutions, you can make small changes which increase the pace of innovation and are also easy to troubleshoot and roll back in case of any issues.
10. Fully automate integration and deployment

The entire chain of build, deployment and testing should be automated to reduce errors and deployment effort.
Mitigate Deployment Risks
1. Plan for unsuccessful changes

With the help of timed snapshots or backups and a rollback plan in place, you
can ensure that even if there is a failed deployment your production
environment can continue to run as desired.
2. Test and validate changes

Any changes in your lifecycle stages can be tested and validated by creating parallel systems. You can also make use of AWS CloudFormation to deploy changes, which allows you to see the effect of drift and also easily roll changes back.
3. Use deployment management systems

By building a CI/CD pipeline using AWS services such as CodeCommit, CodeDeploy, CodePipeline, etc., you can track and implement changes in an automated process with minimal errors.
4. Test using limited deployments

Instead of full-scale changes, you can test using canary deployments or one-box deployments to confirm the desired outcome of your changes.
5. Deploy using parallel environments
With the help of blue-green deployments, you can deploy changes in a new
environment and route traffic to it. A simple example would be
creating a new load balancer and EC2 instances for the new environment; once
you are ready to move the changes to production, you simply change the
DNS record in Route53 to point to the new ALB (see the sketch at the end of this list).
6. Deploy frequent, small, reversible changes

A change with a smaller scope results in faster remediation and easier
troubleshooting. You can make many such changes to keep up the pace of innovation.
7. Fully automate integration and deployment

Instead of manual efforts, your integration and deployment should be fully
automated to reduce effort and errors.
8. Automate testing and rollback

With the help of canaries and various test benches, testing of your changes
should be automated with mechanisms in place to automatically roll them
back for a minimal production impact.
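The following minimal sketch shows the DNS cut-over step of the blue-green deployment mentioned above, assuming a hosted zone and a new (green) ALB already exist; the hosted zone IDs, record name and ALB DNS name are placeholders.

    import boto3

    route53 = boto3.client("route53")

    # Repoint the application's DNS alias from the old (blue) ALB to the new
    # (green) ALB by upserting the alias record.
    route53.change_resource_record_sets(
        HostedZoneId="Z1EXAMPLE",                      # placeholder hosted zone
        ChangeBatch={
            "Comment": "Cut over to the green environment",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": "Z35SXDOTRQ7X7K",  # placeholder: the ALB's own zone ID
                        "DNSName": "green-alb-1234567890.us-east-1.elb.amazonaws.com",
                        "EvaluateTargetHealth": True,
                    },
                },
            }],
        },
    )

Using a weighted record set instead of a single alias would let you shift traffic gradually rather than all at once.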
Understand Operational Readiness
1. Ensure personnel capability

With the help of AWS training resources, you can ensure that your workforce
from various domains is equipped with the knowledge to run operations
successfully.
2. Ensure consistent review of operational readiness

You should have recurring reviews of your operational readiness to ensure you
can run your workloads. Services such as AWS Config rules and Security
Hub help ensure that your environment is aligned with best practices and
standards.
3. Use runbooks to perform procedures
You should run automated runbooks to respond to events or achieve an
outcome. By having the runbooks in the form of code, you can ensure
consistent, error-free execution (a minimal sketch appears at the end of this list).
4. Use playbooks to identify issues

Similar to runbooks, you need automated playbooks to investigate
issues. This helps you have a consistent and prompt response in case of a
system failure.
5. Make informed decisions to deploy systems and changes

You should analyse the benefits and risks of your deployments before making
changes, and evaluate them against your workforce capabilities and governance
requirements.
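As a minimal sketch of a runbook as code (relating to the runbook practice above), the snippet below starts an AWS-provided Systems Manager Automation runbook; the instance ID is a placeholder.

    import boto3

    ssm = boto3.client("ssm")

    # Run an AWS-provided automation runbook to restart an instance in a
    # controlled, repeatable way instead of doing it by hand.
    execution = ssm.start_automation_execution(
        DocumentName="AWS-RestartEC2Instance",
        Parameters={"InstanceId": ["i-0abc123de456f7890"]},  # placeholder instance
    )
    print(execution["AutomationExecutionId"])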

Operate

As an important part of operations, you should be aware of your workload
and operations performance through KPIs and metrics. You should also be
equipped with processes to respond to events which may impact that
performance.
Understand Workload Health
1. Identify key performance indicators

You need to identify the key performance indicators for your business and
customer outcomes which determine whether your workload is working
efficiently towards the desired results. These can be in terms of orders, revenue,
customer satisfaction score, etc.
2. Define workload metrics

Based on your KPIs, you should measure the performance of your workload.
Relating business performance to workload metrics helps you
benchmark your operational efforts. With the help of the CloudWatch agent
on your servers, you can publish the custom metrics which you need to
define and monitor (a sketch of publishing and alarming on a custom metric
appears at the end of this list).
3. Collect and analyze workload metrics

From various types of workload components such as applications, API calls and dependent
services, you can collect logs and aggregate them in CloudWatch Logs or
export them to S3. You can further analyze them using CloudWatch Logs
Insights or Glue to draw meaningful conclusions.
4. Establish workload metrics baselines

An appropriate baseline should be set up for each metric, which helps you identify
whether your workload is delivering the expected results. If a metric exceeds its
threshold, it should trigger an investigation.
5. Learn expected patterns of activity for workload

By continuously analyzing your workload patterns, you can quickly detect
anomalies when there is a deviation from normal. Services such as
CloudWatch anomaly detection can help you surface anomalies with
minimal user intervention.
6. Alert when workload outcomes are at risk

If you are using metrics to monitor your workload performance, you should
have alarms in place to alert you if they exceed a certain threshold.
7. Alert when workload anomalies are detected

Similarly, CloudWatch anomaly detection should be configured
to alert you with an alarm if an anomaly is detected in your metrics.

8. Validate the achievement of outcomes and the effectiveness of KPIs and metrics

All monitoring efforts should be validated using business intelligence
tools, and you should iterate on them until they align with your business
goals.
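A minimal sketch of defining a workload metric and alerting when the outcome is at risk, as described in this list; the MyCompany/Orders namespace, dimension and SNS topic ARN are hypothetical.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a business-level KPI (orders processed) as a custom metric.
    cloudwatch.put_metric_data(
        Namespace="MyCompany/Orders",
        MetricData=[{
            "MetricName": "OrdersProcessed",
            "Dimensions": [{"Name": "Environment", "Value": "production"}],
            "Value": 42,
            "Unit": "Count",
        }],
    )

    # Alarm if the order rate drops below one order for three consecutive
    # five-minute periods, notifying an existing SNS topic.
    cloudwatch.put_metric_alarm(
        AlarmName="orders-processed-low",
        Namespace="MyCompany/Orders",
        MetricName="OrdersProcessed",
        Dimensions=[{"Name": "Environment", "Value": "production"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=3,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
    )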
Understand Operational Health
1. Identify key performance indicators

You need to identify the key performance indicators for your business and
customer outcomes which determine whether your operations are aligned
towards the desired results. These can be in terms of new feature
releases, customer cases or uptime of your services.
2. Define operations metrics

You should have operational metrics by which you can measure the
effectiveness of your operations in achieving the KPIs. For example, time to
resolve (TTR) an incident or deployment success rate can be used as metrics
against your business KPIs of uptime or new feature releases.
3. Collect and analyze operations metrics

You can aggregate operational metrics from various sources in
CloudWatch Logs and S3 and analyse them through CloudWatch Logs Insights or
Glue.
4. Establish operations metrics baselines

Once the metrics are in place, you can baseline them according to your
business outcomes and put in efforts to improve them towards a
benchmarking criterion.
5. Learn expected patterns of activity for operations

With consistent logging, you can identify patterns and detect anomalies in behavior
when there is a deviation from the usual.
6. Alert when workload outcomes are at risk

If your metrics go beyond a threshold that impacts your workloads, you can
configure CloudWatch alarms to be notified in time.
7. Alert when operations anomalies are detected

You may not always have a set threshold for your metrics, and in these cases
CloudWatch anomaly detection can help by identifying expected values
through pattern recognition and machine learning (see the sketch at the end
of this list).

8. Validate the achievement of outcomes and the effectiveness of KPIs and metrics

Your business and leadership teams should review your operational KPIs and
metrics to see if they align with the business goals, and provide
recommendations if necessary.
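A minimal sketch of alerting on an operations anomaly, as discussed above, using a CloudWatch anomaly detection band instead of a fixed threshold; the API Gateway metric, API name and SNS topic ARN are assumptions.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when latency rises above the band that anomaly detection
    # predicts from historical behaviour (two standard deviations wide).
    cloudwatch.put_metric_alarm(
        AlarmName="api-latency-anomaly",
        ComparisonOperator="GreaterThanUpperThreshold",
        EvaluationPeriods=2,
        ThresholdMetricId="band",
        TreatMissingData="ignore",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
        Metrics=[
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ApiGateway",
                        "MetricName": "Latency",
                        "Dimensions": [{"Name": "ApiName", "Value": "orders-api"}],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
                "Label": "Expected latency",
            },
        ],
    )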
Respond to Events
1. Use processes for event, incident, and problem management

You should anticipate planned and unplanned operational events, and
processes and tools should be in place for each category to reduce the time to
remediate them. For AWS resources, you can use OpsCenter in Systems
Manager to get a summary of all the issues affecting your resources in one place
and take corresponding actions from there.
2. Have a process per alert

You should have a playbook for every alert, and it should be updated regularly
to avoid surprises in case of an incident. These processes should have a
well-defined owner, and emphasis should be placed on automating them
as much as possible.
3. Prioritize operational events based on business impact

Depending on business impact and criticality, your
operational events should be categorized and the focus on rectifying
them prioritized accordingly. This could mean, for example, that a feature
release change is de-prioritized in favour of an event which impacts customer orders.
4. Define escalation paths

Having well-defined escalation paths in your runbooks and playbooks
ensures that you have the right person to contact in case of an event and
enables you to act quickly if human action is required.
5. Enable push notifications

With the help of push notifications in the form of email or SMS, you can keep
your users aware of any service impact and of the progress of the investigation.
6. Communicate status through dashboards

You can have internal and external dashboards to communicate the status of
your services and the appropriate metrics. With the help of CloudWatch
dashboards and Amazon QuickSight, various stakeholders can be made aware
of the latest status and use the data to relate it to other dependent services.
7. Automate responses to events

With every iteration, you should try to automate the most common scenarios
to reduce errors and the time taken to remediate problems. With CloudWatch
alarms you can define actions specific to EC2, or use SNS to trigger Lambda
functions for custom logic, as in the sketch below.
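A minimal sketch of such an automated response: the Lambda handler below is assumed to be subscribed to the SNS topic used by your CloudWatch alarms, and it records each alarm as an OpsItem in Systems Manager OpsCenter so that operators work from a single queue of issues. The severity and category values are arbitrary examples.

    import json
    import boto3

    ssm = boto3.client("ssm")

    def handler(event, context):
        # SNS delivers one or more records; each message is the JSON body
        # that CloudWatch alarms publish on state change.
        for record in event["Records"]:
            alarm = json.loads(record["Sns"]["Message"])
            ssm.create_ops_item(
                Title=f"Alarm: {alarm['AlarmName']}",
                Description=alarm.get("NewStateReason", "CloudWatch alarm triggered"),
                Source="CloudWatch",
                Severity="2",
                Category="Availability",
            )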
Evolve

Evolve is all about continuous improvement. As an organization, you can stop
the same bad things from happening again, or at least make new mistakes
rather than repeating the old ones. You can make small changes in your
processes or code to incrementally improve your environment. The
combination of the following practices helps your organisation evolve its operational
excellence over time:
Learn, Share, and Improve
1. Have a process for continuous improvement

Rather than one-time efforts, continuous improvement should be standardized,
with processes in place for how to make small improvements.
2. Perform post-incident analysis

You should have processes in place by which you can review the root
causes of customer- or business-impacting events. Rather than placing
blame on anyone, the effort should go into ensuring that the same
mistakes or errors do not happen again. A simple exercise would be
assembling all the stakeholders and asking the 5 Whys.
3. Implement feedback loops

With the help of feedback loops in your workload and processes, you
can be informed about issues and improvement areas on a recurring
basis. This helps your environment evolve over time.
4. Perform Knowledge Management

Your workforce should be equipped with the knowledge they need to do their
job effectively. This means that the contents should be refreshed regularly
and old information archived.
5. Define drivers for improvement

By aggregating logs from various workloads, services, applications and
infrastructure, you can get a detailed view of how the entire ecosystem of
your business is functioning. You can further visualize this data using
QuickSight, correlate various metrics and come up with improvement
plans.
6. Validate insights

Instead of working in isolation, you should validate the performance of your
workloads with dependent environments and business teams. This not
only provides additional scope for improvement, but also helps you learn
best practices from other teams and see how all of them can work together
towards achieving the organizational goal.

7. Perform operations metrics reviews

Your operational metrics reviews should be done with leadership from
various business areas. This gives you a cross-functional view of your
operations and helps surface improvement actions that matter to the business.
8. Document and share lessons learned

Keep a shared repository of your documentation for all your teams, which they
can use as a reference to avoid repeating the same mistakes.
9. Allocate time to make improvements

You can set aside dedicated time for improving your operations and gather
cross-team members to participate in these activities. Setting up parallel
environments and deliberately breaking them can help you test your processes and
tools and come up with improvement plans.

Review based on the Operational Excellence Pillar

Some of the questions which you will go through when you review
your workload against the operational excellence pillar of the Well-Architected
Framework are:
OPS 1: How do you determine what your priorities are?
OPS 2: How do you structure your organization to support your business
outcomes?
OPS 3: How does your organizational culture support your business
outcomes?
OPS 4: How do you design your workload so that you can understand its
state?
OPS 5: How do you reduce defects, ease remediation, and improve flow into
production?
OPS 6: How do you mitigate deployment risks?
OPS 7: How do you know that you are ready to support a workload?
OPS 8: How do you understand the health of your workload?
OPS 9: How do you understand the health of your operations?
OPS 10: How do you manage workload and operations events?
OPS 11: How do you evolve operations?
The answers to these questions during the review help you identify any gaps
in your existing workloads and implement the best practices in your AWS
environment.

Further Reading
https://d1.awsstatic.com/whitepapers/architecture/AWS-Operational-
Excellence-Pillar.pdf
Ace the
Cost Optimization pillar
AWS Well-Architected Framework
Introduction
For a busy person who may not have the time to go through the hundreds of
pages of the Well-Architected Framework, this guide serves as an
abstract of the cost optimization pillar of the Well-Architected
framework. Understanding the cost optimization pillar will ensure that
you are equipped with the knowledge of the best practices for
achieving cost efficiency, which you should implement on your
workloads in the cloud.

The Cost Optimization Pillar

Cost optimization has been one of the least well understood areas as customers
transition into the cloud. One of the challenges in the past was that the people
building the systems, be they programmers or architects, rarely had access to
the cost of the components they were using to build those systems. They used
servers and databases but were never exposed to the cost of these components.
With the cloud this is changing, and more and more engineers, not just the
finance teams, are becoming aware of the cost of the components they use.
The cost optimization pillar seeks to empower you to maximize value from
your investments, improve forecasting accuracy and cost predictability, create
a culture of ownership and cost transparency, and continuously measure your
optimization status. Let’s get started.

Components
As with any pillar of the Well-Architected Framework, the Cost
Optimization pillar covers the following two broad areas:
Design Principles
Definitions

Design Principles

Following the various design principles highlighted in this pillar can help you
optimize the cost of running your workloads:
1. Implement cloud financial management

In order to become a cost-efficient organization, you need to invest in Cloud
Financial Management by building a program with dedicated resources and
processes in place for cost optimization.
2. Adopt a consumption model
Depending upon your workload requirements, adopt a consumption model and
pay only for the resources you use. For example, development and test
environments are typically needed only during working hours; by automatically
stopping them when not in use, you can save up to 75% of their cost.
3. Measure overall efficiency

You need to measure business outcomes such as revenue or customer gains
against the cost of running your workloads. This helps you
understand the impact of an increase in cost on your business goals.
4. Stop spending money on undifferentiated heavy lifting

Your focus should be on customers and business projects which help you
achieve your organizational goals, rather than on building and running IT
infrastructure, which the cloud can take care of for you.
5. Analyse and attribute expenditure

By properly tagging your resources, you can identify the cost per project,
workload or department. This helps establish the ROI of your business
efforts and accordingly creates room for optimization.

Definitions
AWS outlined five focus areas that encompass cost optimization in the cloud:
Practice Cloud Financial Management
Expenditure and usage awareness
Cost-effective resources
Manage demand and supplying resources
Optimize over time

Let’s go into the details of each of them to understand them better and how
we can use them to optimize costs in our Cloud environment.
Practice Cloud Financial Management

Cloud financial management answers the “how” of cost optimization.
It defines the change in the entire culture of your organization as it moves to
the cloud and seeks to realize business value and financial success through
optimization. It involves the following best practices.
Functional ownership:
You need to establish a cost optimization function. This function will be
responsible for establishing and maintaining a culture of cost awareness.
Depending on the organization’s size, this can be an individual or a team with a
diverse set of skills ranging from project management and financial analysis
to software development. Executive sponsorship ensures the function runs
well and helps in defining organizational goals for cost optimization in the
cloud.
Finance and technology partnership:
Compared to traditional data-centre-based environments, technology
teams innovate faster in the cloud due to the reduced time for approval,
procurement and infrastructure deployment cycles. This requires you to
establish a partnership between finance and technology teams. The two most
relevant teams that should be involved in regular discussion in your cloud
journey are:
Financial leads:
CFOs, commercial and account managers should understand the cloud
model of consumption and purchasing. Since billing shifts from the fixed
pricing of on-premises operations to pay-as-you-go pricing, it is essential
that financial organizations understand how the usage of cloud resources
impacts the cost incurred by the business.
Technology leads:

Similarly, the product and technology leads should understand the budgets
and service level agreements. These financial requirements should be kept in
mind while designing cloud-based workloads for your business applications.
This partnership helps both the teams have real-time visibility into costs and
also establish a standard operating procedure to handle variance in cloud
spending.
Additionally, business unit owners and third parties should understand the
cloud business model so that they are aligned with the financial goals and
work towards an optimal return on investment (ROI).
Cloud Budgets and Forecasts
The efficiency, speed and agility offered by the cloud mean that cost and usage
can be highly variable. You can use AWS Cost Explorer to
forecast daily or monthly cloud costs based on your historical cost trends.
Your existing budgeting and forecasting processes should be modified to take
inputs from Cost Explorer to identify trends and business drivers, as in the
sketch below.
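A minimal sketch of pulling a forecast from Cost Explorer with boto3; the 30-day window is an arbitrary example and assumes Cost Explorer is enabled for the account.

    import boto3
    from datetime import date, timedelta

    ce = boto3.client("ce")

    # Forecast the unblended cost for the next 30 days from historical usage.
    start = date.today() + timedelta(days=1)
    end = start + timedelta(days=30)

    forecast = ce.get_cost_forecast(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Metric="UNBLENDED_COST",
        Granularity="MONTHLY",
    )
    print(forecast["Total"]["Amount"], forecast["Total"]["Unit"])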

Cost-Aware Processes
5. Implement cost awareness in your organizational processes

Cost awareness must be implemented in all new and existing
processes in your organization. This includes:
Change management: Changes to your workloads or
infrastructure should quantify the financial impact.
Operations management: Your existing incident management
process should be modified to identify the root causes for
variance (increase and decrease) of the cost of running your
workloads.
Automation: By investing in automation and tooling, your
organization can accelerate cost savings and business value
realization.
Training and development: Continuous training and certification
of various stakeholders in your organization by including cost
awareness topics helps build a workforce which is capable of
self-managing cost and usage.

6. Report and notify on cost and usage optimization

By using AWS Cost Explorer and AWS Budgets, you can regularly
report on cost and usage optimization within your organization (a Budgets
sketch appears at the end of this list). This should not be limited to
management or finance teams but should be extended to all stakeholders,
including technology teams. You can further customize reports by combining
Cost and Usage Report (CUR) data with Amazon QuickSight, which helps
create reports tailored to target audiences.
7. Monitor cost and usage proactively

Rather than investigating anomalies reactively, you should monitor cost and
usage proactively. You can make use of dashboards which are accessible to
everyone to make them aware of the organization’s focus on cost optimization.
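A minimal sketch of the AWS Budgets notification mentioned in this list; the account ID, budget amount and email address are placeholders.

    import boto3

    budgets = boto3.client("budgets")

    # A monthly cost budget with an email notification when the forecasted
    # spend exceeds 80% of the limit.
    budgets.create_budget(
        AccountId="123456789012",                      # placeholder account ID
        Budget={
            "BudgetName": "monthly-cloud-spend",
            "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "FORECASTED",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}],
        }],
    )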
Cost-aware culture
By starting with small changes, you can create an environment which is
aware of the cost of achieving your business goals. This can be in the form of
dashboards, rewarding teams that work on cost efficiency, or having top-down,
pre-defined cost goals. You can also get your stakeholders to
subscribe to the AWS News and AWS Cost Management blogs to make them aware of
new services and best practices which can help in increasing the cost efficiency
of your workloads.
Quantify business value delivered through cost optimization
When you terminate idle EC2 instances or delete unattached EBS volumes,
you can quantify the reduction in AWS spending. Similarly, you
can quantify the business value delivered from every kind of optimization.

Expenditure and usage awareness

In order to make informed decisions about where to allocate resources within
your organization, and to understand how profitable various business units
and products are, it is essential to understand the organization’s costs and the
drivers of its expenditure. Some of the key factors to consider when
generating awareness of usage and expenditure are:
1. Governance

High-level guidelines for managing cloud usage should be established
across the following governance areas:
Develop organizational policies: You need to develop
policies related to creating resources and workloads for
various units and teams. Examples include
establishing the AWS Regions in which resources should
run, determining the storage classes to be used by
production versus development teams, and the maximum
instance sizes which can be used in test/dev accounts.

Develop goals and targets: We expect our developers and
builders to drive down the cost of resources and make
workloads more efficient; however, this may not be part of
their job description. In order to set expectations right,
DevOps role descriptions should also include goals
and targets for making workloads more efficient.

Account structure: By leveraging AWS Organizations and
consolidated billing, you can set up a master account and
run workloads in member accounts. This can help you set
up service limits for member accounts running
specific workloads and also monitor cost and usage by
these groups.

Organizational groups and roles: Once the organizational
policies are set, you can create various IAM groups and
roles. Users who do similar tasks, such as system
administrators or members of the IT or finance department,
can be assigned to these groups. Your group policies
define which tasks these users are allowed to execute
and put guardrails in place for cross-account access.

Controls – Notifications: You can make use of AWS
Budgets to define a monthly budget for your AWS costs,
including budgets for commitment discounts. Budgets
can even be set at granular levels such as tags, Availability
Zones or services. Email or SNS notifications can be
triggered based on current or forecasted costs when usage
exceeds a pre-defined threshold.

Controls – Enforcement: AWS Organizations service
control policies (SCPs) allow you to enforce governance
policies for the member accounts in your
organization. These establish the maximum available
permissions for those accounts so that they stay within
the control guidelines. Within accounts, you
can use IAM policies at the group or user level to
control who can create and manage specific AWS resources.

Controls – Service Quotas: By understanding your
resource requirements and project progress, you can set up
service quotas which determine the number of resources
that can be created. You can increase or decrease service
quotas as demand changes while staying within
your budgeted limits.

Track the workload lifecycle: You need to know when a
particular workload or its resources are no longer needed
so they can be decommissioned or passed on to other
teams. This can be done by managing an inventory with
AWS Systems Manager and tracking the lifecycle of your
resources.

2. Monitoring cost and usage

It cannot be emphasized enough that by providing teams with detailed
visibility into cost and usage, they can take action to make their workloads
more efficient. The following are the most important areas:
Configure detailed data sources: You can enable hourly
granularity in Cost Explorer and create a Cost and Usage
Report (CUR) to get the most accurate view of cost and
usage across your entire organization. You can customize your
CUR to include resource IDs and versioning, and integrate the
data with Athena to perform analysis.

Identify cost attribution categories: You must allocate cost
to the various stages of a workload’s lifecycle, such as
development, testing, production and decommissioning.
Furthermore, different accounts can be created for
learning and staff development in order to segregate those
costs rather than attributing them to general IT costs.

Establish workload metrics: You need to determine how
workload output impacts performance and, in turn, business
success. This helps in determining workload
efficiency and the cost of each business output.

Assign organizational meaning to cost and usage: By
assigning tags to each resource, such as an EC2 instance or
an S3 bucket, you can use this information in Cost and
Usage Reports to relate costs to meaningful
organizational information. Tags can represent relevant
categories such as cost centres, application names, projects
and owners. Even without tags, AWS Cost Categories lets
you assign organizational meaning to these costs by
mapping cost and usage to your internal organizational
structure.

Configure billing and cost optimization tools: You need to
set up tooling which provides reports, notifications,
the current state of workloads, trends and forecasts, and
tracking and analysis for all your workloads and teams. Tools such
as AWS Cost Explorer, AWS Budgets, Amazon Athena
and QuickSight provide these capabilities.

Allocate costs based on workload metrics: You can
determine workload metrics such as running time
(continuous versus periodic) or changes in
transaction patterns and accordingly see their impact on
cost. This helps you focus optimization activities on
meeting further business needs.

3. Decommissioning Resources

An important part of managing a workload is the ability to decommission
resources in a timely manner. This can be achieved by the following
practices:
Track resources over their lifetime: You can track
resources by applying tags and maintaining an inventory
using Systems Manager. A simple example would be
tagging all the testing resources and monitoring their
usage.

Implement a decommissioning process: A standard process
should be established for scanning for unused resources and
decommissioning them in a timely manner.

Decommission resources: The various factors that motivate
decommissioning resources should be determined.
These include the potential cost savings, the effort
required, a change in the workload’s state, a change in market
conditions, or product termination.

Decommission resources automatically: Dynamic
resources can be terminated automatically by using
Auto Scaling with workload-specific scale-down
policies. You can also make use of custom code triggered
by Amazon EventBridge (formerly CloudWatch Events) to
decommission workload resources automatically, as in the
sketch below.
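A minimal sketch of such automated decommissioning, assuming a hypothetical decommission-after tag containing an ISO date (YYYY-MM-DD); in practice this could run on a schedule as a Lambda function triggered by EventBridge.

    import boto3
    from datetime import date

    ec2 = boto3.client("ec2")

    # Find running instances whose hypothetical "decommission-after" tag date
    # has passed, then stop them.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag-key", "Values": ["decommission-after"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    expired = []
    for reservation in reservations:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if date.fromisoformat(tags["decommission-after"]) < date.today():
                expired.append(instance["InstanceId"])

    if expired:
        ec2.stop_instances(InstanceIds=expired)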

Cost-effective resources
With the cloud, you might be doing ten times what you used to do and achieving
outcomes accordingly. It becomes necessary to choose the appropriate
resources, services and configuration for your workload to achieve cost
savings. The following aspects should be considered:
1. Evaluate cost when selecting services

Identify organization requirements: You should maintain a
balance between cost and the other Well-Architected pillars,
such as performance and reliability. Based on this combined
consideration, you may architect a workload which does
not necessarily have the lowest cost but provides a more efficient
environment.

Analyse all workload components: You should not only
analyse the largest items contributing to workload cost;
each individual component, such as storage, metrics and
data transfer, should be looked into to determine its
current and future impact. For example, using a VPC
endpoint instead of a NAT Gateway for S3 data transfer
can have a huge impact on your NAT Gateway bill.

Managed services: AWS managed services reduce the
operational overhead of provisioning infrastructure and
managing patches and compliance. Services such as RDS,
Redshift, EMR and Elasticsearch can help you perform a faster
lift and shift from on-premises to the cloud while
taking away the burden of provisioning all the resources.

Serverless or application-level services: Serverless
services such as Lambda, SNS, SQS and SES remove
the need to manage dedicated resources. You benefit
from paying only for what your workload uses and from
automatic scaling for performance.

Analyse the workload for different usage over time: You
need to review your workloads at a set frequency
and evaluate them against new AWS service offerings. By
having such a process, you can determine the impact of
using different services on both the cost and the
performance efficiency of your workloads.

Licensing costs: The industry has changed, and
organizations have seen their costs shift towards
SaaS offerings. Instead of buying a software licence and
installing it, teams can now make use of SaaS offerings
and consume applications according to their requirements.
With the ability to build increasingly complex
environments in the cloud, organizations can also evaluate
open-source applications. This not only drastically reduces
cost but also enables business units to customize, thanks to the
agility offered by the cloud.

2. Select the correct resource type, size, and number

By selecting resources of the right type, size and number, you can meet
your technical requirements with the lowest-cost resources. You can consider
the following approaches:
Cost modelling: You can test your workload performance
under various load conditions and determine the right size
of resources required to run it. For running workloads,
you can make use of AWS Compute Optimizer, which can help
with right-sizing based on historical utilization. Additionally,
CloudWatch logs and metrics can be used as data sources
for other custom workloads and services.

Metrics or data-based selection: Based on the cost
modelling, you can come up with a data-driven approach to
selecting resources with the right amount of compute,
memory and throughput.

Automatic selection based on metrics: You can also make
use of AWS services such as Auto Scaling to automatically
select the right cluster size for your workload based on
metrics. Similarly, S3 Intelligent-Tiering can be used to
automatically move data between storage tiers
based on the access patterns of your stored objects.

3. Select the best pricing model

Perform workload cost modelling: Based on your
workload requirements, you can figure out the potential
pricing models. Some factors to consider are availability,
time-based load, and whether independent resources are being used
to run the workloads.

Perform regular account-level analysis: You can also check
whether your workloads run mostly On-Demand; in that case you
can consider implementing a commitment-based discount.
This analysis should be done over a period ranging from a
week to a few months.

Pricing models: AWS offers multiple pricing models that
allow you to pay for your resources in the most cost-
effective way for your organization’s needs. Here
are the pricing models currently available:

On-Demand: This is the default pricing model, where
you pay for your resources as you consume them.
You can increase or decrease your resource capacity,
such as EC2 instances or DynamoDB on-demand
capacity, as needed. This model is suited for short- to
medium-term workloads (a few months) which have
unpredictable utilization.

Spot: You can use the spare EC2 capacity offered
by AWS at discounts of up to 90% compared to
On-Demand Instances, with no long-term
commitment required. This is usually suitable for
fault-tolerant, non-critical workloads which do not need
to run at a specific time. You should set your maximum
price at the On-Demand price and be flexible about the
Availability Zones where the workloads will run in order to
fulfil your target capacity and get maximum
benefit. Additionally, you can keep your fleet as a
combination of On-Demand and Spot Instances to
ensure a minimum capacity is always available to
run your workloads.

Commitment discounts – Savings Plans: If you have
a workload which needs to run for a longer duration,
you can sign up for a Savings Plan, allowing you to
make an hourly spend commitment for one or three
years. This can be used with services such as EC2,
Fargate and Lambda. It provides a discount rate
which is applied to the On-Demand costs, with
flexible payment options such as all upfront,
partial upfront or no upfront.

Commitment discounts – Reserved Instances:
Instead of an hourly spend commitment, Reserved
Instances require you to commit to a specific amount of
resource utilization. There are also Convertible
Reserved Instances, through which you can change the
instance family, tenancy and operating system of your
committed EC2 resources.

EC2 Fleet: You can specify a target compute
capacity and the instance types through EC2 Fleet,
which balances On-Demand and Spot Instances to
meet your fleet requirements.

Geographical selection: While the first criterion to
consider when selecting a Region is its closeness to
your users, where there are multiple options you
should check the service pricing for each of
them to get the lowest possible price. The AWS
Simple Monthly Calculator can be used to estimate
costs across various Regions.

Third-party agreements and pricing: When you use
third-party software with your cloud workloads, you
should consider whether the agreements align with your cost
optimization goals. For software whose licensing
costs increase as your workload grows, you should
evaluate whether the increased cost is matched by a
corresponding benefit to the workload.

4. Plan for data transfer

Data transfer is one of the most underestimated cloud cost components
and is often ignored. However, it can play a big role in achieving your
savings objectives if you plan to optimize it.
Perform data transfer modelling: You need to understand
where data transfer happens in your environment
and identify the cost associated with it. Factors such as
inter-Availability-Zone data transfer should be considered
to evaluate whether the workloads can be optimized to achieve
the right balance between resiliency and cost.

Optimize data transfer: There are various ways in which
you can optimize the transfer of data within your workload
and to your users. Content delivery networks bring your
data closer to your users and reduce the transfer costs
associated with serving them. Similarly, for high-traffic patterns
it may be better to have a separate NAT Gateway in each AZ to
avoid cross-AZ data transfer costs and improve resiliency.

Select services to reduce data transfer costs: You can make
use of a dedicated AWS Direct Connect link to reduce data
transfer costs over the internet and also get consistent
connectivity performance. For data transfer between a
VPC and S3 or DynamoDB, instead of using NAT
Gateways you can consider using VPC endpoints, which
reduce public data transfer costs as well as NAT Gateway
processing charges; a minimal sketch follows.
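A minimal sketch of creating a gateway endpoint for S3 so that this traffic bypasses the NAT Gateway; the VPC ID, route table ID and the Region embedded in the service name are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Route S3 traffic from the VPC over a gateway endpoint instead of the
    # NAT Gateway, avoiding NAT processing and public data transfer charges.
    ec2.create_vpc_endpoint(
        VpcId="vpc-0abc123de456f7890",                 # placeholder VPC
        ServiceName="com.amazonaws.us-east-1.s3",      # placeholder Region
        VpcEndpointType="Gateway",
        RouteTableIds=["rtb-0abc123de456f7890"],       # placeholder route table
    )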

Manage demand and supplying resources

Since the cloud offers a pay-as-you-go model, you can eliminate the need
for costly and wasteful over-provisioning. With on-demand provisioning of
resources, you can ensure that resources run only when you
need them and scale up or down as required. You can use the following
approaches to manage demand and supply resources:
1. Analyse the workload

You need to analyse the predictability and repeatability of your
workload’s resource demand. This should also consider how demand
changes and the minimum and maximum delay acceptable in case of
failures. By making use of AWS Cost Explorer and QuickSight along
with the CUR, you can perform a visual analysis of your workload
demand.
2. Manage Demand

Throttling: For workloads whose clients have retry capability,
you can implement throttling. This makes the source
wait for a certain period of time and retry the request, which
in turn allows you to minimize simultaneous usage of your
resources, optimizing the cost of running them.
Buffer based: A buffer-based mechanism can be
implemented by using a queue to accept messages from
various producers and then having consumers process
them. Amazon SQS allows you to implement this
buffering approach: consumers process the work items (messages)
at their own rate, which reduces the load created by spikes in
producer demand. With Amazon Kinesis you can have a stream
which allows the messages to be consumed by multiple clients
at the same time.

3. Dynamic Supply

Demand-based supply: When there is a spike in traffic or
an increase in resource utilization, you can make use of
various services to programmatically scale your
architectural components. You can monitor resource
utilization using CloudWatch metrics and have scaling
policies in place to provision additional resources accordingly.
Time-based supply: If you have demand that is
predictable and well defined in time, you should
consider the time-based approach. An example of this
would be the processing of business reports at the end of the
day. Since it happens at a specific time, you can make use of
Auto Scaling scheduled actions to spin up resources at that
time and terminate them once the job is done (a minimal
sketch follows). You can also make use of Amazon EventBridge
schedules (cron expressions) to trigger Lambda functions which
provision custom resources at a pre-defined time.
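A minimal sketch of time-based supply using Auto Scaling scheduled actions; the group name, recurrence times and capacities are placeholders.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Scale the reporting fleet up at 18:00 UTC on weekdays...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="reporting-workers",      # placeholder group
        ScheduledActionName="evening-report-scale-up",
        Recurrence="0 18 * * 1-5",
        MinSize=4,
        MaxSize=10,
        DesiredCapacity=6,
    )

    # ...and back down to zero two hours later, once the job is done.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="reporting-workers",
        ScheduledActionName="evening-report-scale-down",
        Recurrence="0 20 * * 1-5",
        MinSize=0,
        MaxSize=10,
        DesiredCapacity=0,
    )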

Optimize over time

It is important to optimize over time by reviewing new service offerings and
changes to your requirements. Consider the following to maintain a consistently
optimized environment:
1. Develop a workload review process

You must establish a process to review your workload periodically to
ensure that your architectural decisions remain cost effective over
time. As your business grows, you may have outdated or legacy
components which need updating or revisiting in terms of their
contribution to the overall cost. For example, your initial architecture
may have several small databases which accumulate duplicate data over
time; you may consider using a central database which serves
different parts of your workloads instead. To keep the review
process efficient, you may review the workloads which contribute 50% of
your overall cost more frequently than the ones which contribute less
than 5%.

2. Review the workload and implement services

As new services and features are released, your review process should
consider implementing them after analysing the business impact of
making the changes. There are various AWS blogs and channels
which your teams can subscribe to in order to stay up to date with the
latest offerings.

Review based on Cost Optimization Pillar


Some of the questions which you will go through when you review
your workload against the cost optimization pillar of the Well-Architected
Framework are:
COST-1 How do you implement cloud financial management?
COST-2 How do you govern usage?
COST-3 How do you monitor usage and cost?
COST-4 How do you decommission resources?
COST-5 How do you evaluate cost when you select services?
COST-6 How do you meet cost targets when you select resource
type, size and number?
COST-7 How do you use pricing models to reduce cost?
COST-8 How do you plan for data transfer charges?
COST-9 How do you manage demand, and supply resources?
COST-10 How do you evaluate new services?

The answers to these questions during the review help you identify any gaps
in your existing workloads and implement the best practices in your AWS
environment.

Further Reading
https://d0.awsstatic.com/whitepapers/architecture/AWS-Cost-Optimization-
Pillar.pdf
