Site Reliability Engineer Nanodegree Program Syllabus

This document provides an overview of the Site Reliability Engineer (SRE) Nanodegree program. The goal of the program is to equip software developers with engineering and operational skills to build automation tools and responses that ensure systems meet requirements like availability, performance, security and maintainability. The 4-month, self-paced program includes video lectures, projects and mentor support. It covers topics like monitoring, high availability, disaster recovery, and database recovery. Graduates will be able to use strategies to identify reliability risks, develop service level objectives, create self-healing architectures, and design organizational processes to enhance reliability.

Uploaded by

vannda
NANODEGREE PROGRAM SYLLABUS

Site Reliability
Engineer

Need help? Speak with an Advisor: udacity.com/advisor


Overview
The goal of the Site Reliability Engineer (SRE) Nanodegree program is to equip software developers with
the engineering and operational skills required to build automation tools and responses that ensure
designed solutions respond to non-functional requirements such as availability, performance, security, and
maintainability. The content will focus on both designing systems to automate response to issues with software
sites as well as how to respond to common on-call situations.

Prerequisites
A well-prepared learner is already able to:
• Write basic functions in an object-oriented language (Python or Java) using for loops, conditionals,
control flow, and methods.
• Write basic shell scripts in Bash or PowerShell, which could include for loops, conditionals, etc.
• Understand the Linux command line (bash/shell) and the UNIX shell.
• Create simple SQL queries using SELECT, JOIN, and GROUP BY.
• Exercise networking skills including knowledge of virtual networks, DNS, subnets, and basic network
troubleshooting techniques.
• Perform DevOps tasks, such as setting up monitoring, doing feature rollout, troubleshooting production
issues, ideally for large systems.
• Work with Kubernetes and basic kubectl, such as kubectl apply, kubectl create, kubectl config.

Educational objectives
A graduate of this program will be able to:
• Use proactive and reactive SRE strategies (monitoring, postmortem, team building, etc.) to identify reliability
risks through evaluating systems and processes.
• Develop customer-centric SLOs (such as percentile targets for availability, latency, and correctness) and set up
corresponding monitoring and risk mitigation measures to ensure customer happiness.
• Create and deploy automated self-healing architectures and other technologies to make the environment
more maintainable.
• Design and implement organizational processes and culture that enhance product reliability, including
outage/postmortem review, quarterly state of production presentation, and production readiness review.

Estimated time: 4 months

Instructional tools available: Video lectures, mentor-led student community, forums, project reviews

Flexible learning: The program is flexible and self-paced with suggested project deadlines

Software/hardware and version requirements: There are no software and version requirements to complete
this Nanodegree program. All coursework and projects can be completed in the Udacity online classroom.
Udacity’s basic tech requirements can be found at udacity.com/tech/requirements.

*The length of this program is an estimation of total hours the average student may take to complete all
required coursework, including lecture and project time. If you spend about 5-10 hours per week working
through the program, you should finish within the time provided. Actual hours may vary.

Course 1: Foundations of Observability

In this course we will focus on what observability requires in terms of people and tools. To begin with, we
will introduce SRE, its roles and responsibilities, and how those differ from other teams (DevOps, SysAdmin,
Development). Once we establish that, we will see how SRE helps an enterprise improve and discuss the costs
associated with SRE. We will then look at the types of members on an SRE team and end with the tool set that
an SRE team may use to be successful.

Course 1 Project: Observing Cloud Resources
In this project, students will apply the skills they have acquired in the Establish a Foundation in
Observability course to configure a monitoring software stack to collect and display a variety of metrics
for commonly used cloud resources including VM Scale Sets, Kubernetes service, and VMs. Additionally,
students will establish and configure rules for alerting and set parameters to be notified prior to the
occurrence of failures within the aforementioned cloud resources. Students will also have the opportunity
to test and observe their own implementation of the monitoring software stack to apply and showcase SRE
methodologies and practices which can be transferred to real-world scenarios.

LEARNING OUTCOMES

LESSON ONE: SRE Roles and Responsibilities
• Identify the formation of SRE in the industry
• Compare the SRE scope of work and functions vs. adjacent roles (DevOps, sys admin and developers)
• Explain core skills of SRE

LESSON TWO: Improving Enterprise Workflows
• Identify common SRE practices, including incident response playbooks
• Explore enterprise workflows that can use reliability engineering
• Perform cost-benefit analysis of impact of SRE best practice on identified enterprise workflow for
improvement

LESSON THREE: SRE Teams
• Illustrate best practices for cross-functional collaboration with the development team
• Define the SRE team
• Develop governance of SRE team work quality

LESSON FOUR: Monitoring System Performance
• Install Prometheus and Grafana, and understand the installation steps and out-of-the-box configuration
• Create observability dashboards for host and site reliability metrics (latency, errors, and resource
utilization: CPU, RAM, disk I/O)
• Install and configure a synthetic monitoring solution
• Create alerts for application (availability, latency) metrics, monitor an endpoint, and trigger an alert
if the endpoint is down
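
The endpoint alerting outcome above can be illustrated with a small sketch. It is not taken from the course material: the `ProbeResult` type, the `should_alert` function, and the thresholds (three consecutive bad probes, 500 ms latency) are all hypothetical names and values chosen for illustration. In the course itself this logic would more likely live in the monitoring stack's alert rules than in hand-written Python.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """One synthetic check against an endpoint (hypothetical structure)."""
    ok: bool           # True if the endpoint answered successfully
    latency_ms: float  # observed response time of the probe


def should_alert(results: List[ProbeResult],
                 consecutive_failures: int = 3,
                 latency_threshold_ms: float = 500.0) -> bool:
    """Return True when each of the most recent probes failed or was too slow.

    Requiring several consecutive bad probes avoids alerting on a single blip.
    """
    if len(results) < consecutive_failures:
        return False  # not enough data to decide yet
    recent = results[-consecutive_failures:]
    return all((not r.ok) or r.latency_ms > latency_threshold_ms
               for r in recent)


if __name__ == "__main__":
    history = [
        ProbeResult(ok=True, latency_ms=120.0),
        ProbeResult(ok=False, latency_ms=0.0),
        ProbeResult(ok=False, latency_ms=0.0),
        ProbeResult(ok=False, latency_ms=0.0),
    ]
    print("alert" if should_alert(history) else "healthy")
```

Counting consecutive failures rather than a single failed probe is one common design choice; a rate over a time window would be an equally reasonable alternative.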

Course 2: Planning for High Availability and Incident Response
This course will cover monitoring, high availability (HA) and disaster recovery (DR), infrastructure as code,
and database recovery and availability. We start by defining SLOs and SLIs. We then take those SLOs and SLIs
and translate them into queries for Prometheus and graphs in Grafana. Next, we look at our infrastructure
overview, improve it with HA principles, and then craft a DR plan. We then take that plan and deploy it via
Terraform to multiple AWS regions. We wrap up the content by designing and deploying highly available
databases to AWS via Terraform.
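
As a rough illustration of how an SLI relates to an SLO, the sketch below computes an availability SLI from request counts and the share of error budget that remains. The function names and the example numbers are assumptions made for this illustration, not part of the course; in the course these values would be derived from Prometheus queries and shown in Grafana.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing has failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means the SLO is missed.
    """
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(f"SLI: {sli:.4%}")
    print(f"Error budget remaining: {error_budget_remaining(sli, 0.999):.0%}")
```

With these example numbers roughly half of the error budget is left, which is the kind of figure a non-technical dashboard can present directly.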

Course 2 Project: Deploying HA Infrastructure
In this project, students will design and deploy HA infrastructure through Terraform and deploy it to AWS.
Students will start by defining SLOs and SLIs and create a dashboard in Grafana for those objectives. Next,
they will create a disaster recovery plan and define their high-availability infrastructure. They will take
what they build and form Terraform code to deploy the infrastructure to multiple AWS regions. Finally, they
will deploy replicated databases through Terraform code to AWS.

LEARNING OUTCOMES

LESSON ONE: SLOs and SLIs
• Understand what SLI/SLOs are and how each relates to an SLA
• Define customer-centric SLOs
• Establish a plan on how to obtain metrics for SLOs/SLIs
• Create SLI/SLO dashboards in Grafana which display these metrics in a way that can be consumed by
non-technical personnel

LESSON TWO: IT Assets, Availability and Disaster Recovery
• Determine the purpose and needs of each IT asset
• Define a plan to consolidate IT assets
• Create a plan to allow for high availability by selecting optimal server geography and communication
• Create a disaster recovery plan based on a designed high-availability environment
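
One way to reason about a high-availability plan is to estimate what redundancy buys. The sketch below is an illustration only and rests on a strong assumption that is not stated in the syllabus: replicas (for example, regions) fail independently of each other. The function names are hypothetical.

```python
def parallel_availability(single: float, replicas: int) -> float:
    """Availability of N redundant replicas, assuming independent failures.

    The service is down only when every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = parallel_availability(0.99, n)
        print(f"{n} replica(s): {a:.6f} available, "
              f"{downtime_minutes_per_year(a):.1f} min/year down")
```

Real deployments rarely achieve full independence (shared DNS, shared deployment pipelines, and correlated outages all reduce the benefit), so these figures are best read as an upper bound.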

LESSON THREE: Create and Deploy HA and DR infrastructure using Terraform
• Add existing assets into Terraform
• Use Terraform to create identical IT assets in a different region/geography
• Given a scenario, test the recovery using the new infrastructure with high-availability

LESSON FOUR: High Availability and DR of Databases
• Explore log-shipping to a SQL DR instance
• Use full geo-replication for SQL databases
• Create automated backups for SQL databases

Course 3: Self-Healing Architecture
In this course, students will learn how to deploy microservices or cloud architecture that is resilient enough
to withstand failures and predictable enough to resolve issues via automation without human intervention.
This framework is known as self-healing architecture.

You’ll begin by learning some self-healing system design fundamentals such as single points of failure and
three-tier architecture. Then we will show you some self-healing deployment strategies, implementation
steps, and use cases. Finally, we’ll cover some cloud automation that you can use to increase the resiliency
of systems, such as auto-scaling automation.
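
The idea of resolving issues through automation can be sketched as a reconcile loop: compare the observed state with the desired state and replace whatever is unhealthy. This is a simplified simulation under stated assumptions, not the course's implementation; `reconcile`, the `instances` mapping, and the `launch` callback are hypothetical names, and a real system would call a cloud or orchestrator API instead.

```python
from typing import Callable, Dict, List


def reconcile(instances: Dict[str, bool],
              desired_count: int,
              launch: Callable[[], str]) -> List[str]:
    """Remove unhealthy instances and launch replacements.

    instances maps an instance id to its latest health-check result and is
    updated in place. Returns the ids of the instances that were launched.
    """
    unhealthy = [iid for iid, healthy in instances.items() if not healthy]
    for iid in unhealthy:
        del instances[iid]          # terminate the failed instance
    launched: List[str] = []
    while len(instances) < desired_count:
        new_id = launch()
        instances[new_id] = True    # assume a fresh instance starts healthy
        launched.append(new_id)
    return launched


if __name__ == "__main__":
    import itertools

    counter = itertools.count(1)
    fleet = {"web-a": True, "web-b": False, "web-c": True}
    replaced = reconcile(fleet, desired_count=3,
                         launch=lambda: f"web-new-{next(counter)}")
    print("launched:", replaced)
    print("fleet:", sorted(fleet))
```

Running the loop repeatedly is safe: when the fleet already matches the desired state, nothing is launched.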

Course 3 Project: Deployment Roulette
Students will play the role of an engineer who has just started at a growing consulting firm called Casa
de mi Padre. Due to some unfavorable company policies, the team they were supposed to have joined has left
the company. Due to their rush to leave, the applications they were working on were left in an
undocumented, unknown state. The company is raring to get back on pace, and students are tasked with
deploying them to the cloud. Some of the microservices have scaling or availability issues, and some don’t
have a deployment strategy in place. It’s up to the students to identify failing applications and
implement fixes to resolve the problems. Students will also create an architecture diagram that
communicates the status of the cloud environment to improve the onboarding of future developers.

LEARNING OUTCOMES

LESSON ONE: Design Self-Healing Systems and Visualize Them with Architecture Diagrams
• Identify single points of failure in system architecture and describe resolution strategies
• Describe three-tier architecture benefits and drawbacks
• Describe self-healing architecture automation strategies
• Describe best-practice microservice design for self-healing architecture
• Visualize self-healing system design by analyzing and creating diagrams

LESSON TWO: Implement Self-Healing Deployment Strategies
• Describe multiple deployment strategies and their benefits and drawbacks
• Assess in which scenarios to use specific deployment strategies
• Implement rolling, canary, and blue-green self-healing deployment strategies

LESSON THREE: Implement Scaling and Failover Automation Strategies for High-Availability Applications
• Describe cloud automation for scaling and failover
• Automate microservices scaling
• Automate virtual machines scaling
• Automate microservice cluster scaling
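
To give a feel for what scaling automation computes, the sketch below applies the proportional rule described in the Kubernetes Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas × current metric / target metric). The syllabus does not say which autoscaler the course uses, so treat this as an assumption. The function name and the min/max bounds are illustrative, and the real controller also applies a tolerance and stabilization behavior that is omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU with a 60% target: scale out.
    print(desired_replicas(4, current_metric=90.0, target_metric=60.0))
    # 4 replicas at 30% average CPU with a 60% target: scale in.
    print(desired_replicas(4, current_metric=30.0, target_metric=60.0))
```

The clamp keeps a runaway metric from requesting unbounded capacity and keeps at least one replica running when load drops to zero.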

Course 4: Establishing a Culture of Reliability
This course is all about establishing a lasting culture focused on reliability. In this course, students will learn
how to develop processes and frameworks that will drive their workplace towards putting reliability first.
Students will begin by working through the incident management process and how to have effective on-calls.
Following that, they will learn how to perform reliability reviews on various phases of a system. Next, they will
learn how to effectively manage system capacity without being wasteful. We will round out this course with a
lesson on how to reduce toil to free up your time to focus on the work that matters.

Course 4 Project: Plan, Reduce, Repeat
In this project, students will participate in several mock scenarios they might encounter as an SRE. There
will be three scenarios, each demonstrating different skills students have learned. In the first scenario,
Release Night, students will utilize capacity management skills as well as demonstrate how to maintain an
as-built document. In the second scenario, students will utilize on-call best practices to have an
effective and productive on-call, complete with a post-mortem. Finally, in the third scenario, students
will develop a toil reduction plan and perform some hands-on automation.

LEARNING OUTCOMES

LESSON ONE: Improving On-Call Effectiveness
• Understand and utilize incident management process
• Exhibit on-call best practices to have balanced and effective on-calls
• Effectively write blameless post-mortems

LESSON TWO: Performing Reliability Reviews
• Document and review reliability for new features
• Maintain an as-built document
• Perform a launch review
• Analyze reliability risks based on documentation

LESSON THREE: Managing System Capacity
• Perform load test
• Analyze capacity requirements
• Utilize tiered capacity to effectively manage capacity for present, future, and emergency needs
• Mitigate capacity risks by utilizing capacity management best practice

LESSON FOUR: Toil Reduction
• Identify and measure toil
• Employ common toil reduction strategies
• Develop and execute a toil reduction plan
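
Measuring toil can start with something as simple as logging hours per repetitive task. The sketch below is a hypothetical illustration: the task names, hours, and function names are invented, and the syllabus does not prescribe a particular measurement method.

```python
from typing import Dict, List, Tuple


def toil_fraction(task_hours: Dict[str, float], total_hours: float) -> float:
    """Share of working time spent on the logged toil tasks."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return sum(task_hours.values()) / total_hours


def rank_toil(task_hours: Dict[str, float]) -> List[Tuple[str, float]]:
    """Tasks ordered from most to least time-consuming.

    The top entries are the first candidates for a toil reduction plan.
    """
    return sorted(task_hours.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical hours logged by one engineer in a 40-hour week.
    week = {"manual deploys": 6.0, "cert renewals": 2.0, "ticket triage": 4.0}
    print(f"Toil: {toil_fraction(week, total_hours=40.0):.0%} of the week")
    for task, hours in rank_toil(week):
        print(f"  {task}: {hours:.1f} h")
```

Tracking the fraction over several weeks shows whether a reduction plan is actually working.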

Our classroom experience
REAL-WORLD PROJECTS
Build your skills through industry-relevant projects. Get
personalized feedback from our network of 900+ project
reviewers. Our simple interface makes it easy to submit
your projects as often as you need and receive unlimited
feedback on your work.

KNOWLEDGE
Find answers to your questions with Knowledge, our
proprietary wiki. Search questions asked by other students
and discover in real time how to solve the challenges that
you encounter.

WORKSPACES
See your code in action. Check the output and quality of
your code by running it in workspaces that are part
of our classroom.

QUIZZES
Check your understanding of concepts learned in the
program by answering simple and auto-graded quizzes.
Easily go back to the lessons to brush up on concepts any
time you get an answer wrong.

CUSTOM STUDY PLANS


Create a custom study plan to suit your personal needs
and use this plan to keep track of your progress toward
your goal.

PROGRESS TRACKER
Stay on track to complete your Nanodegree program with
useful milestone reminders.

Learn with the Best

Nathan Anderson, MBA
GLOBAL CLOUD ARCHITECT
Nathan is a Certified Six Sigma Black Belt and has 10+ years of experience in IT in multiple industries.
He is also the instructor for two other Udacity courses: Ensuring Quality Releases and Azure Performance.

Travis Scotto
SITE RELIABILITY ENGINEER
Travis Scotto has worked in technology for 10 years. He has worked in various infrastructure roles
including virtualization, databases, and monitoring. As an SRE, he employs automation and monitoring
daily. He has also been an adjunct IT instructor.

Emmanuel Apau
CTO OF MECHANICODE.IO
Emmanuel is the CTO of consulting firm Mechanicode.io, co-founder of the Black Code Collective, and DC’s
Technical.ly RealLIST Engineer award recipient. An AWS Certified DevSecOps specialist with 12 years of
experience, he has spent his career developing innovative solutions using DevSecOps & site reliability
best practices.

Sonny Sevin
SITE RELIABILITY ENGINEER
Sonny is an SRE with a varied background. He dabbled in research at Lawrence Berkeley National Labs before
moving into site reliability engineering to have a more hands-on role. He has been published in several
computing journals as well as taught introductory programming courses.

All our Nanodegree programs include:

EXPERIENCED PROJECT REVIEWERS

REVIEWER SERVICES
• Personalized feedback and line-by-line code reviews
• 1600+ reviewers with a 4.85/5 average rating
• 3-hour average project review turnaround time
• Unlimited submissions and feedback loops
• Practical tips and industry best practices
• Additional suggested resources to improve

TECHNICAL MENTOR SUPPORT

MENTORSHIP SERVICES
• Questions answered quickly by our team of technical mentors
• 1000+ mentors with a 4.7/5 average rating
• Support for all your technical questions

PERSONAL CAREER SERVICES

CAREER SUPPORT
• Resume support
• GitHub portfolio review
• LinkedIn profile optimization

Frequently asked questions
PROGRAM OVERVIEW

WHY SHOULD I ENROLL?


This program is designed to help you take advantage of the growing need for
skilled site reliability engineers. Prepare to meet the demand for qualified
site reliability engineers who can respond to real-life, high-stakes workplace
challenges.

WHAT JOBS WILL THIS PROGRAM PREPARE ME FOR?


The skills you will gain from this Nanodegree program will qualify you for jobs
in several industries, as countless companies are trying to incorporate better
site reliability practices into their organizations.

HOW DO I KNOW IF THIS PROGRAM IS RIGHT FOR ME?


The course is for individuals who are looking to advance their site reliability
engineering careers with skills in a burgeoning field.

ENROLLMENT AND ADMISSION

DO I NEED TO APPLY? WHAT ARE THE ADMISSION CRITERIA?


No. This Nanodegree program accepts all applicants regardless of experience
or specific background.

WHAT ARE THE PREREQUISITES FOR ENROLLMENT?


A well-prepared learner is already able to:
• Write basic functions in an object-oriented language (Python or Java) using
for loops, conditionals, control flow, and methods.
• Write basic shell scripts in Bash or PowerShell, which could include for
loops, conditionals, etc.
• Understand the Linux command line (bash/shell) and the UNIX shell.
• Create simple SQL queries using SELECT, JOIN, and GROUP BY.
• Exercise networking skills including knowledge of virtual networks, DNS,
subnets, and basic network troubleshooting techniques.
• Perform DevOps tasks, such as setting up monitoring, doing feature rollout,
troubleshooting production issues, ideally for large systems.
• Work with Kubernetes and basic kubectl, such as kubectl apply,
kubectl create, kubectl config.

IF I DO NOT MEET THE REQUIREMENTS TO ENROLL, WHAT SHOULD I DO?


Students who do not feel comfortable with the above may consider taking any of
the web development Nanodegree programs (Cloud Developer, Azure Cloud
Developer, or Full Stack Web Developer).

FAQs continued
TUITION AND TERM OF PROGRAM

HOW IS THIS NANODEGREE PROGRAM STRUCTURED?


The Site Reliability Engineer Nanodegree program consists of content and curriculum to
support 4 projects. We estimate that students can complete the program in 4
months working 5-10 hours per week.

Each project will be reviewed by the Udacity reviewer network. Feedback will
be provided and if you do not pass the project, you will be asked to resubmit
the project until it passes.

HOW LONG IS THIS NANODEGREE PROGRAM?


Access to this Nanodegree program runs for the length of time specified
above. If you do not graduate within that time period, you will continue
learning with month-to-month payments. See the Terms of Use and FAQs for other
policies regarding the terms of access to our Nanodegree programs.

CAN I SWITCH MY START DATE? CAN I GET A REFUND?


Please see the Udacity Program Terms of Use and FAQs for policies on
enrollment in our programs.

WHAT SOFTWARE AND VERSIONS WILL I NEED IN THIS PROGRAM?


There are no software and version requirements to complete this
Nanodegree program. All coursework and projects can be completed via
Student Workspaces in the Udacity online classroom. Udacity’s basic tech
requirements can be found at udacity.com/tech/requirements.
