Engineering operational work to scale with a growing application is BEST achieved by addressing which of the following issues?
Staffing levels
Interruptions
Toil
On-call rotations
Comprehensive and Detailed Explanation From Exact Extract:
One of the central goals of SRE is that operational work must scale sublinearly with service growth. The SRE Book states: “If operational load grows linearly with service size, the model is unsustainable. Eliminating toil is key to scaling operations.” (SRE Book – Chapter: Eliminating Toil). Toil prevents scaling because it is manual, repetitive, and tied directly to human effort.
Option C is the only answer that reflects this principle: reducing or eliminating toil enables SRE teams to support growing applications without increasing human labor proportionally.
Option A (staffing levels) does not scale sustainably.
Option B (interruptions) relates to productivity but not true scalability.
Option D (on-call rotations) affects fatigue, not the scaling of operational work.
Thus, C is the correct and SRE-authentic answer.
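To make the scaling argument concrete, here is a minimal sketch comparing manual (toil-style) effort with automated effort as the number of services grows. The function names and all figures are hypothetical and are not taken from the SRE Book. The automated model assumes a fixed upkeep cost plus a small residual cost per service.

```python
def manual_ops_hours(services: int, hours_per_service: float) -> float:
    """Toil-style work: effort grows in direct proportion to service count."""
    return services * hours_per_service


def automated_ops_hours(services: int, fixed_hours: float,
                        residual_hours_per_service: float) -> float:
    """Automated work: a fixed upkeep cost plus a small per-service residue."""
    return fixed_hours + services * residual_hours_per_service


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the shape of the two curves.
    for n in (10, 100, 1000):
        print(n, manual_ops_hours(n, 2.0), automated_ops_hours(n, 40.0, 0.05))
```

Under these assumed numbers, automation costs more at a small service count and far less at a large one, which is why eliminating toil, rather than adding staff, is the answer.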
Why is observability potentially better than traditional monitoring?
Observability is less expensive than traditional monitoring
Traditional monitoring does not adapt well to the cloud since it focuses on discrete components and applications
Traditional monitoring can struggle to scale when service growth is rapid
Traditional monitoring cannot support containers
Comprehensive and Detailed Explanation From Exact Extract:
Traditional monitoring works well when systems are static and predictable. However, cloud-native, distributed, and microservice-based architectures create highly dynamic environments. In these cases, observability becomes more effective because it provides visibility across entire systems, rather than focusing on individual components.
From Google’s Observability guidance:
“Traditional monitoring relies on predefined dashboards and known failure modes. In modern cloud systems, component-level monitoring becomes insufficient because failures occur in ways that cannot always be predicted.”
Further, in the SRE Workbook:
“Monitoring individual components does not provide adequate visibility into complex distributed systems. Observability enables teams to understand system-wide behavior and user impact.”
Why options are incorrect:
A Observability is not inherently cheaper.
C While true, it is not the best reason; observability's benefit is broader than scale alone.
D Traditional monitoring can support containers but often becomes noisy and ineffective.
Thus, the best answer is B.
Following a major outage, an analysis of the outage is conducted. This BEST describes an example of which of the following?
A follow-up culture
A major incident culture
A postmortem culture
A problem culture
Comprehensive and Detailed Explanation From Exact Extract:
Google’s SRE approach emphasizes a blameless postmortem culture as a core learning mechanism. After a major outage, SRE teams conduct structured analyses to understand the root causes, contributing factors, and systemic weaknesses. The SRE Book defines this culture explicitly: “Postmortems are written analyses following incidents, designed to capture what happened, why it happened, and how to prevent the issue from recurring.” (SRE Book – Chapter: Postmortem Culture). This learning-focused approach reduces blame, increases resilience, and improves future reliability.
Option C aligns exactly with this principle.
Option A (follow-up culture) is vague and not an SRE term.
Option B (major incident culture) refers to incident handling, not learning afterward.
Option D (problem culture) is unrelated to SRE’s structured post-incident learning.
Thus, C is correct.
What is the primary difference between SRE and DevOps?
SRE is an implementation of DevOps but focuses mostly on post-production responsibilities
DevOps is mostly for software engineers and SRE is mostly for infrastructure engineers
DevOps encourages closer collaboration between development and operations whereas SRE is about building a silo around production operations
DevOps and SRE are the same thing
Comprehensive and Detailed Explanation From Exact Extract:
The primary difference between SRE and DevOps lies in their implementation focus and origins, though they share similar objectives. According to Google’s official SRE documentation:
“SRE can be seen as a specific implementation of DevOps with some idiosyncratic extensions.”
— Site Reliability Engineering Book, Chapter: What is Site Reliability Engineering?
While DevOps is a broad cultural and organizational philosophy aimed at closing the gap between development and operations through collaboration and automation, SRE provides a concrete, engineering-driven approach to achieving those goals — particularly through practices like error budgets, SLIs/SLOs, toil reduction, and incident response.
SRE focuses heavily on the post-production lifecycle — including reliability, monitoring, capacity planning, and incident response — whereas DevOps includes these concerns but emphasizes the entire software delivery lifecycle. Hence, Option A is the correct and most accurate answer.
Options B, C, and D are incorrect:
B wrongly implies a division of roles (DevOps = developers, SRE = infrastructure), which is not how these frameworks operate.
C misrepresents SRE — it does not build silos but instead emphasizes shared responsibility and transparency in production systems.
D is incorrect because, while aligned, SRE and DevOps are not identical.
“Problem-solving with a group of people with different skillsets.”
Which of the following concepts is BEST inferred by the above statement?
Coordination
Collaboration
Communication
Cooperation
Comprehensive and Detailed Explanation From Exact Extract:
The SRE model heavily emphasizes cross-functional teamwork. In the SRE Workbook and chapters addressing incident management, Google defines collaboration as “bringing together individuals with diverse expertise to jointly solve problems and make decisions.” Collaboration implies active engagement, shared goals, and joint execution—exactly what the statement describes.
Option B, Collaboration, fits perfectly because effective problem-solving during incidents, launches, or reliability engineering work requires engineers from multiple disciplines (e.g., SRE, developers, network teams, product teams) to work together directly.
Option A (Coordination) is more about task alignment, not joint problem-solving.
Option C (Communication) is necessary but insufficient for solving problems together.
Option D (Cooperation) implies helpfulness, not necessarily integrated problem-solving.
Thus, B is the correct concept.
What metrics will embracing failure help to improve?
Mean time to detect and mean time between system incidents
Change lead time and change failure rate
Empirical test data and mean time to recover service
Mean time to detect and mean time to recover
Comprehensive and Detailed Explanation From Exact Extract:
Embracing failure—through practices such as blameless postmortems, chaos engineering, and proactive detection—enables organizations to improve their incident response performance. This directly improves:
MTTD (Mean Time to Detect)
MTTR (Mean Time to Recover)
The Site Reliability Engineering Book, chapter “Postmortem Culture,” states:
“By examining failures without blame and learning from them, organizations improve their ability to detect issues faster and recover more quickly.”
Similarly, in the SRE Workbook, section on incident response:
“Learning from incidents is essential to reducing time to detection and time to mitigation.”
Why the other options are incorrect:
A MTBSI (Mean Time Between System Incidents) is influenced by architecture and testing, not directly by embracing failure.
B These are DORA metrics — important, but not primarily tied to failure-embracing practices.
C Too vague and not a standard SRE metric pair.
Thus, D is the correct answer.
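As an illustration of how the two metrics might be computed, the sketch below derives MTTD and MTTR from incident timestamps. The `Incident` record and its field names are hypothetical. The sketch also assumes that recovery time is measured from the start of the fault; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average delay between fault start and detection."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recover(incidents: List[Incident]) -> timedelta:
    """Average delay between fault start and recovery (assumed convention)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.recovered - i.started for i in incidents), timedelta())
    return total / len(incidents)
```

Tracking both values across postmortems gives a simple way to check whether failure-embracing practices are shortening detection and recovery over time.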
Which of the following BEST describes capacity planning?
Monitoring the percentage of capacity of resources being used over a time period
Activities performed to manage provider resources and provide multiple services
Activities used to create a plan that manages resources to meet service demand
Determining the maximum amount that any resource can accommodate or deliver
Comprehensive and Detailed Explanation From Exact Extract:
SRE defines capacity planning as the discipline of ensuring that a system has enough resources to meet expected demand, both now and in the future. The SRE Book states: “Capacity planning ensures that services have sufficient resources available to meet reliability and performance targets, accounting for growth, trends, and forecasted usage.” (SRE Book – Chapter: Capacity Planning). This involves forecasting workloads, analyzing trends, and creating plans to scale infrastructure so that service-level objectives can continue to be met.
Option C correctly describes capacity planning as creating a resource management plan to meet demand.
Option A refers to capacity monitoring, not planning.
Option B reflects generic resource management or cloud provider operations, not SRE capacity planning.
Option D refers to determining maximum capacity, which is a measurement activity—not full planning.
Thus, C is the correct SRE-aligned answer.
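The forecasting step described above can be sketched as follows. This is a simplified model, assuming compound monthly growth and a fixed headroom margin. The function names, the growth model, and the default headroom are illustrative assumptions, not a method prescribed by the SRE Book.

```python
import math


def forecast_demand(current_peak: float, monthly_growth: float, months: int) -> float:
    """Project peak demand forward, assuming compound monthly growth."""
    return current_peak * (1 + monthly_growth) ** months


def instances_needed(demand: float, per_instance_capacity: float,
                     headroom: float = 0.5) -> int:
    """Instances required to serve the demand plus a safety margin."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    return math.ceil(demand * (1 + headroom) / per_instance_capacity)
```

For example, `instances_needed(forecast_demand(1000, 0.1, 6), 100)` estimates the fleet size six months out for a service currently peaking at 1000 requests per second, where each instance is assumed to handle 100.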
A bank has been using traditional monitoring tools for ensuring that their systems are available and operating as planned. Their strategic initiatives now include a renewed focus on customer experience as well as identifying ways to scale service.
Why would migrating to an observability approach be important now?
It’s better for managing container workloads and dynamic architectures
Monitoring at the component level may no longer provide the right data
It is impossible to anticipate all potential problems
All of the above
Comprehensive and Detailed Explanation From Exact Extract:
All the listed reasons correctly describe why observability becomes essential in modern, user-focused, dynamically scaling architectures.
The SRE Workbook and Google Observability guidance both emphasize that traditional monitoring is insufficient in environments where:
Services are distributed
Traffic is unpredictable
Customer experience is a priority
Cloud-native, containerized, or microservice architectures are used
Key excerpts:
From Google’s Observability guidance:
“Monitoring relies on known failure modes; observability enables teams to explore unknown-unknowns and understand complex, dynamic systems.”
From the SRE Workbook:
“As systems scale and architectures shift toward microservices or containers, component-level monitoring provides an incomplete picture. Observability enables teams to understand user impact and system behavior holistically.”
Thus:
A Observability is critical for containerized and dynamic environments.
B Component monitoring alone cannot show customer experience or end-to-end reliability.
C Observability helps teams diagnose issues that could not be predicted in advance ("unknown unknowns").
All statements are correct, making D the correct answer.
Where should an organization store versioned and signed artifacts that are used to deploy system components?
In the Configuration Management System (CMS)
In a Subversion source code repository
In a Definitive Media Library (DML)
In a secure artifact repository
Comprehensive and Detailed Explanation From Exact Extract:
SRE and modern DevOps best practices require that build artifacts—such as binaries, container images, and deployment packages—be stored in a secure, versioned artifact repository. These repositories ensure integrity, traceability, immutability, and security of deployment packages.
While the SRE Book does not use the ITIL term DML, it emphasizes:
“All production binaries should be stored in a secure, versioned repository to ensure consistent, repeatable, and trustworthy deployments.”
— Site Reliability Engineering Book, section on Release Engineering
The SRE Workbook expands on this principle by emphasizing signed and verified artifacts:
“To ensure safe rollout, artifacts must be built once, stored securely, signed, versioned, and deployed from a controlled artifact repository.”
Why the other options are incorrect:
A A CMS manages configuration, not deployment artifacts.
B Subversion is a source code repository, not an artifact repository.
C A DML is an ITIL concept, but SRE practice does not rely on it; instead, SRE uses modern artifact repositories (e.g., GCR, ACR, Artifactory).
Thus, the correct answer is D.
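A minimal sketch of the verify-before-deploy idea, using only the Python standard library. It uses an HMAC over the artifact's SHA-256 digest as a stand-in for signing; this is an assumption made for brevity, since real artifact repositories typically rely on asymmetric signatures. The function names are hypothetical.

```python
import hashlib
import hmac


def sha256_digest(artifact: bytes) -> str:
    """Content digest that identifies exactly one version of the artifact."""
    return hashlib.sha256(artifact).hexdigest()


def sign_digest(digest: str, key: bytes) -> str:
    """Stand-in signature: an HMAC-SHA256 over the digest with a shared key."""
    return hmac.new(key, digest.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_artifact(artifact: bytes, expected_signature: str, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    actual = sign_digest(sha256_digest(artifact), key)
    return hmac.compare_digest(actual, expected_signature)
```

A deployment step would call `verify_artifact` on the bytes it pulled from the repository and refuse to proceed if the result is `False`.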
Reliability is a key pillar of digital experience monitoring and incident management.
Which of the following describes the BEST type of reliability monitoring strategy in SRE?
A strategy that uses traditional and familiar monitoring tools rather than advanced artificial intelligence
A strategy that instruments observability and provides monitoring insights across all components and layers
A strategy that focuses on monitoring and discovering useful patterns in the performance of all active networks
A strategy that harnesses advanced technologies to measure, analyze, and maintain the fitness of applications
Comprehensive and Detailed Explanation From Exact Extract:
SRE defines effective monitoring as comprehensive observability across all layers of a system, including latency, traffic, errors, saturation, dependencies, and infrastructure. The SRE Book states: “Monitoring must offer insight across all system components, enabling teams to rapidly detect and diagnose issues.” (SRE Book – Monitoring Distributed Systems). Observability instrumentation (logs, metrics, traces) provides the necessary depth for reliable digital experience monitoring.
Option B captures this exactly: broad observability across all components and layers.
Option A rejects modern observability practices—contradicting SRE guidance.
Option C is too narrow (network-only).
Option D focuses only on advanced technologies, not comprehensive coverage.
Thus, B is the best answer.
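The four signals named above (latency, traffic, errors, saturation) can be summarized from raw request records as in the sketch below. The `RequestRecord` type, the choice of median latency, the rule that HTTP status 500 and above counts as an error, and the externally supplied utilization figure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class RequestRecord:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(records: List[RequestRecord], window_seconds: float,
                   utilization: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if not records or window_seconds <= 0:
        raise ValueError("need at least one record and a positive window")
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "latency_median_ms": median([r.latency_ms for r in records]),
        "traffic_rps": len(records) / window_seconds,
        "error_rate": errors / len(records),
        "saturation": utilization,  # e.g. CPU fraction reported by the host
    }
```

In practice these values would be collected per component and per layer, which is the breadth that option B describes.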
Which of the following describes work that would be considered "toil"?
Work that is devoid of enduring value
Work that has some enduring value but requires manual tasks
Engineering work to add service features
Engineering work that does not add enduring value
Comprehensive and Detailed Explanation From Exact Extract:
“Toil” in SRE has a very specific meaning. According to the Site Reliability Engineering Book, Chapter “Eliminating Toil”:
“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, has no enduring value, and scales linearly as the service grows.”
The key phrase is “no enduring value.” Toil does not produce lasting improvement, even though it may be necessary in the short term. It consumes engineering effort without making the system better over time.
Why the other options are incorrect:
B Work that has some enduring value cannot be classified as toil by definition.
C Engineering work that adds service features is explicitly non-toil, because SRE defines feature work as “project work,” not operational toil.
D Seems close but is misleading: engineering work without enduring value is poor engineering, not necessarily toil. Toil refers to operations workload specifically.
Thus, A is the correct and precise definition of toil.
Which of the following is BEST described as the role responsible for maintaining the live incident state document?
The logistics specialist
The communications lead
The planning specialist
The incident commander
Comprehensive and Detailed Explanation From Exact Extract:
In SRE incident management, Google defines several formal roles during a major incident, including Incident Commander (IC), Communications Lead, Operations/Responder, and Planning Specialist. According to the SRE Workbook: “The Planning Lead is responsible for maintaining the source-of-truth incident state document, tracking action items, and ensuring the IC has the current situation overview.” (SRE Workbook – Chapter: Incident Management). This document contains timelines, changes, decisions, diagnostics, and action items—all crucial for reducing cognitive load during high-stress situations.
Option C, the Planning Specialist (called the Planning Lead in the quoted passage), is therefore correct.
Option A (Logistics Specialist) is not defined as a core SRE incident role.
Option B (Communications Lead) manages outward communication, not the live incident log.
Option D (Incident Commander) leads the incident but delegates documentation to the planning role.
Hence, option C is the only answer that aligns with SRE’s defined responsibilities.
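To show what a live incident state document might hold, here is a minimal sketch of it as a data structure. The class and field names are hypothetical; the fields simply mirror the contents listed above (timeline, decisions, action items).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class IncidentState:
    title: str
    incident_commander: str
    status: str = "investigating"
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def log(self, entry: str) -> None:
        """Append a timestamped entry (change, decision, or diagnostic)."""
        self.timeline.append((datetime.now(timezone.utc), entry))

    def summary(self) -> str:
        """One-line situation overview for the incident commander."""
        latest = self.timeline[-1][1] if self.timeline else "no entries yet"
        return (f"[{self.status}] {self.title} "
                f"(IC: {self.incident_commander}); latest: {latest}")
```

Keeping a single shared record like this is what lets the planning role give the incident commander a current overview on demand.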
What types of outages must fit into an Error Budget?
Unplanned incidents
Defect fixes
Any planned or unplanned outage
Any change approved by the CAB or decision authority
Comprehensive and Detailed Explanation From Exact Extract:
An error budget accounts for all downtime, including both planned and unplanned outages. This is a critical SRE principle: the user does not distinguish between maintenance downtime and accidental downtime, so neither the SLO nor the error budget should distinguish between them either.
The SRE Book, Chapter “Service Level Objectives,” states:
“From the user’s perspective, availability is simply whether the service is working or not, regardless of whether the outage was planned or unplanned.”
This means all downtime counts toward the error budget.
Additionally, the SRE Workbook reinforces this point:
“Error budgets must include every form of unavailability — maintenance events, configuration changes, emergency work, and unexpected incidents.”
This confirms that planned outages (maintenance windows) and unplanned outages (incidents) both consume error budget.
Why the other options are incorrect:
A Only includes unplanned incidents; SRE requires counting planned outages as well.
B Defect fixes may contribute to downtime, but “defect fixes” alone are not a downtime category.
D CAB approval has no bearing on whether outages count toward error budgets.
Thus, C is correct: any planned or unplanned outage must be included.
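The arithmetic behind this can be sketched as follows, assuming a time-based availability SLO. The function names and the sample figures are illustrative.

```python
def error_budget_minutes(slo: float, period_minutes: float) -> float:
    """Total downtime the SLO tolerates over the period."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1 - slo) * period_minutes


def budget_remaining(slo: float, period_minutes: float,
                     planned_minutes: float, unplanned_minutes: float) -> float:
    """Planned and unplanned downtime draw from the same budget."""
    spent = planned_minutes + unplanned_minutes
    return error_budget_minutes(slo, period_minutes) - spent
```

With a 99.9% SLO over a 30-day period (43,200 minutes), the budget is about 43.2 minutes. A 20-minute maintenance window plus a 15-minute incident leaves roughly 8.2 minutes, regardless of which outage was planned.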
Known workarounds represent what type of toil?
Linear scaling
Tactical
Automatable
No enduring value
Comprehensive and Detailed Explanation From Exact Extract:
Known workarounds represent toil that has no enduring value, one of the key characteristics of toil defined by the SRE framework.
From the Site Reliability Engineering Book, Chapter “Eliminating Toil”:
“Toil is work that is manual, repetitive, automatable, tactical, has no enduring value, and scales linearly with service size.”
Known workarounds fit this definition because:
They solve the same recurring problems repeatedly
They do not permanently fix the underlying issue
They consume engineer time without contributing long-term improvements
These activities lack enduring value and should be eliminated through automation or engineering fixes.
Option analysis:
A. Linear scaling — Many forms of toil scale linearly, but this does not specifically describe workarounds.
B. Tactical — Tactical means short-term, but not all tactical work is a workaround.
C. Automatable — While some workarounds can be automated, not all are.
D. No enduring value — This is the defining trait of workaround-type toil.
Therefore, option D is correct.
Which of the following BEST describes the most important rationale for NOT seeking an SLO of 100% availability?
It is not realistic for the complexity and scale of services.
The likely result is failure where such targets are defined.
There is no room for improvements if targets are so high.
The user satisfaction score is affected by a low percent.
Comprehensive and Detailed Explanation From Exact Extract:
The SRE Book clearly states: “A target of 100% availability is neither realistic nor economically viable at scale.” Complex distributed systems inherently experience failures, network issues, hardware faults, and dependency outages. SRE emphasizes embracing this reality through error budgets, which assume some failure and allow engineering resources to be used efficiently.
The primary reason not to set 100% availability is that it is impossible to achieve reliably and leads to wasted engineering effort. SRE states: “Chasing perfect reliability leads to dramatically increasing costs with diminishing returns.”
Option A captures this rationale precisely.
Options B, C, and D are secondary or incorrect interpretations and do not come directly from SRE principles.
Thus, A is the correct SRE-aligned answer.
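One way to see why 100% is unrealistic for complex services is to multiply the availabilities of the dependencies a request must pass through. The sketch below assumes that every dependency is required and that failures are independent; both are simplifications, and the function name is hypothetical.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a request path in which every dependency must succeed."""
    return math.prod(availabilities)
```

For example, `serial_availability([0.9999] * 5)` is lower than 0.9999: five dependencies that each offer "four nines" already yield less than four nines end to end, so the combined figure can only move further from 100% as complexity grows.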
Which of the following is a principle of SRE-Led Service Automation?
No automated tests in production
Environments provisioned using IaC
Using unsigned artifacts in production
Adding as much hardware as possible
Comprehensive and Detailed Explanation From Exact Extract:
SRE-led service automation focuses on making environments reproducible, reliable, and consistent. One of the key principles aligned with Google SRE practices is the use of Infrastructure as Code (IaC), which allows environments to be provisioned automatically, consistently, and predictably.
The Site Reliability Engineering Book, in its discussions on automation, states:
“Automation implemented as code ensures that environments are consistent, repeatable, and less prone to human error.”
The SRE Workbook expands on this concept:
“Infrastructure as Code allows services to scale and evolve reliably by ensuring that configuration and infrastructure changes are automated and version-controlled.”
IaC is fundamental to:
Reducing toil
Increasing reliability
Enabling consistent automation across environments
Reducing configuration drift
Why the other options are incorrect:
A SRE supports testing in production; it does not ban automated tests.
C Using unsigned artifacts violates security and reliability best practices.
D Adding hardware is not an automation principle and contradicts efficiency goals.
Thus, the correct answer is B.
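The core idea of IaC, declaring a desired state and computing the changes needed to reach it, can be sketched in a few lines. This is a toy model with hypothetical names (`plan`, `apply_plan`); it is not the API of any real IaC tool.

```python
from typing import Any, Dict


def plan(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Any]:
    """Compute the changes needed to move the actual state to the desired state."""
    return {
        "create": {k: v for k, v in desired.items() if k not in actual},
        "update": {k: v for k, v in desired.items()
                   if k in actual and actual[k] != v},
        "delete": sorted(k for k in actual if k not in desired),
    }


def apply_plan(actual: Dict[str, Any], changes: Dict[str, Any]) -> Dict[str, Any]:
    """Return the new state after applying a plan; the input is left unchanged."""
    new_state = {k: v for k, v in actual.items() if k not in changes["delete"]}
    new_state.update(changes["create"])
    new_state.update(changes["update"])
    return new_state
```

Applying a plan and then planning again yields an empty set of changes. That idempotence is what makes provisioning repeatable and keeps configuration drift visible.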
Which of the following is the MOST accurate description of Kubernetes?
A proprietary system developed to automate the integration, building, testing, and deployment of application containers
An independent platform that enables organizations to implement continuous integration and delivery practices
A platform used to manage containers in a cloud environment and also includes automated scaling and failover
An open-source operating system on which containerized applications can be run, monitored, and managed efficiently
Comprehensive and Detailed Explanation From Exact Extract:
Kubernetes is described in SRE-aligned literature as an open-source container orchestration platform that automates deployment, scaling, failover, and lifecycle management of containerized applications. The Site Reliability Workbook references Kubernetes as: “a container management system that automatically handles service discovery, scaling, rollout management, and self-healing.” (SRE Workbook – Production Environment chapters). Kubernetes does not replace an OS, nor is it a CI/CD platform; it sits on top of an OS and orchestrates containers across clusters.
Option C is the most accurate: it captures container management, cloud deployment context, automated scaling, and failover—key capabilities of Kubernetes.
Options A and B incorrectly describe CI/CD platforms.
Option D incorrectly labels Kubernetes as an “operating system.”
Thus, C is correct.
Which of the following features of Puppet Labs is described as the ability to locate, identify, and group cloud nodes?
Provisioning
Delivery
Discovery
Insight
Comprehensive and Detailed Explanation From Exact Extract:
In the context of SRE tooling and automation, configuration management platforms like Puppet support large-scale infrastructure reliability by enabling consistency, repeatability, and automation. Puppet’s Discovery capability allows engineers to automatically locate, identify, classify, and group cloud nodes or infrastructure resources. Although not directly from Google’s SRE Book, Discovery aligns with SRE principles of reducing toil and enabling scalable automation. SRE emphasizes “automating away the manual work of locating and managing infrastructure at scale.” (SRE Book – Chapter: Eliminating Toil). Puppet Discovery does precisely this by automatically scanning environments, detecting nodes, and providing metadata to group or manage them.
Option A (Provisioning) refers to creating infrastructure, not identifying it.
Option B (Delivery) relates to CI/CD processes.
Option D (Insight) relates to analytics and reporting, not node identification.
Therefore, option C (Discovery) is correct, as it directly represents the capability described.
What is the goal of SRE?
To spend 50% of an SRE's time on operational tasks and 50% of the time on development tasks to reduce toil
To ensure that Service Level Objectives are consistently met through monitoring and observability
To create highly reliable post-deployment operational systems that align with DevOps and Agile
To create ultra-scalable and highly reliable distributed software systems
Comprehensive and Detailed Explanation From Exact Extract:
The goal of Site Reliability Engineering (SRE) is to create ultra-scalable and highly reliable distributed software systems. This principle is clearly articulated in the foundational text of SRE, the Google Site Reliability Engineering book.
From Chapter 1: Introduction of the Site Reliability Engineering book:
"SRE is what happens when you ask a software engineer to design an operations team. Our approach to service management is rooted in our belief that engineering work to create scalable and highly reliable systems is critical to the success of modern software."
— Site Reliability Engineering Book, Chapter 1
This statement establishes that building and maintaining scalable, reliable systems is the core mission of SRE. While concepts like reducing toil (option A), implementing SLOs (option B), and aligning with DevOps (option C) are vital components of the SRE practice, they support the overarching goal — which is option D.
Therefore, the correct answer is D: To create ultra-scalable and highly reliable distributed software systems.
Identify the defense-in-depth (DiD) layer where data flows in from, and out to, other networks, including the Internet.
Host layer
Physical layer
Perimeter layer
Data layer
Comprehensive and Detailed Explanation From Exact Extract:
Defense-in-Depth (DiD) is a layered security strategy referenced in SRE’s discussions of secure infrastructure and resilience. The perimeter layer is responsible for controlling and monitoring traffic flowing into and out of the network from external sources, such as the public Internet. This includes firewalls, intrusion detection systems, load balancers, and boundary network controls.
While SRE focuses primarily on reliability, the SRE Book stresses the importance of resilient system boundaries: “Perimeter protections are critical where external traffic enters the system.” (SRE Book – Security and Infrastructure considerations).
Option C correctly identifies the Perimeter Layer as the network boundary where data flows in/out from other networks—including the Internet.
Option A (Host layer) secures individual machines.
Option B (Physical layer) refers to hardware, power, racks, etc.
Option D (Data layer) protects stored data, not ingress/egress traffic.
Thus, C is correct.
Identify the missing word(s) in the following sentence:
Site reliability engineering is a _________ approach to IT operations.
structural engineering
security engineering
software engineering
simulation engineering
Comprehensive and Detailed Explanation From Exact Extract:
Google’s SRE definition is explicit: “Site Reliability Engineering is what happens when you ask a software engineer to design an operations team.” (SRE Book – Introduction). This clearly defines SRE as a software engineering approach applied to operational problems. The goal is to use software techniques—automation, coding, testing, version control, CI/CD, observability—to improve reliability and reduce toil. The book emphasizes: “SRE applies software engineering to operations work.” (SRE Book – What Is SRE?).
Option C is the only answer fully aligned with the official definition.
Options A, B, and D do not correspond to the SRE definition provided by Google.
Thus, the correct missing phrase is software engineering.
In a safety culture, engineers are allowed to do more with the production environment without fear of repercussions.
What else do engineers need to do?
Share production incidents on social media
Be accountable for their actions
Skip all blameless post-mortems
Avoid being on-call
Comprehensive and Detailed Explanation From Exact Extract:
In a safety culture, SRE emphasizes psychological safety so engineers can work effectively in production without fear of blame. However, safety never removes accountability. Engineers must take responsibility for their actions, decisions, and assumptions, particularly during incidents.
The Site Reliability Engineering Book, Chapter “Postmortem Culture,” states:
“Blamelessness does not eliminate accountability. Individuals must still explain the context, assumptions, and reasoning behind their decisions so that the organization can learn.”
Google stresses that:
Engineers must feel safe to act and report issues
Engineers must remain responsible and accountable
Accountability enables learning, not punishment
Why other options are incorrect:
A Sharing incidents on social media violates confidentiality
C Blameless postmortems are required, not skipped
D Avoiding on-call is contrary to SRE responsibilities
Thus, B is correct.
Service Level Objectives (SLOs) are tightly related to
User experience
Management approval
Change success rate
Toil reduction
Comprehensive and Detailed Explanation From Exact Extract:
Service Level Objectives (SLOs) are directly tied to user experience, and this connection is central to the SRE philosophy. The purpose of an SLO is to define how well a service must perform to keep users satisfied, without exceeding what is necessary or economically practical.
The Site Reliability Engineering Book, Chapter “Service Level Objectives,” states:
“The most important directive when defining SLOs is that they must reflect the expectations and needs of the users of the service.”
Similarly, the SRE Workbook, Chapter “Implementing SLOs,” highlights:
“SLOs are a tool to measure and control the reliability as experienced by the user.”
This makes it clear that SLOs are fundamentally user-centric. They are not based on internal engineering preferences, management goals, or operational convenience.
Why the other options are incorrect:
B. Management approval — SLOs are not driven by management goals but by user needs.
C. Change success rate — While related to reliability practices, change success is not the basis of SLO creation.
D. Toil reduction — Toil is unrelated to defining service-level targets.
Therefore, the correct answer is A.
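A user-centric SLO is usually checked against an SLI computed from what users actually experienced. The sketch below assumes a request-based SLI (good events divided by total events). The function names and the definition of a "good" event are illustrative assumptions.

```python
def sli(good_events: int, total_events: int) -> float:
    """Fraction of user-facing events that met the success criterion."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def meets_slo(good_events: int, total_events: int, slo: float) -> bool:
    """True when the measured SLI is at or above the target."""
    return sli(good_events, total_events) >= slo
```

What counts as a good event (for example, a successful response served within a latency threshold) should be chosen from the user's point of view, which is the link between SLOs and user experience.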
Which of these approaches can alleviate linear scaling toil?
Manual scaling of services
Using auto-scaling capabilities
Outsourcing development
Switching cloud providers
Comprehensive and Detailed Explanation From Exact Extract:
Linear-scaling toil refers to work whose effort increases proportionally to service growth, such as manually provisioning servers or handling capacity expansion. The Google SRE Book, Chapter “Eliminating Toil,” explains:
“Toil is work that scales linearly with the size of your service. A core strategy for reducing toil is to introduce automation that breaks the linear relationship.”
Auto-scaling capabilities directly address linear-scaling toil by automating resource allocation based on load or demand. This prevents engineers from repeatedly and manually adjusting infrastructure as usage grows.
The SRE Workbook also emphasizes:
“Infrastructure automation such as auto-scaling removes a major source of linear scaling toil by ensuring that capacity adjusts automatically as services grow.”
Why the other options are incorrect:
A Manual scaling is linear-scaling toil, not a solution.
C Outsourcing development does not reduce operational toil.
D Switching cloud providers alone does not solve toil unless automation is introduced.
Thus, B is the correct answer.
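As a sketch of how auto-scaling breaks the linear relationship, the function below applies a proportional scaling rule, similar in spirit to the one documented for the Kubernetes Horizontal Pod Autoscaler. The function name, parameters, and bounds are illustrative assumptions rather than any platform's actual API.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Scale the replica count in proportion to load, within fixed bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))
```

With four replicas at 90% average utilization and a 60% target, the rule asks for six replicas. No engineer has to notice the load and resize the service by hand, so the effort no longer grows with demand.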