“How exactly do we define ‘uptime’?”
“How available was the FullStory platform last month?”
“Are we meeting our customers’ expectations around reliability?”
Two years ago, although we subjectively felt that FullStory was reliable, we weren't able to confidently answer these questions. But as our business matured, our product became more complex, and we began to acquire larger and larger customers, it became apparent that we needed to address availability in a holistic way if we wanted to achieve our desired scale. That’s why in early 2020, shortly after creating our Technical Program Management team, we decided to stand up a dedicated Availability program.
In this post, we’ll describe how we started and operationalized an Availability program from scratch using Site Reliability Engineering principles and the Service Level Objectives framework.
Desired Business Outcomes
As with all of our TPM programs, the first step in standing up the Availability program was defining our Desired Business Outcomes (DBOs). Written by the TPM team and agreed upon by Engineering leadership, DBOs serve as the “north star” for our programs. They help guide prioritization of individual projects within the program, and we use them as a litmus test to ensure that everything we’re working on can be tied back to key business goals.
We decided upon the following DBOs for the Availability program:
Customers can trust that all parts of the FullStory platform will be consistently available and performant.
Availability/SLOs are defined and bionically tracked across all components.
Availability is used as a barometer of our engineering quality and influences strategic planning/prioritization decisions.
We can tell the "high availability" story publicly with the numbers to back it up.
With our Desired Business Outcomes in place, we could move on to the specific projects and tasks necessary to achieve those outcomes.
Choosing a framework
We knew that we needed some sort of framework that would provide us with a common language for discussing and tracking availability. Rather than reinventing the wheel, we quickly decided to standardize on the Service Level Objectives framework described in what is arguably the gold standard in the field of Site Reliability Engineering: Google's SRE book.
Google's framework includes three key, interrelated concepts: Service Level Indicators, Service Level Objectives, and Service Level Agreements.
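A Service Level Indicator (SLI) is a quantitative measure of some aspect of the service we provide, such as the percentage of requests that succeed.
A Service Level Objective (SLO) is a target that we set for a particular SLI or set of SLIs. An SLO also determines our Error Budget: the amount of unreliability (100% minus the SLO target) that we can tolerate over a given period.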
Finally, a Service Level Agreement (SLA) is a contractual agreement that dictates what happens if we fail to meet certain SLOs.
The SRE book delves into each of these concepts in greater detail.
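To make the relationship between an SLO and its Error Budget concrete, here is a minimal Python sketch. The function names and the 99.9% target are illustrative assumptions, not FullStory's actual targets or tooling.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Fraction of events (or time) allowed to fail under the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, period_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime over the period."""
    minutes_in_period = period_days * 24 * 60
    return error_budget_fraction(slo_target) * minutes_in_period


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Failed events we can still absorb before missing the SLO (negative = SLO missed)."""
    return error_budget_fraction(slo_target) * total_events - failed_events


if __name__ == "__main__":
    # Hypothetical 99.9% availability SLO over a 30-day window.
    print(round(allowed_downtime_minutes(0.999), 1), "minutes of downtime allowed")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of total unavailability before the objective is missed.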
For the initial stages of the Availability program we decided to focus on defining our most important Service Level Indicators, as well as a set of internal Service Level Objectives for which we would hold ourselves accountable. But before coming up with SLIs and SLOs, we needed to identify which pieces of the FullStory platform we wanted to measure. To achieve this, we divided FullStory into “components.”
Defining our availability “components”
A component represents a surface area of the FullStory platform from the perspective of a customer. This is an important distinction: we could’ve chosen to simply define SLIs and SLOs for every backend service within FullStory’s architecture. Instead, we framed our components from the customer’s perspective, which helps ensure we’re optimizing for metrics that are closer to the actual customer experience.
For example, the customer doesn't necessarily care that microservice XYZ experienced an increase in request latency. Microservice XYZ is an implementation detail of our system. However, the customer does care if FullStory's search functionality is sluggish or unusable, since that is directly tied to the value they’re receiving from the FullStory application.
Ultimately, we decided on the following components:
Web Recording: Our ability to record data from customers’ websites
Mobile Recording: Our ability to record data from customers’ mobile apps
API: Our suite of public-facing REST APIs
App: The FullStory web application (app.fullstory.com)
Processing: The data processing pipeline that takes raw recorded data and makes it available for analysis within FullStory
We also identified an individual owner for each component (usually the manager or Tech Lead of the most relevant team).
Choosing SLIs and SLOs
We then turned our attention to defining an initial set of SLIs and SLOs for each component. We held workshops with each component owner to identify the areas of each component that were 1) most directly tied to customer experience and 2) easily measurable. We considered the various flavors of SLI (availability, latency, freshness, durability, etc.) and discussed which ones were most applicable to the given component.
For example, for the Web Recording component, we decided upon two Service Level Indicators:
An availability SLI measuring the percentage of successful requests to our `recorder` service
A latency SLI measuring the percentage of requests completing within one second
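Both indicators are ratios of "good" requests to total requests. The sketch below shows one way to compute them from a list of request records. The `Request` type, its field names, and the choice to treat any non-5xx response as successful are assumptions for illustration; the post does not specify how success is defined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    status: int        # HTTP status code (assumed field)
    duration_s: float  # time to complete, in seconds (assumed field)


def availability_sli(requests: List[Request]) -> Optional[float]:
    """Fraction of requests that succeeded (assumption: status below 500)."""
    if not requests:
        return None  # no traffic, so the SLI is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: List[Request], threshold_s: float = 1.0) -> Optional[float]:
    """Fraction of requests that completed within threshold_s seconds."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.duration_s <= threshold_s)
    return fast / len(requests)


sample = [
    Request(status=200, duration_s=0.2),
    Request(status=200, duration_s=1.5),
    Request(status=503, duration_s=0.1),
    Request(status=204, duration_s=0.9),
]
```

Expressing both SLIs as a percentage of good events keeps them directly comparable to a percentage-based SLO.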
Choosing an appropriate Service Level Objective for each Service Level Indicator can be an art form unto itself. The SLO should ideally match the customer’s expectations of how the service should perform. If an SLO is too aggressive, it becomes unrealistic to achieve. If it is too lax, we could meet the SLO while still providing a poor customer experience. Since these were internal targets, we were able to set our initial SLOs knowing that we could iterate on them if they turned out to be too high or too low.
Operationalizing the Availability program
At this point, we were finally ready to transition from theoretical discussions to practical implementation of the Availability program.
Fortunately, most of the underlying data for our SLIs was already being captured in Prometheus, the time-series database that we use for monitoring. For the small number of SLIs that we weren’t already capturing, we worked with the relevant Engineering teams to implement Prometheus metrics for them. Then, the TPM team wrote PromQL queries to express each of our Service Level Indicators.
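The post does not show the actual queries, so the following is only a sketch of what ratio-style PromQL expressions for an availability and a latency SLI might look like, built as strings in Python. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`, `http_request_duration_seconds_count`), the `service` and `code` labels, and the `le="1"` bucket are all hypothetical; real names depend on how a service is instrumented.

```python
def availability_query(service: str, window: str = "30d") -> str:
    """PromQL ratio of non-5xx requests to all requests (hypothetical metric names)."""
    selector = f'service="{service}"'
    good = f'sum(rate(http_requests_total{{{selector},code!~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{{selector}}}[{window}]))'
    return f"{good} / {total}"


def latency_query(service: str, threshold_le: str = "1", window: str = "30d") -> str:
    """PromQL ratio of requests in the histogram bucket at or below the threshold."""
    selector = f'service="{service}"'
    fast = (
        f'sum(rate(http_request_duration_seconds_bucket'
        f'{{{selector},le="{threshold_le}"}}[{window}]))'
    )
    total = f'sum(rate(http_request_duration_seconds_count{{{selector}}}[{window}]))'
    return f"{fast} / {total}"


if __name__ == "__main__":
    print(availability_query("recorder"))
    print(latency_query("recorder"))
```

The latency expression assumes the histogram has a bucket boundary at exactly one second; without one, the "within one second" fraction cannot be read directly from the buckets.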
We use Grafana as our visualization layer on top of Prometheus. We created a Grafana dashboard for each availability component, allowing us to easily view our SLIs/SLOs in both a time-series and aggregate number format across any desired date range.
A portion of our Grafana dashboard for the Web Recording component
These dashboards were the key to enabling us to communicate about SLIs and SLOs to the broader Engineering organization. The main venues for this communication are the monthly “SLO Review” meetings that we established with component owners and other relevant stakeholders. These meetings are held at the beginning of each calendar month, and the agenda includes reviewing our SLO compliance for the previous calendar month, identifying areas for improvement, and checking in on open action items. We also set up weekly asynchronous check-ins to ensure that we are on track to meet our SLOs.
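As an illustration of the monthly review, the sketch below works out the previous calendar month's date range and checks measured SLIs against their targets. The function names, SLI names, and all numbers are hypothetical; in practice the measured values would come from the dashboards described above.

```python
from datetime import date, timedelta
from typing import Dict, Tuple


def previous_calendar_month(today: date) -> Tuple[date, date]:
    """Return the first and last day of the calendar month before `today`."""
    first_of_this_month = today.replace(day=1)
    last_of_previous = first_of_this_month - timedelta(days=1)
    return last_of_previous.replace(day=1), last_of_previous


def slo_compliance(measured: Dict[str, float], targets: Dict[str, float]) -> Dict[str, bool]:
    """Map each SLI name to True if its measured value met or exceeded the target."""
    return {name: measured[name] >= target for name, target in targets.items()}


if __name__ == "__main__":
    start, end = previous_calendar_month(date.today())
    # Hypothetical targets and measurements for a single component.
    targets = {"availability": 0.999, "latency": 0.99}
    measured = {"availability": 0.9995, "latency": 0.98}
    print(f"Review window: {start} to {end}")
    for name, met in slo_compliance(measured, targets).items():
        print(f"{name}: {'met' if met else 'missed'}")
```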
Avoiding vanity metrics
Another aspect of operationalizing this program was coming up with a mechanism to ensure that our SLIs are always measuring the right things. In other words, we wanted to guard against our SLIs becoming “vanity metrics” and make sure that they are always representative of the actual customer experience. To this end, we began mapping customer support tickets back to our SLIs. Each week, the TPM team reviews all of the support tickets we received in which a customer was inquiring about a performance or reliability issue. We determine if the issue being reported in each ticket is already measured by an existing SLI, or if there is the potential for us to modify or expand our SLIs to more comprehensively measure the customer experience.
The existence of the Availability program has enabled us to speak more confidently (both internally and externally) about the reliability of our platform, proactively identify issues that impact our SLOs, and cultivate an SRE mindset across the entire Engineering organization. We've also begun using SLO compliance to inform our risk tolerance and help us prioritize between new feature development and availability-related work. We'll continue to invest heavily in the program going forward, improving the comprehensiveness of our SLIs, ingraining availability into our Engineering culture, and delivering a best-in-class experience to customers of the FullStory platform.