Technical Program Management: Why We Started a TPM Team

The role of Technical Program Manager, once a niche title found mostly at large, established companies, has become considerably more mainstream over the past several years. Companies of all sizes have begun to recognize the value that this unique role can add to their Engineering organization, especially as their team matures and their technical and organizational complexity skyrockets. The list of well-known, high-performing software companies currently leveraging TPMs is a long one and includes the likes of Google, Amazon, Slack, Square, Stripe, Lyft, and Dropbox. In this post we'll discuss why we decided it was time to introduce the TPM role at Fullstory, our specific brand of TPM, and some of the results we've seen since starting the team in February 2020.

Picture It: Fullstory, 2019

We first started seriously discussing the idea of TPM at Fullstory during the second half of 2019. At that time, there were a handful of motivating factors at play.

Our Product Engineering team was growing in every conceivable way. For one thing, there were more people! I joined Fullstory in April 2018 when there were ~20 product engineers. By Q4 2019, that number exceeded 50. But while the influx of new coworkers was perhaps the most obvious sign of growth, it was actually the less-tangible aspects of our organization that were evolving the fastest. Our product and platform were changing to support new use cases and customer types. Our backend infrastructure was scaling to handle billions of requests per day. Our historically flat and fluid org structure became untenable, and it gave way to a more traditional model of teams and sub-teams. Work naturally became a bit more siloed, and projects became inherently cross-functional. And we were collaborating more closely with our counterparts outside of Engineering. It became clear that our technical and organizational complexity was only going to keep increasing, and we needed to counterbalance those forces if we wanted to maintain our velocity and continue shipping high-quality, secure, and performant software to our customers.

This brought us to the concept of technical "programs." For our purposes, a program is a set of activities or processes that contribute toward some shared business outcome. Things like Availability, Cost Optimization, Continuous Delivery, Incident Response, and Engineering Velocity are all examples of technical programs. Since these initiatives are more infrastructural and not directly tied to hands-on product development, investing in them is often deprioritized. However, we recognized that executing these things well was critical to achieving our desired scale.

But while we understood the importance of making early investments in these sorts of programs, we also knew that it would be difficult to get traction on these cross-cutting initiatives within the structure of a traditional org chart. Virtually every program would require working across team boundaries (including outside of Engineering) to achieve results. For example, consider a hypothetical Availability program with the goal of ensuring that we’re always delivering a performant and reliable product experience to our customers. Of course, we’d need to work with Engineering teams to define and track availability metrics across the product. But we’d also need to collaborate with Customer Experience (how do we communicate to customers in the event of an outage?), Sales (how do we tell our high-availability story to prospects?), and Legal (how do we incorporate availability SLAs into our contracts?). From the perspective of the traditional org chart, the only “common ancestor” who would be well-positioned to coordinate this hypothetical Availability program would be someone in the C-suite.

Stakeholders for a hypothetical Availability program in a traditional org chart

This is definitely not a scalable model! It was clear that we instead needed a dedicated function that could have holistic ownership and accountability for building, running, and optimizing these programs.

Enter TPM!

The mandate of the Technical Program Manager is to build and operate high-impact, cross-functional programs that span the entire organization. TPMs have complete ownership of their programs and the associated business outcomes, and are empowered to work across teams and break down silos in order to achieve those outcomes.

Adding the TPM role unlocks new organizational dynamics that wouldn’t be possible with a traditional org chart. In our Availability example above, rather than needing a C-suite executive to coordinate between all the stakeholder teams, that responsibility can be given to the TPM.

Introducing TPM adds flexibility and removes C-suite from the critical path

While the TPM may live within Engineering from a reporting standpoint, they also have a “dotted line” to work directly with any stakeholder teams that are relevant to their programs.

An important nuance here is that TPMs don’t have managerial authority over the teams with whom they’re collaborating. While on the surface this may seem like a hindrance, this is actually a positive thing! It means that a TPM can’t simply make teams do something; instead, the TPM must rely on one of their core competencies: aligning incentives. They must deeply understand each stakeholder’s perspective and motivations, map them to the desired outcomes, and cultivate buy-in. This in turn means that the stakeholders will be more invested in the work, which sets the program up for sustainable long-term success.

Another key point is that while TPMs will often create new processes as a means to add structure to their programs, the job of a TPM is not to be a bureaucrat. In fact, good TPMs have a strong aversion to unnecessary bureaucracy, and they have a knack for determining how much process is “just enough” to add value without introducing more red tape.

TPM at Fullstory

TPM is not a “one size fits all” role. There isn't a standard job description or profile of an ideal TPM, and a TPM who is extremely successful and effective in one organization may struggle in another. While there are certainly some common traits (ability to work cross-functionally, communication skills, attention to detail, etc.), the specific skill set required of a TPM will depend heavily on the size, maturity, and composition of the organization. The types of programs they’ll be working on are also a factor, as certain domain expertise may be beneficial. The way that we’ve chosen to define and implement TPM at Fullstory may not be right for everyone, but it works well for us at our current stage.

The most distinguishing characteristic of our TPM team is that we consider the "T" in TPM to be of paramount importance. At Fullstory, a TPM’s technical acumen is their biggest force-multiplier. It’s what enables them to ask the right questions, anticipate risks, constructively challenge opinions, and contribute meaningfully to engineering discussions. We want our TPMs to have the requisite engineering background (or equivalent practical experience) to be able to get deep into the weeds of the technologies and services that underpin their programs.

If the TPM role is ultimately about achieving business outcomes, why do we place such a strong emphasis on the technical aspect? In part, it’s because the programs themselves are deeply technical and require the TPM to have foundational software engineering knowledge in order to add value.

But it’s also about self-sufficiency. With a small and nascent TPM team, success hinges on the ability to move quickly and independently. While our Engineering team has experienced tremendous growth over the past couple of years, it's still not massive, and engineers are a scarce resource. If a TPM needs to understand the inner workings of some part of our application, they could certainly ping some engineers and wait for a response. But if the TPM is able to search through the codebase and find the answer themselves, this saves time for everyone. Similarly, if a TPM wants some improved tooling or a new data pipeline to help accelerate their program work, they could request some engineering bandwidth to implement it. This would need to be prioritized against the rest of the team's roadmap, and it may or may not be completed in a timely manner. But if the TPM can just write the code themselves, then they can operate self-sufficiently, unblocking their own work while also preserving precious engineering time.

Early Results

So, how's it going so far? While our TPM team is still very small, there has been no shortage of work to do. Here’s a quick tour of some of our TPM-led programs and the results we’ve seen thus far.

Engineering Velocity

In order to continue delivering value to our customers, we must be able to ship software quickly and confidently. The Velocity program is dedicated to ensuring that our speed doesn’t suffer as we scale. One of the TPM team’s first undertakings as part of this program was to manage our transition from a monolithic deployment process to “independent deployments” of individual services. Historically, Fullstory had a single weekly deploy for all of the various microservices within our architecture. This served us well for quite a while, but as our organization matured and the number of services grew, it became increasingly painful. Teams were feeling bottlenecked by the release schedule. Iterating on the product was challenging because there could be up to a week’s delay from the time a commit was merged to when it was actually deployed into production. For these reasons and more, we ultimately decided it was time to migrate to independently deployed (ID) services.

The TPM team was responsible for making sure that all of our approximately 100 microservices were migrated on time. We built a tracking system so that the entire organization could transparently see the migration status of all services. We wrote code to improve our admin tooling to better support ID workflows. We held reviews with service owners to ensure that each service conformed to ID best practices. And we hosted a Lunch & Learn to inform the Engineering team and answer questions about the initiative. By June 2020, we had achieved 100% of our ID migration goal. We’ve already begun to see the payoff of this effort: the number of production deployments has exploded in 2020, indicating that teams are deploying their services at will and are no longer bottlenecked by the weekly release.

Increase in production deployments as a result of our Independent Deploys initiative

Cost Optimization

As a SaaS company, it’s no surprise that cloud infrastructure costs are an important business consideration for Fullstory. It’s critical that we maintain an accurate picture of our cloud expenditures and that we can predict our future spend with high confidence. The TPM team has led the charge to make cost data transparently available to product teams for the first time. In the past, there was no easy way for service owners to determine how much their services actually cost to operate. This meant that Engineering leaders were unable to take cost into account when making prioritization and tradeoff decisions. To remedy this, the TPM team implemented an ETL pipeline from BigQuery (our data warehouse) into Looker (our business intelligence tool), enabling straightforward cost visualizations on a per-service basis.

Service cost visualization in Looker
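
As a rough illustration of what a per-service rollup involves (this is not the actual Fullstory pipeline, which runs between BigQuery and Looker), here is a minimal Python sketch. The row shape, the `service` and `cost` field names, and the `unattributed` bucket are all assumptions made for the example.

```python
from collections import defaultdict


def cost_by_service(billing_rows):
    """Sum cost per service from an iterable of billing rows.

    Each row is assumed to be a dict with a numeric "cost" and an
    optional "service" label (hypothetical field names). Rows without
    a label are grouped under "unattributed" so they stay visible.
    """
    totals = defaultdict(float)
    for row in billing_rows:
        service = row.get("service") or "unattributed"
        totals[service] += row["cost"]
    return dict(totals)
```

Keeping unlabeled spend in its own bucket, rather than dropping it, is a deliberate choice in this sketch: it makes gaps in cost attribution show up in the same view as everything else.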

We also implemented Slack alerts to inform us if a particular service experiences a day-over-day, week-over-week, or month-over-month spike in costs.

Automated cost spike alerting in Slack
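
To make the alerting logic concrete, here is a minimal Python sketch of spike detection across the three lookback windows. It is an illustration under stated assumptions, not the implementation described in this post: the 25% threshold, the 1/7/30-day windows, and the function names are hypothetical, and "month-over-month" is simplified to a comparison with the same service's cost 30 days earlier. Delivering the message to Slack is left out.

```python
from datetime import date, timedelta

# Hypothetical lookback windows (in days) and threshold.
WINDOWS = {"day-over-day": 1, "week-over-week": 7, "month-over-month": 30}
SPIKE_THRESHOLD = 0.25  # flag growth of more than 25%


def detect_spikes(daily_cost, today, threshold=SPIKE_THRESHOLD):
    """Return (window_name, fractional_increase) pairs above the threshold.

    daily_cost maps a date to that day's total cost for one service.
    """
    spikes = []
    current = daily_cost.get(today)
    if current is None:
        return spikes
    for name, days in WINDOWS.items():
        previous = daily_cost.get(today - timedelta(days=days))
        if not previous:  # skip missing or zero baselines
            continue
        increase = (current - previous) / previous
        if increase > threshold:
            spikes.append((name, increase))
    return spikes


def format_alert(service, spikes):
    """Build the text of an alert message for one service."""
    lines = [f"Cost spike detected for {service}:"]
    for name, increase in spikes:
        lines.append(f"- {name}: +{increase:.0%}")
    return "\n".join(lines)
```

A relative threshold keeps the check meaningful for both cheap and expensive services, though a very small baseline can still produce noisy alerts; a minimum absolute cost would be a reasonable addition.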

This newfound visibility has helped us validate the impact of recent cost-reduction measures, identify misbehaving services, and make more confident commitments about our future cloud spend.

Availability

Our customers must be able to trust that all parts of the Fullstory platform are consistently available and performant. But it’s impossible for us to engineer our systems to be “highly available” if we haven’t first defined what availability means to us. We also need both the technical mechanisms and the organizational processes to track our availability and identify areas for improvement. To get us there, the TPM team took charge of bootstrapping our Availability program.

First and foremost, we decided that we needed to take a customer-centric view of availability. Fullstory is available only when we’re meeting our customers’ expectations about the performance of the platform, regardless of our internal opinions. Customers also don’t necessarily care about the specifics of our backend architecture. Therefore, rather than defining availability at a microservice level, we instead defined five availability “components” which represent the customer-facing surface areas of our platform.

These are:

Web Recording (our ability to reliably record customer websites)
Mobile Recording (our ability to reliably record customer mobile apps)
App (the Fullstory web application)
API (the Fullstory REST API)
Processing (behind-the-scenes processing of ingested customer data)

These components provide a common language that we can use to discuss availability across the organization.

From there, we had to decide how to actually measure availability for each of these components. We chose to adopt the framework of Service Level Indicators and Service Level Objectives, which is described at length in Google’s definitive book on Site Reliability Engineering. The TPM team held workshops with each component owner to formulate an appropriate set of SLIs and SLOs.

Finally, we needed to visualize and track our performance against these SLOs. We implemented per-component dashboards in Grafana which display each SLI and the corresponding SLO performance over a specified time range.

SLO dashboard for one of our availability components
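
For readers new to the terminology, here is a minimal Python sketch of how an SLI relates to an SLO. It assumes a simple ratio-style indicator (good events divided by total events), which is one common form in the SRE framework; the class, the indicator name, and the 99.9% target are hypothetical and are not the SLIs or SLOs defined in the workshops described above.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    component: str  # e.g. "API"
    sli_name: str   # hypothetical indicator name
    target: float   # required fraction of good events, e.g. 0.999


def compute_sli(good_events, total_events):
    """SLI as the fraction of good events in a time window.

    A window with no events is treated as fully available here;
    that is a choice made for this sketch, not a universal rule.
    """
    if total_events == 0:
        return 1.0
    return good_events / total_events


def is_met(slo, good_events, total_events):
    """True when the measured SLI is at or above the SLO target."""
    return compute_sli(good_events, total_events) >= slo.target
```

For example, `is_met(SLO("API", "successful_requests", 0.999), 99_950, 100_000)` evaluates to `True`, because the measured indicator (0.9995) is above the target.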

Now that this foundational work is complete, the real fun begins! We can start to hold periodic SLO reviews with component owners and ensure all of our metrics remain in the green. Our ultimate goal is that these SLOs will serve as a barometer of engineering quality and help guide some of our strategic prioritization decisions. For example, if a component’s SLOs are all being met, perhaps we can be a bit riskier with trying new things. But if we’re not meeting our customers’ expectations, then it may be time to temporarily press pause on feature development and dedicate some bandwidth to improving availability.
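
One way to turn that prioritization rule into something checkable is the "error budget" idea from the SRE framework: the SLO target implies an allowed number of bad events, and the share of that allowance still unspent indicates how much risk is affordable. The Python sketch below is an illustration only. The post does not say that error budgets are used at Fullstory, and the 25% caution threshold and the wording of the recommendations are assumptions.

```python
def error_budget_remaining(target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent).

    target is the SLO as a fraction, e.g. 0.999, and must be below 1.0,
    since a 100% target leaves no budget to measure against.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def recommend(remaining, caution_threshold=0.25):
    """Map remaining budget to a coarse prioritization hint."""
    if remaining <= 0:
        return "pause feature work and invest in availability"
    if remaining < caution_threshold:
        return "proceed carefully"
    return "room to take more risk"
```

With a 99.9% target and 50 bad events out of 100,000, roughly half the budget remains, so this sketch would suggest there is room to take more risk; at 200 bad events the budget is overspent and it would suggest pausing feature work.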

Looking Ahead

This is just the beginning of the journey for Technical Program Management at Fullstory. I’m excited to build on our existing programs, spin up brand-new ones, and help propel Fullstory Engineering into our next stage of growth.

If you have any questions or comments, feel free to reach out! You can drop me a line at ians [at] fullstory.com. Thanks for reading.