Polars Integration
Engineering · 16 min read

Build a scalable feature engineering pipeline with Polars

Machine learning has revolutionized how we understand user behavior, but its success depends on meaningful data representation, not just sophisticated algorithms. In a previous post, we laid the groundwork by transforming raw Fullstory event data into structured, analyzable tables with dbt. Now, we're taking the crucial next step: feature engineering.

Think of feature engineering as translation work—we're turning users' digital body language into a format that machine learning (ML) models can understand. 

In this blog post, we’ll teach you how to convert organized event streams into powerful, ML-ready features using the high-performance Polars library and how to build a scalable pipeline that transforms every click, scroll, and pause into predictive insights you and your model can use.

The power of behavioral data in machine learning

Before we dive into the technical details of feature engineering, let's talk about why it matters. The structured Fullstory data we created in our previous post doesn’t just keep things organized; it helps you unlock powerful predictive capabilities that drive real business value.

Here's what becomes possible when you transform user behavior into machine learning-ready features:

  • User segmentation with clustering: Group users into distinct personas (e.g., "power users," "window shoppers," "at-risk customers") based on their behavior, allowing for highly targeted marketing and product experiences.

  • Marketing attribution models: Go beyond last-click attribution and understand the entire customer journey. By analyzing the sequence of events leading to a conversion, you can identify which channels and campaigns are truly driving value.

  • Churn prediction: Identify users who are showing signs of disengagement before they leave for good, giving you a chance to proactively re-engage them.

  • Fraud detection: Spot unusual patterns of behavior that might indicate fraudulent activity, protecting your business and your customers.

What makes Fullstory particularly powerful for these applications is its ability to validate behavioral patterns through session replay. Instead of making educated guesses about what certain event patterns mean, you can watch real user sessions to understand the true story behind the data.

But turning these insights into production-ready features demands a systematic approach that can scale. We need a pipeline that not only captures nuanced behavioral patterns but also handles growing data volumes efficiently while supporting rapid experimentation. To build this type of robust pipeline, we'll first need to define exactly what we mean by a “behavioral feature” and what makes it valuable for machine learning.

What is a feature? 

All of the predictive applications mentioned above depend on a crucial foundation: well-crafted features that accurately represent user behavior. But what exactly do we mean by a "feature" in the context of behavioral data?

A feature is a measurable, numeric representation of user behavior derived from event sequences. Think of it as transforming the story of how users interact with your product into specific signals that machine learning models can understand. These signals can take many forms:

  • Simple presence or absence of key actions (did the user visit the pricing page?)

  • Counts and frequencies (how many times did they click the 'Save' button?)

  • Temporal patterns (how long between starting a task and completing it?)

  • Interaction sequences (do users typically check pricing before or after viewing features?)

  • Relationships between events (does frequent search correlate with successful purchases?)

Creating meaningful features from behavioral data isn't straightforward. We need to capture complex patterns while accounting for the temporal nature of sessions and the intricate relationships between interactions. This requires both data science expertise and robust engineering. Let’s further explore these two key dimensions: the data science "what" (defining meaningful features) and the engineering "how" (building an efficient, scalable pipeline).

Dimension 1: What features do we actually need?

In our previous post, we worked with a simplified Fullstory dataset to demonstrate the basic concepts of data structuring. We'll pick up where that post left off, assuming we've already run a dbt pipeline to create the raw features table. Remember, our goal is to predict whether a user will make a purchase: a binary classification problem represented by the has_not_purchased target variable in our dataset. Here's what our dataset looks like:
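
As a rough sketch, the raw features table has a schema along these lines. The column names are illustrative assumptions; yours will match your own dbt models.

```python
# Illustrative schema of the raw features table: 269 rows x 12 columns.
# Exact column names depend on your dbt models -- these are assumptions.
RAW_FEATURE_COLUMNS = [
    "user_id",            # Fullstory user identifier
    "session_id",         # session the event belongs to
    "event_time",         # when the event occurred
    "event_type",         # e.g. "click", "navigate", "page_properties"
    "source_properties",  # nested JSON: browser, device, location, URL
    "event_properties",   # nested JSON: interaction-specific details
    "has_not_purchased",  # the binary classification target
    # ... plus a handful of additional session-level attributes
]
```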

Each event carries rich contextual information. For example, source properties tell us about the user's environment:
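
A parsed payload might look roughly like this (the field names are illustrative assumptions, not Fullstory's exact schema):

```python
# Illustrative source_properties payload (field names are assumptions).
source_properties = {
    "user_agent": {
        "browser": "Chrome",
        "operating_system": "OS X",
        "primary_device_type": "Desktop",
    },
    "location": {"country": "US", "region": "Georgia", "city": "Atlanta"},
    "url": {"path": "/pricing"},
}
```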

And event properties capture specific interactions:
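
For a click event, the payload might look like this (again, illustrative):

```python
# Illustrative event_properties payload for a click event; page-level
# events carry performance signals like CLS and DOM content load time.
event_properties = {
    "target": {"text": "Cart", "selector": "#nav-cart"},
    "cumulative_layout_shift": 0.02,
    "dom_content_loaded_time_ms": 1840,
}
```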

While this data is incredibly detailed, not all of it will help predict user behavior. To transform this rich behavioral data into purchase predictions, we'll focus on three key feature categories:

1. Numeric features: event aggregations (counts, averages, etc.)

2. Binary features: whether specific events occurred

3. Categorical features: user, session, and event attributes

We'll walk through representative examples from each category to demonstrate the feature creation process. While these examples aren't exhaustive, they'll show you the patterns you can apply to create your own custom features.

Loading raw features data

Before diving into the specific feature types listed above, we need to set up our data processing environment. We'll use Polars, a high-performance data manipulation library built in Rust, for its efficient handling of large-scale data transformations. While any data transformation library could work here (pandas, PySpark, etc.), Polars offers some unique advantages we'll explore later in this post. 

First, let's load our dependencies and prepare our dataset:
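
A minimal setup, assuming the dbt output has been exported to a local parquet file (swap in your own source, such as a warehouse query):

```python
import polars as pl

# Load the raw features table produced by the dbt pipeline from the
# previous post. The path is an assumption -- point it at your own export.
df = pl.read_parquet("data/raw_features.parquet")

print(df.shape)  # (269, 12)
```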

Defining feature extraction expressions

Now we can use Polars to extract meaningful fields from the nested JSON columns in the raw features table. Polars provides an efficient way to parse JSON strings using JSONPath syntax. Here's how it works:

  1. Select a JSON column using pl.col()

  2. Extract values using .str.json_path_match() with JSONPath syntax (where $. represents the root of the JSON string)

  3. Rename the extracted field using .alias()

Here are a few examples of expressions you might define:

User agent information
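
These expressions pull browser and device details out of source_properties (the JSONPaths assume the illustrative payload structure shown earlier):

```python
browser = (
    pl.col("source_properties")
    .str.json_path_match("$.user_agent.browser")
    .alias("browser")
)
device = (
    pl.col("source_properties")
    .str.json_path_match("$.user_agent.primary_device_type")
    .alias("device")
)
```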

Location data
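
Similarly, for the user's location:

```python
country = (
    pl.col("source_properties")
    .str.json_path_match("$.location.country")
    .alias("country")
)
city = (
    pl.col("source_properties")
    .str.json_path_match("$.location.city")
    .alias("city")
)
```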

Interaction details
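
And for interaction and performance details. Note that json_path_match returns strings, so numeric fields get a cast; strict=False turns unparseable values into nulls:

```python
click_text = (
    pl.col("event_properties")
    .str.json_path_match("$.target.text")
    .alias("click_text")
)
page_path = (
    pl.col("source_properties")
    .str.json_path_match("$.url.path")
    .alias("page_path")
)
cls = (
    pl.col("event_properties")
    .str.json_path_match("$.cumulative_layout_shift")
    .cast(pl.Float64, strict=False)
    .alias("cls")
)
dct = (
    pl.col("event_properties")
    .str.json_path_match("$.dom_content_loaded_time_ms")
    .cast(pl.Float64, strict=False)
    .alias("dct")
)
```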

Apply extractions to create feature columns

The above expressions form the building blocks for our feature engineering pipeline. Now, we can apply them to extract and transform our raw data into meaningful features. Polars' with_columns() method efficiently handles multiple column additions in a single operation:
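
A sketch of the application step, passing the expressions defined above (plus any others you've added):

```python
# Apply every extraction expression in one pass; Polars evaluates the
# expressions in parallel across columns.
df = df.with_columns(
    browser, device, country, city, click_text, page_path, cls, dct,
    # ... plus the remaining extractors (operating system, region, etc.)
)

print(df.shape)  # (269, 25)
```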

Our dataframe grew from shape (269, 12) to (269, 25), adding 13 new columns extracted from our nested JSON data. These newly extracted columns will serve as the foundation for the more complex feature engineering in the next section.

Note: Sometimes you'll need to create new extractors to access specific data points in your feature engineering pipeline. For instance, if you wanted to group US states into regions (“Northeast”, “Midwest”, etc.), you'd first need to extract location.region from source_properties before being able to apply any transformations.

Feature engineering

With our signals extracted from the raw JSON data, we can now transform them into meaningful features for our purchase prediction model. As mentioned before, we'll focus on three key feature categories: (1) numeric features, (2) binary features, and (3) categorical features.

1. Numeric features (event aggregations)

Numeric features quantify user behavior patterns within a session or time window. Common examples include:

  • Event counts (total clicks, page views)

  • Time-based metrics (session duration, time between actions)

  • Ratio metrics (error rate, conversion rate)

  • Performance metrics (scroll depth, page load times)

Let's create our first set of features using the web performance metrics Cumulative Layout Shift (CLS) and DOM Content Load Time (DCT). These metrics can significantly impact user experience and, consequently, purchase likelihood. We'll calculate the 90th percentile (p90) of these metrics for each page to identify potential performance issues:
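
A sketch of the aggregation, assuming the cls and dct columns extracted earlier:

```python
# p90 web performance metrics per page, then joined back onto events so
# every row carries its page's performance profile.
page_perf = df.group_by("page_path").agg(
    pl.col("cls").quantile(0.9).alias("cls_p90"),
    pl.col("dct").quantile(0.9).alias("dct_p90"),
)

df = df.join(page_perf, on="page_path", how="left")
```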

→ For full details on these and other performance metrics, see the Fullstory Developer Docs.

2. Binary features (event occurrence)

Binary features capture whether specific events or conditions occurred during a session. These yes/no indicators can be powerful predictors of user behavior:

  • Did the user encounter any errors?

  • Did they perform specific key actions?

  • Were they using a mobile device?

Let's create binary features from two crucial interaction types: clicks and page navigations. Understanding which elements users click and which pages they visit can reveal their intent and likelihood to purchase. 

The code below creates features like clicked_text:Cart (did they click the Cart link?) and visited_page:blog (did they navigate to the blog page?). These binary signals, especially in combination, can strongly indicate a user's purchase intent.
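
One way to build these is to start from per-user event counts (the element texts and page paths below are illustrative):

```python
# Count, per user, clicks on specific elements and visits to specific pages.
interaction_counts = df.group_by("user_id").agg(
    ((pl.col("event_type") == "click") & (pl.col("click_text") == "Cart"))
    .sum()
    .alias("clicked_text:Cart"),
    ((pl.col("event_type") == "navigate") & pl.col("page_path").str.contains("/blog"))
    .sum()
    .alias("visited_page:blog"),
)

# Collapse counts into pure binary presence/absence flags when frequency
# isn't meaningful for your model.
binary_features = interaction_counts.with_columns(
    (pl.col("clicked_text:Cart") > 0).cast(pl.Int8),
    (pl.col("visited_page:blog") > 0).cast(pl.Int8),
)
```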

These event counts give us flexibility in feature representation. We can use them as-is to capture interaction intensity (how many times did they click the pricing button?), or convert them to pure binary features (did they click the pricing button at all?). The choice depends on your modeling needs:

  • Count features preserve information about interaction frequency, which might indicate higher engagement or confusion

  • Binary features simplify the signal to presence/absence, which can be more robust when frequency isn't meaningful

For our purchase prediction model, we'll keep both representations initially and let our feature selection process determine which works better for each interaction type.

3. Categorical features (user/session/event attributes)

Categorical features capture qualitative aspects of user behavior and context. These attributes can reveal important patterns in user segments and their likelihood to purchase:

  • User properties (Paid vs. Free users, Account type)

  • Transaction attributes (Payment method, Subscription tier)

  • Session context (Entry/exit pages, Traffic source)

  • Technical context (Location, Browser, Device)

Let's transform some key categorical attributes into ML-ready features using dummy encoding. This technique creates binary columns for each category value, making qualitative data digestible for machine learning models:
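
A sketch using DataFrame.to_dummies, with ":" as the separator so the generated names match the prop_device:mobile pattern described below (the prop_ prefixes are our own naming convention):

```python
# One row of categorical context per user, renamed to give the dummy
# columns a clear "prop_" prefix.
categoricals = (
    df.select("user_id", "device", "country")
    .unique(subset="user_id")
    .rename({"device": "prop_device", "country": "prop_country"})
)

# Dummy encoding: one binary column per observed category value,
# e.g. "prop_device:mobile", "prop_country:US".
encoded = categoricals.to_dummies(
    columns=["prop_device", "prop_country"],
    separator=":",
)
```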

These transformations create features like prop_device:mobile and prop_country:US, allowing our model to learn purchase patterns specific to different user segments and contexts. For example, we might discover that mobile users from certain countries have distinct purchase behaviors, enabling more targeted optimization of their experience.

Final training dataset creation

Now we'll combine our engineered features with our target variable to create our final training dataset:
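
A sketch of the assembly, joining the user-level feature frames built above with the target:

```python
# One label per user, joined to the engineered features.
labels = df.select("user_id", "has_not_purchased").unique(subset="user_id")

training_df = (
    interaction_counts
    .join(encoded, on="user_id", how="left")
    .join(labels, on="user_id", how="left")
    .fill_null(0)  # users with no matching events/categories get zeros
)

training_df.write_parquet("data/training_features.parquet")
```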

We've now transformed raw Fullstory events into a rich set of predictive features, each capturing different aspects of user behavior. This structured dataset allows machine learning models to learn patterns that indicate purchase likelihood. While we've covered key feature engineering techniques here, remember that feature creation is an iterative process. As you understand your users better and gather more data, you can continue to refine and expand your feature set.

→ Note: While we're working with sequential behavioral data, our current approach focuses on building features for traditional binary classification models rather than sequential models. While sequential models (like RNNs or transformers) can capture temporal patterns directly, we've found that carefully engineered aggregate features combined with traditional classifiers often provide excellent results while being simpler to implement and maintain. The choice between sequential and binary classification approaches is a fascinating topic that deserves its own discussion in a future post.

Dimension 2: Engineering for scale and performance

Earlier, we broke down our approach into two key dimensions: the data science "what" (defining meaningful features) and the engineering "how" (building an efficient, scalable pipeline). Having explored the "what" through our feature engineering examples, let's now tackle the "how": the engineering challenges of implementing these features at scale.

The first major decision is choosing the right data processing framework. Python offers several powerful options: pandas for its ease of use, PySpark for distributed computing, and newer alternatives like Polars for its balance of performance and developer experience. Each has its strengths, and the right choice depends on your specific needs. Before settling on a framework, consider these aspects that can help inform your decision:

1. Pipeline stability and maintenance: Many teams struggle with maintaining different implementations for development and production environments. Running one version locally for testing and another version in production creates a maintenance burden—every feature addition or bug fix needs to be implemented twice. This dual-implementation approach not only increases development costs but also risks introducing inconsistencies between environments.

2. Configuration and agility: Feature engineering requires extensive experimentation. Hardcoding feature logic into scripts makes iteration slow and cumbersome. Teams need a flexible, configurable approach that supports rapid experimentation without requiring significant engineering effort for each new idea.

3. Standardization: Without a unified framework, teams often reinvent the wheel for each new project. This leads to inconsistent methodologies, duplicated effort, and difficulty in sharing best practices across the organization.

4. Modularity and testability: Feature engineering pipelines need to be both modular and testable. Each feature type should have its own transformation function that can be unit tested independently. This separation of concerns allows teams to maintain code quality while adding new features or modifying existing ones. Without this modularity, pipelines become monolithic and difficult to validate.

5. Observability and quality control: As feature engineering pipelines grow in complexity, monitoring becomes crucial. Teams need built-in validation and quality checks to ensure features behave as expected across environments. This includes monitoring for data drift, catching anomalies, and validating transformation outputs. Without robust observability, issues can go undetected until they impact model performance.

These challenges influenced our choice of Polars for this implementation. Let's examine how it addresses each concern:

  • Pipeline stability: Polars' Rust-based engine delivers consistent, production-grade performance across environments. Whether running locally on a laptop or in production on powerful VMs, the same code works efficiently, eliminating the need for separate implementations.

  • Configuration and agility: Polars' expressive API and lazy evaluation enable us to define feature transformations declaratively. This makes it easy to experiment with new features by modifying configurations rather than rewriting code. The lazy evaluation also helps optimize complex transformation chains automatically.

  • Standardization: Polars provides a unified approach to data transformation that works consistently at any scale. Its pandas-like API feels familiar to data scientists while offering better performance, making it easier to establish and share best practices across teams.

  • Modularity and testability: Polars' functional approach to data transformation naturally encourages modular code. Each transformation step can be isolated, tested, and composed into larger pipelines. The strong type system and clear error messages make it easier to catch issues early in development.

  • Observability: Polars includes built-in profiling tools that help monitor pipeline performance. Its efficient memory usage and clear execution plans make it easier to track transformations and identify potential issues before they affect production models.

While no single framework solves all challenges perfectly, Polars provides a solid foundation for building scalable, maintainable feature engineering pipelines.

Lessons learned

Throughout our journey of building and refining our own internal feature engineering pipeline, we've learned several valuable lessons that extend beyond technical implementation details. These insights can help guide your own feature engineering efforts, whether you're working with behavioral data or other complex datasets.

Lesson 1: Cache data strategically for development speed

Raw behavioral data is inherently large and complex. We quickly learned that querying it repeatedly during feature development was both time-consuming and expensive. Implementing a smart caching strategy proved crucial for maintaining development velocity. By processing raw data once and saving intermediate, event-filtered data as parquet files, we:

  • Dramatically reduced iteration time for data scientists

  • Lowered computational costs

  • Enabled isolated testing of specific event types

  • Improved storage efficiency through columnar compression

This approach lets our team focus on feature engineering rather than waiting for data processing. Just remember to implement a clear cache invalidation strategy to ensure you're not working with stale data.
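
As a concrete illustration, here's a minimal sketch of the pattern (the paths and helper name are our own, not a prescribed API):

```python
from pathlib import Path

import polars as pl

def load_events(event_type: str, cache_dir: Path = Path("cache")) -> pl.DataFrame:
    """Load one event type, hitting the expensive raw source only on a cache miss."""
    cache_file = cache_dir / f"{event_type}.parquet"
    if cache_file.exists():
        return pl.read_parquet(cache_file)

    # Cache miss: scan the raw source lazily, filter to the event type we
    # need, and persist the slice for every later iteration.
    events = (
        pl.scan_parquet("data/raw_features.parquet")
        .filter(pl.col("event_type") == event_type)
        .collect()
    )
    cache_dir.mkdir(parents=True, exist_ok=True)
    events.write_parquet(cache_file)
    return events
```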

Lesson 2: Build trust through rigorous testing

Data pipelines can be fragile. Small upstream changes often cascade into significant downstream issues. We learned that robust testing isn't just a nice-to-have; it's essential for maintaining confidence in your feature engineering process. Our testing strategy has evolved to include:

  • Automated data quality checks (null values, unexpected distributions, outliers)

  • Property-based testing with Hypothesis, for which Polars ships dedicated strategies (see the sketch after this list)

  • Schema validation to ensure feature consistency across pipeline runs

  • Edge case detection, particularly important with behavioral data's natural variability

  • Performance benchmarking to catch efficiency regressions
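
To make the property-based testing point concrete, here's a minimal sketch using Polars' Hypothesis strategies (the polars.testing.parametric API has shifted slightly across versions, so treat this as illustrative):

```python
import polars as pl
from hypothesis import given
from polars.testing.parametric import column, dataframes

# Property: a p90 aggregation must lie between the min and max of its
# (finite) inputs, no matter what data Hypothesis generates.
@given(dataframes(cols=[column("cls", dtype=pl.Float64)], min_size=1))
def test_p90_is_bounded(df: pl.DataFrame) -> None:
    finite = df.filter(pl.col("cls").is_finite())
    if finite.height == 0:
        return  # all-null or non-finite input: nothing to assert
    p90 = finite.select(pl.col("cls").quantile(0.9)).item()
    assert finite["cls"].min() <= p90 <= finite["cls"].max()
```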

Fullstory's session replay capability proved particularly valuable here, allowing us to validate unusual patterns by watching the actual user sessions that generated them. This combination of automated testing and manual validation helped us catch issues early and build trust in our pipeline's output.

Lesson 3: Prioritize developer experience

While much attention goes to computational performance, we discovered that developer experience acts as a powerful multiplier for team productivity. A feature engineering pipeline is only as good as its usability. The easier it is for team members to experiment, debug, and deploy features, the more value you'll get from your investment.

Key aspects that amplified our team's effectiveness:

  • Clear, consistent API design that feels natural to both data scientists and engineers

  • Comprehensive documentation that explains not just how, but why

  • Local development tooling that mirrors the production environment

  • Quick feedback loops for rapid iteration and learning

  • Simplified debugging processes with clear error messages

  • Single unified pipeline that reduces cognitive load and maintenance overhead

This focus on developer ergonomics paid unexpected dividends: team members were more likely to experiment with new features, bugs were caught earlier, and knowledge sharing improved naturally. What might seem like "nice-to-have" improvements often turned out to be critical productivity multipliers.

Lesson 4: Design for evolution

Perhaps our most important insight was that feature engineering pipelines need to be designed for change. As your understanding of user behavior grows and business needs evolve, your feature set will need to adapt. By building modular components that could be easily modified or replaced, we created a system that could evolve with our needs.

Best practices that enabled this flexibility:

  • Feature-specific transformation modules that can be modified independently

  • Configuration-driven feature definitions for easy experimentation

  • Clear interfaces between pipeline stages to minimize coupling

  • Versioned feature sets to track changes and ensure reproducibility

These lessons—strategic caching, rigorous testing, developer experience, and modular design—reflect a broader truth about feature engineering: success isn't just about the technical implementation, it's about building sustainable systems that empower teams to iterate and improve continuously.

What’s next?

Building a robust feature engineering pipeline for behavioral data requires balancing technical capabilities with practical needs. 

Whether you're just starting with behavioral modeling or scaling existing efforts, remember that successful feature engineering is more than creating predictive signals. It's about building a sustainable foundation for data-driven decision making. With the right tools and rich behavioral data, your pipeline can unlock insights hidden within millions of user interactions, revealing patterns that drive real business value.

→ Ready to build your own behavioral feature engineering pipeline?

Whether you're just starting with behavioral analytics or scaling your machine learning efforts, we're here to help you unlock the full potential of your behavioral data. Let's build something powerful together.

Author

The Fullstory Data Science Team