SQL vs. Python for Feature Engineering: When Each Wins

Feature engineering is where models are won and lost. It's also where ML teams waste enormous amounts of time arguing about tooling rather than building. The SQL vs. Python debate surfaces constantly, and most articles resolve it by quietly promoting whichever tool the author already uses.

This article tries to be different. Both SQL and Python are legitimate tools for feature engineering. Both have real weaknesses. The question is not which one is better — it's which one is better for a specific kind of feature in a specific team context. Getting this wrong means either a Python codebase full of rolling-window aggregations that a data engineer has to reverse-engineer, or a SQL layer stuffed with string manipulation logic that belongs in a pandas function.

Let's be precise about when each tool wins.


The False Dichotomy: It's Not "SQL or Python"

The framing of SQL versus Python is misleading from the start. Production feature pipelines at most mature ML teams use both. The interesting question is not "which language?" but "which language for which layer of the pipeline?"

A typical feature pipeline has at least two conceptually distinct phases:

  1. Feature computation: aggregating raw events into meaningful signals (rolling counts, ratios, window averages)
  2. Feature transformation: normalizing, encoding, scaling, or combining those signals before they enter a model

SQL is well-suited to the first phase. Python is well-suited to the second. Treating them as competitors conflates two different jobs.

That said, teams do sometimes have to pick one or the other, especially early on. And there are real tradeoffs in maintenance, testability, and skill distribution that matter. So let's examine each on its merits.


What SQL Excels at in Feature Engineering

Aggregations and window functions

SQL's strength in feature engineering is not a marketing claim — it's structural. Relational algebra maps directly onto the most common feature computation patterns: group-by aggregations, rolling windows, ratios between subgroups. These operations are what SQL databases were designed for.

Here is a simple rolling 7-day purchase feature in SQL, using RisingWave (a streaming SQL database) as the engine:

-- Rolling 7-day purchase count by user (RisingWave materialized view)
CREATE MATERIALIZED VIEW user_purchase_features AS
SELECT
    user_id,
    COUNT(*) AS purchase_count_7d,
    SUM(amount) AS total_spend_7d,
    AVG(amount) AS avg_order_value_7d
FROM orders
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY user_id;

The same feature in Python with pandas:

from datetime import timedelta

def compute_purchase_features(orders_df, as_of_date):
    window_start = as_of_date - timedelta(days=7)
    filtered = orders_df[
        (orders_df['created_at'] >= window_start) &
        (orders_df['created_at'] < as_of_date)
    ]
    return filtered.groupby('user_id').agg(
        purchase_count_7d=('order_id', 'count'),
        total_spend_7d=('amount', 'sum'),
        avg_order_value_7d=('amount', 'mean')
    ).reset_index()

Both produce the same result. But consider the maintenance surface:

  • The SQL version is readable by any data engineer, analyst, or backend developer. The Python version requires Python fluency.
  • The SQL version integrates naturally into a data warehouse or streaming system. The Python version needs orchestration code to schedule it, read from storage, and write results somewhere.
  • The SQL version has a declarative quality — it says what you want. The Python version describes how to compute it.

For this category of feature, SQL is the lower-friction choice once your data is already in a SQL-queryable system.

Joins and multi-table features

Features that depend on multiple tables benefit enormously from SQL. Point-in-time joins — matching user state at the moment an event occurred — are notoriously painful in Python but expressible in a few lines of SQL:

-- Point-in-time join: attach user account age at the time of each transaction
SELECT
    t.transaction_id,
    t.user_id,
    t.amount,
    t.created_at,
    (t.created_at::date - u.signup_date::date) AS account_age_days
FROM transactions t
JOIN users u ON t.user_id = u.user_id;

Reproducing this correctly in Python requires careful timestamp alignment to avoid data leakage. SQL makes the join logic explicit and reviewable.
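For comparison, a pandas sketch of the same alignment against a hypothetical user_history table that records attribute changes over time. merge_asof is the usual tool, and the sorting and direction arguments are exactly the details that are easy to get wrong:

```python
import pandas as pd

# Hypothetical frames: transactions plus a slowly-changing user history.
# merge_asof requires both frames sorted on their time columns.
transactions = pd.DataFrame({
    'transaction_id': [1, 2],
    'user_id': [10, 10],
    'created_at': pd.to_datetime(['2024-01-05', '2024-03-01']),
})
user_history = pd.DataFrame({
    'user_id': [10, 10],
    'valid_from': pd.to_datetime(['2024-01-01', '2024-02-01']),
    'tier': ['free', 'paid'],
})

# For each transaction, attach the latest user state with
# valid_from <= created_at -- direction='backward' prevents future leakage
features = pd.merge_asof(
    transactions.sort_values('created_at'),
    user_history.sort_values('valid_from'),
    left_on='created_at', right_on='valid_from',
    by='user_id', direction='backward',
)
print(features['tier'].tolist())  # ['free', 'paid']
```

Getting `direction` or the sort order wrong silently leaks future state into training data, which is why the explicit SQL join condition is easier to review.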

Streaming features

This is where SQL-based engines like RisingWave have a genuine edge over batch Python pipelines. If you need features that are continuously updated as new events arrive — fraud risk scores, real-time recommendation signals, live inventory ratios — streaming SQL lets you define the feature once and have it maintained incrementally:

-- Continuously maintained: user's ratio of failed to successful logins (last 1 hour)
CREATE MATERIALIZED VIEW login_risk_features AS
SELECT
    user_id,
    COUNT(*) FILTER (WHERE status = 'failed') AS failed_logins_1h,
    COUNT(*) AS total_logins_1h,
    COUNT(*) FILTER (WHERE status = 'failed')::float
        / NULLIF(COUNT(*), 0) AS failure_rate_1h
FROM login_events
WHERE event_time >= NOW() - INTERVAL '1 hour'
GROUP BY user_id;

Replicating this in Python means running a scheduled job, managing state across runs, and handling late-arriving data yourself. The SQL version handles all of that inside the engine.
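To make that concrete, here is a minimal sketch of the state management a Python job takes on: a toy in-process sliding window (class name illustrative; a real job would also need persistence across restarts and late-data handling):

```python
from collections import deque
from datetime import datetime, timedelta

class SlidingLoginWindow:
    """Toy in-process sliding window; real jobs also need durable state."""
    def __init__(self, window=timedelta(hours=1)):
        self.window = window
        self.events = deque()  # (event_time, user_id, status), in arrival order

    def add(self, event_time, user_id, status):
        self.events.append((event_time, user_id, status))

    def failure_rate(self, user_id, now):
        cutoff = now - self.window
        while self.events and self.events[0][0] < cutoff:  # evict expired events
            self.events.popleft()
        statuses = [s for t, u, s in self.events if u == user_id]
        if not statuses:
            return 0.0
        return sum(s == 'failed' for s in statuses) / len(statuses)

w = SlidingLoginWindow()
now = datetime(2024, 1, 1, 12, 0)
w.add(now - timedelta(minutes=90), 'u1', 'failed')   # already outside the window
w.add(now - timedelta(minutes=10), 'u1', 'failed')
w.add(now - timedelta(minutes=5),  'u1', 'success')
print(w.failure_rate('u1', now))  # 0.5
```

Everything this class does by hand -- eviction, grouping, recomputation -- the streaming engine maintains incrementally inside the materialized view.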

Auditability

A materialized view definition is a documented artifact. Anyone can read it, trace the logic, and verify that the feature means what the name implies. Python feature functions buried inside a Jupyter notebook or a library module require more effort to audit. For regulated industries — credit, insurance, healthcare — this matters.


What Python Excels at in Feature Engineering

Custom ML transformations

Once features are computed, transforming them for model consumption almost always belongs in Python. Normalization, encoding, imputation, polynomial interaction terms, custom loss-calibrated transformations — these are firmly in scikit-learn territory:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, TargetEncoder

# numeric_cols / categorical_cols are the column lists for each type;
# numeric columns get imputation + scaling, categorical columns get target encoding
feature_pipeline = ColumnTransformer([
    ('numeric', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler())
    ]), numeric_cols),
    ('encode', TargetEncoder(target_type='continuous'), categorical_cols)
])

X_train_transformed = feature_pipeline.fit_transform(X_train, y_train)
X_test_transformed = feature_pipeline.transform(X_test)

There is no reasonable SQL equivalent for a fitted StandardScaler or a TargetEncoder. These transformations are stateful — they depend on statistics learned from training data — and they are designed to be serialized, versioned, and applied consistently at inference time. Python's scikit-learn ecosystem handles this well. SQL does not.
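A tiny illustration of that statefulness, using pickle to carry the fitted training statistics forward to inference time:

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler().fit(X_train)  # learns mean and std from training data

blob = pickle.dumps(scaler)             # serialize the fitted state
restored = pickle.loads(blob)

# The training mean (2.0) maps to 0.0 regardless of serving-time data
print(float(restored.transform([[2.0]])[0, 0]))  # 0.0
```

The serialized artifact, not the SQL layer, is what guarantees train/serve consistency for these transformations.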

Graph and network features

User-to-user relationships, product co-purchase graphs, fraud ring detection — these require graph traversal. SQL can express some graph queries (recursive CTEs), but they are verbose and often slow on large graphs. Python with NetworkX or graph neural network libraries is far more expressive here:

import networkx as nx

def compute_user_graph_features(edges_df):
    G = nx.from_pandas_edgelist(edges_df, source='user_id', target='contact_id')
    return {
        'degree_centrality': nx.degree_centrality(G),
        'clustering_coefficient': nx.clustering(G),
        'pagerank': nx.pagerank(G)
    }

Trying to compute PageRank or clustering coefficients in SQL is technically possible but practically miserable.

Complex text and parsing logic

Parsing raw strings into structured features — extracting domain names from email addresses, classifying user agents, tokenizing product descriptions — belongs in Python. Regular expressions are better supported, and libraries like spaCy, Hugging Face transformers, and NLTK add capabilities that SQL simply does not have.

import re

def extract_email_features(email):
    # Basic shape check before parsing; malformed addresses yield null features
    if not re.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$', email):
        return {'email_domain': None, 'is_free_provider': False, 'email_tld': None}
    domain = email.split('@')[-1].lower()
    is_free_provider = domain in {'gmail.com', 'yahoo.com', 'hotmail.com'}
    tld = domain.split('.')[-1]
    return {
        'email_domain': domain,
        'is_free_provider': is_free_provider,
        'email_tld': tld
    }

Rapid prototyping

A Jupyter notebook with pandas lets an ML engineer explore hypotheses, visualize feature distributions, and test correlations in minutes. SQL requires at minimum a connection to a query engine, and iterative exploration in SQL lacks the rich visualization and introspection tools available in the Python data science stack. For research and experimentation, Python wins on iteration speed.
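As a toy illustration of that iteration loop, sketching and inspecting a hypothetical "spikiness" feature takes a few lines:

```python
import pandas as pd

# Toy data standing in for a real orders table
orders = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'amount':  [10.0, 30.0, 5.0, 5.0, 80.0],
})

# Hypothesize a feature: how spiky is each user's largest order vs. their average?
candidate = orders.groupby('user_id')['amount'].agg(['mean', 'max'])
candidate['peak_ratio'] = candidate['max'] / candidate['mean']

print(candidate['peak_ratio'].describe())  # inspect the distribution immediately
```

In a notebook, the next cell can plot it, correlate it with the label, and discard or keep it -- a loop measured in seconds, not deploy cycles.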


The Decision Framework: A Concrete Checklist

Use this when deciding where a feature belongs:

Use SQL when:

  • The feature is a count, sum, average, ratio, or percentile over a time window or group
  • The feature joins two or more tables together
  • The feature needs to be continuously updated as new data arrives (streaming)
  • Data engineers or analysts need to review or maintain the feature logic
  • The feature needs to be auditable (regulated industry, model governance)
  • The feature will be served at high QPS from a low-latency store

Use Python when:

  • The feature requires a fitted transformation (scaler, encoder, imputer) trained on historical data
  • The feature depends on graph structure or traversal
  • The feature involves NLP, embeddings, or custom model inference
  • You are exploring or prototyping and need fast iteration
  • The feature uses a library with no SQL equivalent (scipy, networkx, transformers)
  • The transformation logic is complex enough that readability matters more than accessibility to non-Python engineers

The gray zone: Percentile calculations, histogram features, and some string features can be done reasonably in either tool. Here, team skill distribution should break the tie. If your team has strong SQL coverage, keep it in SQL. If Python is dominant, keep it there. Consistency within a team often matters more than the theoretically optimal choice.


The Hybrid Architecture: SQL for Streaming Computation, Python for Training

The most effective pattern most teams converge on is a two-layer pipeline:

Layer 1 — SQL for feature computation (streaming)

A streaming SQL engine like RisingWave, or a batch SQL engine like Spark SQL or dbt on a warehouse, handles the computation-heavy work: rolling aggregations, joins, filtering. Results are materialized to a feature store or a serving database. This layer is maintained by data engineers and ML engineers with SQL skills.

-- Materialized in RisingWave, served from Redis or Postgres
CREATE MATERIALIZED VIEW user_risk_features AS
SELECT
    user_id,
    COUNT(*) FILTER (WHERE status = 'declined') AS declined_txns_24h,
    MAX(amount) AS max_txn_amount_24h,
    COUNT(DISTINCT merchant_id) AS distinct_merchants_24h,
    AVG(amount) AS avg_txn_amount_24h
FROM transactions
WHERE txn_time >= NOW() - INTERVAL '24 hours'
GROUP BY user_id;

Layer 2 — Python for feature transformation and model training

A Python training pipeline reads features from the feature store, applies transformations (scaling, encoding, imputation), and trains the model. The fitted transformation pipeline is serialized and used at inference time.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Read pre-computed features from the feature store
features = pd.read_sql("SELECT * FROM user_risk_features", conn)
labels = pd.read_sql("SELECT user_id, is_fraud FROM ground_truth", conn)
df = features.merge(labels, on='user_id')

X = df.drop(columns=['user_id', 'is_fraud'])
y = df['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('clf', GradientBoostingClassifier())
])

model_pipeline.fit(X_train, y_train)

This architecture keeps each layer doing what it is good at. The SQL layer is fast, scalable, auditable, and incrementally maintained. The Python layer has full access to the ML ecosystem and handles stateful transformations correctly.


Team Dynamics: Who Writes and Maintains What

Tooling decisions are also people decisions. When you push feature logic into SQL, you are implicitly saying that data engineers and analysts can contribute to the feature pipeline. That is often a good thing — they understand the data better than ML engineers do, and they can catch modeling errors early.

When you push feature logic into Python, you limit the contributor pool to people who are comfortable with pandas and Python data structures. That may be appropriate for complex transformations, but it creates a bottleneck if applied to simple aggregations that data engineers could own easily.

A practical split for team organization:

Feature type                              Owner              Tool
Rolling aggregations, counts, ratios      Data engineer      SQL
Join-based features, multi-table logic    Data engineer      SQL
Streaming features                        Data/ML engineer   Streaming SQL
Normalization, encoding, imputation       ML engineer        Python (sklearn)
Experimental features                     ML researcher      Python (pandas/notebook)
Graph features                            ML engineer        Python (networkx)
NLP/embedding features                    ML engineer        Python (transformers)

Code review gets easier when the tooling matches the expertise. Data engineers reviewing SQL PRs catch data quality issues. ML engineers reviewing Python PRs catch modeling issues. When these concerns are intermixed in a single Python codebase, review quality degrades.


Migration Path: Moving Python Features to SQL

If your team started in Python and wants to move some features to SQL, a gradual approach reduces risk.

Step 1: Identify candidates

Features that are pure aggregations over a time window with no custom transformation logic are the best candidates. Look for pandas groupby + agg patterns over date-filtered data. These translate directly to SQL window aggregations.
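The pattern to look for, sketched on toy data (column names hypothetical): a date filter followed by groupby/agg and nothing else. Each line maps one-to-one onto a SQL WHERE clause, GROUP BY, and aggregate:

```python
import pandas as pd

# Toy stand-in for the raw orders table
orders_df = pd.DataFrame({
    'user_id':    [1, 1, 2],
    'order_id':   [100, 101, 102],
    'amount':     [10.0, 30.0, 5.0],
    'created_at': pd.to_datetime(['2024-01-10', '2024-01-20', '2023-06-01']),
})
cutoff = pd.Timestamp('2024-01-01')

# Date filter -> WHERE; groupby -> GROUP BY; agg -> COUNT(*) / SUM(amount)
recent = orders_df[orders_df['created_at'] >= cutoff]
candidate = recent.groupby('user_id').agg(
    orders=('order_id', 'count'),
    spend=('amount', 'sum'),
).reset_index()
print(candidate)
```

If the function body contains anything beyond this shape -- custom Python logic between the filter and the aggregation -- it is a weaker migration candidate.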

Step 2: Write the SQL equivalent and validate

Run both the Python function and the SQL query over the same historical dataset. Compare outputs row-by-row. Treat any discrepancy as a bug to investigate — not always in the SQL, sometimes in the original Python.

from datetime import datetime

# Validation: compare SQL output to Python function output
sql_features = pd.read_sql("SELECT * FROM user_purchase_features", conn)
python_features = compute_purchase_features(orders_df, as_of_date=datetime.utcnow())

merged = sql_features.merge(python_features, on='user_id', suffixes=('_sql', '_py'))
discrepancies = merged[
    (merged['purchase_count_7d_sql'] - merged['purchase_count_7d_py']).abs() > 0
]
print(f"Rows with discrepancies: {len(discrepancies)}")

Step 3: Shadow mode

Run both implementations in parallel during a shadow period. Log disagreements. Only retire the Python version once you trust the SQL output.
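A minimal sketch of that disagreement logging (function name and tolerance are illustrative):

```python
import logging

logger = logging.getLogger('feature_shadow')

def shadow_compare(user_id, live_value, shadow_value, tol=1e-9):
    """Return True if the live (Python) and shadow (SQL) values agree; log if not."""
    if abs(live_value - shadow_value) > tol:
        logger.warning('feature mismatch user=%s live=%s shadow=%s',
                       user_id, live_value, shadow_value)
        return False
    return True

print(shadow_compare('u1', 3.0, 3.0))   # True
print(shadow_compare('u2', 3.0, 2.0))   # False, and logged
```

A day or two of zero logged mismatches under real traffic is a much stronger signal than a one-off backtest.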

Step 4: Update the training pipeline

Switch the training pipeline to read from the SQL-materialized feature table instead of calling the Python function. Retrain the model on the new feature source and validate that model performance does not degrade.

Step 5: Delete the Python code

Remove the Python feature function. The SQL definition is now the source of truth.

Not every feature is worth migrating. Leave graph features, NLP features, and complex transformation logic in Python. The goal is not to rewrite everything in SQL — it's to move the aggregation-heavy work to the layer that handles it better.


FAQ

Can I use SQL engines like dbt or Spark SQL instead of RisingWave for the SQL layer?

Yes. dbt with a warehouse (BigQuery, Snowflake, Redshift) handles batch feature computation well. Spark SQL works for large-scale batch pipelines. The key difference with a streaming SQL engine like RisingWave is latency: materialized views in RisingWave update incrementally as events arrive, giving you sub-second feature freshness. Batch SQL engines require scheduled runs, so feature freshness is bounded by the job frequency — typically minutes to hours. For fraud detection or real-time personalization, that gap matters. For offline training data, it often does not.

What about feature stores? Do they eliminate this debate?

Feature stores (Feast, Hopsworks, Tecton) manage the serving and storage layer, but they do not compute features themselves. You still need to decide whether to compute features with SQL or Python. Feature stores are an additional component on top of this architecture, not a replacement for the SQL vs. Python decision.

Is SQL harder to test than Python?

SQL unit testing is less mature than Python testing, but tools like dbt's test framework, SQLMesh, and Great Expectations have closed the gap significantly. Python functions are easier to unit test in isolation. For integration testing — does the feature produce correct values against a real dataset — both are roughly equivalent in effort.

What about Polars or DuckDB? Are those a middle ground?

Polars (Python library with Rust internals) and DuckDB (embeddable SQL engine with Python API) both offer high-performance alternatives to pandas for batch feature computation. DuckDB in particular lets you write SQL inside a Python environment, which blurs the boundary usefully. These are good options for teams that want SQL expressiveness without standing up a separate database. For streaming use cases, they do not currently match a purpose-built streaming SQL engine.


Closing Thought

The teams that waste the most time on this debate are the ones treating it as an identity question. SQL is not "the data engineering tool." Python is not "the real ML tool." They are different tools with different strengths, and the most productive ML teams use both intentionally.

If you are doing aggregations, joins, and window functions over event data — especially if freshness matters — SQL is likely the right layer. If you are fitting transformations, running graph algorithms, or prototyping new features — Python is likely the right layer. Most pipelines need both.

The goal is a feature pipeline that is maintainable, auditable, and correct. Tool choices should serve that goal, not the other way around.
