# Performance Mode
DataStore has two compatibility modes that control whether output is shaped for pandas compatibility or optimized for raw SQL performance.
## Overview
| Mode | `compat_mode` value | Description |
|---|---|---|
| Pandas (default) | `"pandas"` | Full pandas behavior compatibility: row order preserved, MultiIndex, `set_index`, dtype corrections, stable sort tiebreakers, `-If`/`isNaN` wrappers. |
| Performance | `"performance"` | SQL-first execution with all pandas compatibility overhead removed. Maximum throughput, but results may differ structurally from pandas. |
## What Performance Mode Disables
| Overhead | Pandas mode behavior | Performance mode behavior |
|---|---|---|
| Row-order preservation | `_row_id` injection, `rowNumberInAllBlocks()`, `__orig_row_num__` subqueries | Disabled; row order not guaranteed |
| Stable sort tiebreaker | `rowNumberInAllBlocks() ASC` appended to `ORDER BY` | Disabled; ties may have arbitrary order |
| Parquet `preserve_order` | `input_format_parquet_preserve_order=1` | Disabled; parallel Parquet reading allowed |
| GroupBy auto `ORDER BY` | `ORDER BY group_key` added (pandas default `sort=True`) | Disabled; groups returned in arbitrary order |
| GroupBy dropna `WHERE` | `WHERE key IS NOT NULL` added (pandas default `dropna=True`) | Disabled; NULL groups included |
| GroupBy `set_index` | Group keys set as index | Disabled; group keys stay as columns |
| MultiIndex columns | `agg({'col': ['sum','mean']})` returns MultiIndex columns | Disabled; flat column names (`col_sum`, `col_mean`) |
| `-If`/`isNaN` wrappers | `sumIf(col, NOT isNaN(col))` for `skipna` | Disabled; plain `sum(col)` (ClickHouse natively skips NULL) |
| `toInt64` on count | `toInt64(count())` to match pandas int64 | Disabled; native SQL dtype returned |
| `fillna(0)` for all-NaN sum | Sum of all-NaN returns 0 (pandas behavior) | Disabled; returns NULL |
| Dtype corrections | `abs()` unsigned→signed, etc. | Disabled; native SQL dtypes |
| Index preservation | Restores original index after SQL execution | Disabled |
| `first()`/`last()` | `argMin`/`argMax(col, rowNumberInAllBlocks())` | `any(col)` / `anyLast(col)`; faster but non-deterministic |
| Single-SQL aggregation | ColumnExpr groupby materializes an intermediate DataFrame | Injects `LazyGroupByAgg` into the lazy ops chain; single SQL query |
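Several of these rows show up at once in a multi-function aggregation. The sketch below is illustrative: it assumes `ds` is a DataStore with `cat` and `val` columns, created through the pandas-compatible `pd` namespace used elsewhere on this page.

```python
from chdb.datastore.config import config

config.use_performance_mode()

ds = pd.read_csv("sales.csv")  # hypothetical file with 'cat' and 'val' columns

# Pandas mode would return MultiIndex columns ('val', 'sum') / ('val', 'mean')
# with 'cat' as the index. Performance mode returns flat column names and
# keeps the group key as a regular column, in arbitrary group order.
result = ds.groupby("cat").agg({"val": ["sum", "mean"]})
print(result.columns)  # e.g. ['cat', 'val_sum', 'val_mean'], not a MultiIndex
```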
## Enabling Performance Mode

### Using the config object
```python
from chdb.datastore.config import config

# Enable performance mode
config.use_performance_mode()

# Back to pandas compatibility
config.use_pandas_compat()

# Check the current mode
print(config.compat_mode)  # 'pandas' or 'performance'
```
### Using module-level functions
```python
from chdb.datastore.config import set_compat_mode, CompatMode, is_performance_mode

# Enable performance mode
set_compat_mode(CompatMode.PERFORMANCE)

# Check
print(is_performance_mode())  # True

# Back to the default
set_compat_mode(CompatMode.PANDAS)
```
### Using convenience imports
```python
from chdb import use_performance_mode, use_pandas_compat

use_performance_mode()
# ... high-performance operations ...
use_pandas_compat()
```
Setting performance mode automatically sets the execution engine to `chdb`; you do not need to call `config.use_chdb()` separately.
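You can read both settings back to confirm the implicit switch. This is a sketch that assumes the config object exposes the active engine as `config.execution_engine`, mirroring the configuration key listed in the comparison table later on this page:

```python
from chdb.datastore.config import config

config.use_performance_mode()

# Performance mode flips the engine to chDB automatically
print(config.compat_mode)       # 'performance'
print(config.execution_engine)  # 'chdb' (attribute name assumed; see the config table below)
```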
## When to Use Performance Mode
Use performance mode when:
- Processing large datasets (hundreds of thousands to millions of rows)
- Running aggregation-heavy workloads (groupby, sum, mean, count)
- Row order does not matter (e.g., aggregated results, reports, dashboards)
- You want maximum SQL throughput and minimal overhead
- Memory usage is a concern (parallel Parquet reading, no intermediate DataFrames)
Stay in pandas mode when:
- You need exact pandas behavior (row order, MultiIndex, dtypes)
- You rely on `first()`/`last()` returning the true first/last row
- You use `shift()`, `diff()`, or `cumsum()`, which depend on row order (see the sketch below)
- You're writing tests that compare DataStore output with pandas
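For the order-dependent cases, a minimal sketch, assuming `ds` is an existing DataStore with a `value` column:

```python
from chdb.datastore.config import config

# shift()/diff()/cumsum() read neighboring rows, so they need the
# row-order guarantees that only pandas mode provides
config.use_pandas_compat()

deltas = ds["value"].diff()     # row-to-row change; order-dependent
running = ds["value"].cumsum()  # running total; order-dependent
```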
## Behavior Differences

### Row Order
In performance mode, row order is not guaranteed for any operation. This includes:
- Filter results
- GroupBy aggregation results
- `head()` / `tail()` without an explicit `sort_values()`
- `first()` / `last()` aggregations

If you need ordered results, add an explicit `sort_values()`:
```python
config.use_performance_mode()

ds = pd.read_csv("data.csv")

# Unordered (fast)
result = ds.groupby("region")["revenue"].sum()

# Ordered (still fast, just adds ORDER BY)
result = ds.groupby("region")["revenue"].sum().sort_values()
```
### GroupBy Results
| Aspect | Pandas mode | Performance mode |
|---|---|---|
| Group key location | Index (via `set_index`) | Regular column |
| Group order | Sorted by key (default) | Arbitrary order |
| NULL groups | Excluded (default `dropna=True`) | Included |
| Column format | MultiIndex for multi-agg | Flat names (`col_func`) |
| `first()`/`last()` | Deterministic (row order) | Non-deterministic (`any()`/`anyLast()`) |
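A short sketch of these differences in practice, reusing the explicit-sort pattern from the Row Order section (assumes `ds` has `region` and `revenue` columns):

```python
from chdb.datastore.config import config

config.use_performance_mode()

# 'region' stays a regular column (no set_index), groups arrive in
# arbitrary order, and NULL regions form their own group
result = ds.groupby("region")["revenue"].sum()

# When downstream code expects ordered groups, sort explicitly
ordered = ds.groupby("region")["revenue"].sum().sort_values()
```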
### Aggregation
```python
config.use_performance_mode()

# Sum of an all-NaN group returns NULL (not 0)
# Count returns the native uint64 (not forced to int64)
# No -If wrappers: sum() instead of sumIf()
result = ds.groupby("cat")["val"].sum()
```
### Single-SQL Execution

In performance mode, ColumnExpr groupby aggregation (e.g., `ds[condition].groupby('col')['val'].sum()`) is executed as a single SQL query instead of the two-step process used in pandas mode:
```python
config.use_performance_mode()

# Pandas mode: two SQL queries (filter → materialize → groupby)
# Performance mode: one SQL query (WHERE + GROUP BY in the same query)
result = ds[ds["rating"] > 3.5].groupby("category")["revenue"].sum()

# Generated SQL (single query):
# SELECT category, sum(revenue) FROM data WHERE rating > 3.5 GROUP BY category
```
This eliminates the intermediate DataFrame materialization and can significantly reduce memory usage and execution time.
## Comparison with the Execution Engine

Performance mode (`compat_mode`) and the execution engine (`execution_engine`) are independent configuration axes:
| Config | Controls | Values |
|---|---|---|
| `execution_engine` | Which engine runs the computation | `auto`, `chdb`, `pandas` |
| `compat_mode` | Whether to reshape output for pandas compatibility | `pandas`, `performance` |
Setting `compat_mode='performance'` automatically sets `execution_engine='chdb'`, since performance mode is designed for SQL execution.
```python
from chdb.datastore.config import config

# These settings are independent
config.use_chdb()              # Force the chDB engine, keep pandas compat
config.use_performance_mode()  # Force chDB + remove the pandas overhead
```
## Testing with Performance Mode

When writing tests for performance mode, results may differ from pandas in row order and structural format. Use these strategies:

### Sort-then-compare (aggregations, filters)
```python
import numpy as np

# Sort both sides by the same key before comparing values
ds_result = ds.groupby("cat")["val"].sum()
pd_result = pd_df.groupby("cat")["val"].sum()
ds_sorted = ds_result.sort_index()
pd_sorted = pd_result.sort_index()
np.testing.assert_array_equal(ds_sorted.values, pd_sorted.values)
```
### Value-range check (first/last)
```python
# first() with any() returns an arbitrary element from each group,
# so assert membership in the group rather than an exact value
result = ds.groupby("cat")["val"].first()
group_values = pd_df.groupby("cat")["val"].apply(set)
for group_key, values in group_values.items():
    assert result.loc[group_key] in values
```
### Schema-and-count (LIMIT without ORDER BY)
```python
# head() without sort_values(): the row set is non-deterministic,
# so check the shape and schema instead of exact rows
result = ds.head(5)
assert len(result) == 5
assert set(result.columns) == expected_columns  # expected_columns: your expected schema
```
## Best Practices

- **Enable early in your script**
  ```python
  from chdb.datastore.config import config

  config.use_performance_mode()

  # All subsequent operations benefit
  ds = pd.read_parquet("data.parquet")
  result = ds[ds["amount"] > 100].groupby("region")["amount"].sum()
  ```
- **Add explicit sorting when order matters**
  ```python
  # For display or downstream processing that expects order
  result = (ds
      .groupby("region")["revenue"].sum()
      .sort_values(ascending=False)
  )
  ```
- **Use performance mode for batch/ETL workloads**
  ```python
  config.use_performance_mode()

  # ETL pipeline: order doesn't matter, throughput does
  summary = (ds
      .filter(ds["date"] >= "2024-01-01")
      .groupby(["region", "product"])
      .agg({"revenue": "sum", "quantity": "sum", "rating": "mean"})
  )
  summary.to_df().to_parquet("summary.parquet")
  ```
- **Switch modes within a session**
  ```python
  # Performance mode for heavy computation
  config.use_performance_mode()
  aggregated = ds.groupby("cat")["val"].sum()

  # Back to pandas mode for exact-match comparison
  config.use_pandas_compat()
  detailed = ds[ds["val"] > 100].head(10)
  ```