# Execution Engine Configuration
DataStore can execute operations using different backends. This guide explains how to configure and optimize engine selection.
## Available Engines

| Engine | Description | Best For |
|---|---|---|
| `auto` | Automatically selects the best engine per operation | General use (default) |
| `chdb` | Forces all operations through ClickHouse SQL | Large datasets, aggregations |
| `pandas` | Forces all operations through pandas | Compatibility testing, pandas-specific features |
## Setting the Engine

### Global Configuration

```python
from chdb.datastore.config import config

# Option 1: Using the set method
config.set_execution_engine('auto')    # Default
config.set_execution_engine('chdb')    # Force ClickHouse
config.set_execution_engine('pandas')  # Force pandas

# Option 2: Using shortcuts
config.use_auto()    # Auto-select
config.use_chdb()    # Force ClickHouse
config.use_pandas()  # Force pandas
```
### Checking the Current Engine

```python
print(config.execution_engine)  # 'auto', 'chdb', or 'pandas'
```
## Auto Mode

In auto mode (the default), DataStore selects the optimal engine for each operation.

### Operations Executed in chDB

- SQL-compatible filtering (`filter()`, `where()`)
- Column selection (`select()`)
- Sorting (`sort()`, `orderby()`)
- Grouping and aggregation (`groupby().agg()`)
- Joins (`join()`, `merge()`)
- Distinct (`distinct()`, `drop_duplicates()`)
- Limiting (`limit()`, `head()`, `tail()`)

### Operations Executed in pandas

- Custom apply functions (`apply(custom_func)`)
- Complex pivot tables with custom aggregations
- Operations not expressible in SQL
- Operations whose input is already a pandas DataFrame
### Example

```python
from chdb import datastore as pd
from chdb.datastore.config import config

config.use_auto()  # Default

ds = pd.read_csv("data.csv")

# This uses chDB (SQL)
result = (ds
    .filter(ds['amount'] > 100)  # SQL: WHERE
    .groupby('region')           # SQL: GROUP BY
    .agg({'amount': 'sum'})      # SQL: SUM()
)

# This uses pandas (custom function)
result = ds.apply(lambda row: complex_calculation(row), axis=1)
```
## chDB Mode

Force all operations through ClickHouse SQL with `config.use_chdb()`.

### When to Use

- Processing large datasets (millions of rows)
- Heavy aggregation workloads
- When you want maximum SQL optimization
- Consistent behavior across all operations

| Operation Type | Performance |
|---|---|
| GroupBy/Aggregation | Excellent (up to 20x faster) |
| Complex Filtering | Excellent |
| Sorting | Very Good |
| Simple Single Filters | Good (slight overhead) |

### Limitations

- Custom Python functions may not be supported
- Some pandas-specific features require conversion
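When an operation hits one of these limitations, a common application-side pattern is to attempt the SQL path and drop down to pandas explicitly if it fails. A minimal sketch of that pattern in plain Python (`sql_path` and `pandas_path` are placeholder callables standing in for the chdb operation and its pandas equivalent; they are not part of the DataStore API):

```python
def run_with_fallback(primary, fallback):
    """Try the SQL-backed path first; fall back to the pandas path
    if the operation is not supported there."""
    try:
        return primary()
    except NotImplementedError:
        return fallback()

# Placeholder callables for illustration only
def sql_path():
    raise NotImplementedError("not expressible in SQL")

def pandas_path():
    return "computed via pandas"

print(run_with_fallback(sql_path, pandas_path))
```

In practice, auto mode performs this kind of routing for you; an explicit fallback is only needed when you have forced `chdb` mode.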
## pandas Mode

Force all operations through pandas with `config.use_pandas()`.

### When to Use

- Compatibility testing with pandas
- Using pandas-specific features
- Debugging pandas-related issues
- When data is already in pandas format

| Operation Type | Performance |
|---|---|
| Simple Single Operations | Good |
| Custom Functions | Excellent |
| Complex Aggregations | Slower than chDB |
| Large Datasets | Memory intensive |
## Cross-DataStore Engine

Configure the engine for operations that combine columns from different DataStores:

```python
# Set the cross-DataStore engine
config.set_cross_datastore_engine('auto')
config.set_cross_datastore_engine('chdb')
config.set_cross_datastore_engine('pandas')
```
### Example

```python
ds1 = pd.read_csv("sales.csv")
ds2 = pd.read_csv("inventory.csv")

# This operation involves two DataStores,
# so it uses the cross_datastore_engine setting
result = ds1.join(ds2, on='product_id')
```
## Engine Selection Logic

### Auto Mode Decision Tree

```
Operation requested
│
├─ Can be expressed in SQL?
│   │
│   ├─ Yes → Use chDB
│   │
│   └─ No → Use pandas
│
└─ Cross-DataStore operation?
    │
    └─ Use cross_datastore_engine setting
```
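The tree above can be sketched as a small Python function. This is illustrative only; DataStore's actual dispatch logic is internal, and the operation set below simply mirrors the auto-mode lists earlier in this guide:

```python
# Operations that auto mode can push down to SQL (from the lists above)
SQL_COMPATIBLE = {
    'filter', 'where', 'select', 'sort', 'orderby', 'groupby', 'agg',
    'join', 'merge', 'distinct', 'drop_duplicates', 'limit', 'head', 'tail',
}

def choose_engine(op: str, cross_datastore: bool = False,
                  cross_engine: str = 'auto') -> str:
    """Mimic the auto-mode decision tree for a single operation."""
    if cross_datastore and cross_engine != 'auto':
        # An explicit cross-DataStore setting takes precedence
        return cross_engine
    return 'chdb' if op in SQL_COMPATIBLE else 'pandas'

print(choose_engine('groupby'))  # SQL-expressible, so chdb
print(choose_engine('apply'))    # custom function, so pandas
print(choose_engine('join', cross_datastore=True, cross_engine='pandas'))
```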
### Function-Level Override

Some functions can have their engine explicitly configured:

```python
from chdb.datastore.config import function_config

# Force specific functions to use a specific engine
function_config.use_chdb('length', 'substring')
function_config.use_pandas('upper', 'lower')
```

See Function Config for details.
Benchmark results on 10M rows:

| Operation | pandas (ms) | chdb (ms) | Speedup |
|---|---|---|---|
| GroupBy count | 347 | 17 | 19.93x |
| Combined ops | 1,535 | 234 | 6.56x |
| Complex pipeline | 2,047 | 380 | 5.39x |
| Filter+Sort+Head | 1,537 | 350 | 4.40x |
| GroupBy agg | 406 | 141 | 2.88x |
| Single filter | 276 | 526 | 0.52x |
Key insights:

- chDB excels at aggregations and complex pipelines
- pandas is slightly faster for simple single operations
- Use `auto` mode to get the best of both
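To sanity-check numbers like these on your own data, a simple timing harness is enough. A minimal sketch in pure Python (the two workloads below are placeholders; in practice you would run the same DataStore pipeline once under `config.use_pandas()` and once under `config.use_chdb()`):

```python
import time

def bench_ms(fn, repeats: int = 5) -> float:
    """Return the best wall-clock time in milliseconds over several runs."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

# Placeholder workloads; substitute the same pipeline under each engine
data = list(range(200_000))
t_a = bench_ms(lambda: sum(data))
t_b = bench_ms(lambda: sorted(data, reverse=True))
print(f"speedup: {t_b / t_a:.2f}x")
```

Taking the best of several runs reduces noise from caching and interpreter warm-up, which matters when the operations being compared run in milliseconds.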
## Best Practices

- **Start with Auto Mode**

  ```python
  config.use_auto()  # Let DataStore decide
  ```

- **Profile Before Forcing an Engine**

  ```python
  config.enable_profiling()
  # Run your workload
  # Check the profiler report to see where time is spent
  ```

- **Force an Engine for Specific Workloads**

  ```python
  # For heavy aggregation workloads
  config.use_chdb()

  # For pandas compatibility testing
  config.use_pandas()
  ```

- **Use `explain()` to Understand Execution**

  ```python
  ds = pd.read_csv("data.csv")
  query = ds.filter(ds['age'] > 25).groupby('city').agg({'salary': 'sum'})

  # See what SQL will be generated
  query.explain()
  ```
## Troubleshooting

### Issue: Operation slower than expected

```python
# Check the current engine
print(config.execution_engine)

# Enable debug output to see what's happening
config.enable_debug()

# Try forcing a specific engine
config.use_chdb()  # or config.use_pandas()
```

### Issue: Unsupported operation in chdb mode

```python
# Some pandas operations aren't supported in SQL
# Solution: use auto mode
config.use_auto()

# Or explicitly convert to pandas first
df = ds.to_df()
result = df.some_pandas_specific_operation()
```

### Issue: Memory issues with large data

```python
# Use the chdb engine to avoid loading all data into memory
config.use_chdb()

# Filter early to reduce data size
result = ds.filter(ds['date'] >= '2024-01-01').to_df()

# For maximum throughput on large datasets, use performance mode,
# which enables parallel Parquet reading and single-SQL aggregation
config.use_performance_mode()
```
## Performance Mode

If you are running heavy aggregation workloads and don't need exact pandas output compatibility (row order, MultiIndex, dtype corrections), consider using performance mode. It automatically sets the engine to chdb and removes all pandas compatibility overhead.
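Assuming the same `config` object used throughout this guide, enabling it is a one-liner:

```python
from chdb.datastore.config import config

# Sets the engine to chdb and skips pandas-compatibility post-processing
config.use_performance_mode()
```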