Function-Level Configuration
DataStore allows fine-grained control over execution at the function level, including per-function engine selection and dtype correction.
Function Engine Configuration
Override the execution engine for specific functions.
Setting Function Engines
from chdb.datastore.config import function_config
# Force specific functions to use chdb
function_config.use_chdb('length', 'substring', 'concat')
# Force specific functions to use pandas
function_config.use_pandas('upper', 'lower', 'capitalize')
# Set default preference
function_config.prefer_chdb() # Default to chdb
function_config.prefer_pandas() # Default to pandas
# Reset to auto
function_config.reset()
When to Use
Force chdb for:
- Functions with better ClickHouse performance
- Functions that benefit from SQL optimization
- Large-scale string/datetime operations
Force pandas for:
- Functions with pandas-specific behavior
- When exact pandas compatibility is required
- Custom string operations
Example
from chdb import datastore as pd
from chdb.datastore.config import function_config
# Configure function engines
function_config.use_chdb('length', 'substring')
function_config.use_pandas('upper')
ds = pd.read_csv("data.csv")
# length() will use chdb
ds['name_len'] = ds['name'].str.len()
# substring() will use chdb
ds['prefix'] = ds['name'].str.slice(0, 3)
# upper() will use pandas
ds['name_upper'] = ds['name'].str.upper()
Overlapping Functions
159+ functions are available in both the chdb and pandas engines, including:
| Category | Functions |
|---|---|
| String | length, upper, lower, trim, ltrim, rtrim, concat, substring, replace, reverse, contains, startswith, endswith |
| Math | abs, round, floor, ceil, exp, log, log10, sqrt, pow, sin, cos, tan |
| DateTime | year, month, day, hour, minute, second, dayofweek, dayofyear, quarter |
| Aggregation | sum, avg, min, max, count, std, var, median |
For overlapping functions, the engine is selected based on:
- Explicit function configuration (if set)
- Global execution_engine setting
- Auto-selection based on context
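The selection rules above can be sketched in plain Python. This is a minimal illustration, not DataStore's implementation: `FunctionEngineResolver`, its `resolve` method, and the `auto_choice` argument are hypothetical names, and the sketch assumes the three rules are applied in the order listed.

```python
from typing import Dict

VALID_ENGINES = {"chdb", "pandas"}


class FunctionEngineResolver:
    """Hypothetical stand-in that mimics per-function engine selection."""

    def __init__(self, execution_engine: str = "auto") -> None:
        # Global setting: "chdb", "pandas", or "auto".
        self.execution_engine = execution_engine
        # Explicit per-function overrides, keyed by function name.
        self._overrides: Dict[str, str] = {}

    def use_chdb(self, *names: str) -> None:
        for name in names:
            self._overrides[name] = "chdb"

    def use_pandas(self, *names: str) -> None:
        for name in names:
            self._overrides[name] = "pandas"

    def resolve(self, name: str, auto_choice: str = "chdb") -> str:
        # 1. An explicit function configuration wins if present.
        if name in self._overrides:
            return self._overrides[name]
        # 2. Otherwise fall back to the global execution_engine setting.
        if self.execution_engine in VALID_ENGINES:
            return self.execution_engine
        # 3. Otherwise use whatever auto-selection picked for this context
        #    (passed in here, because the real heuristic is not documented).
        return auto_choice


resolver = FunctionEngineResolver(execution_engine="pandas")
resolver.use_chdb("length")

print(resolver.resolve("length"))  # chdb   (explicit override)
print(resolver.resolve("upper"))   # pandas (global setting)
```

With the global setting left at `"auto"`, `resolve` returns whatever `auto_choice` is passed in, which stands in for DataStore's context-based decision.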
chdb-Only Functions
Some functions are only available through ClickHouse:
| Category | Functions |
|---|---|
| Array | arraySum, arrayAvg, arraySort, arrayDistinct, groupArray, arrayElement |
| JSON | JSONExtractString, JSONExtractInt, JSONExtractFloat, JSONHas |
| URL | domain, path, protocol, extractURLParameter |
| IP | IPv4StringToNum, IPv4NumToString, isIPv4String |
| Geo | greatCircleDistance, geoDistance, geoToH3 |
| Hash | cityHash64, xxHash64, sipHash64, MD5, SHA256 |
| Conditional | sumIf, countIf, avgIf, minIf, maxIf |
These functions automatically use the chdb engine regardless of configuration.
pandas-Only Functions
Some functions are only available through pandas:
| Category | Functions |
|---|---|
| Apply | Custom lambda functions, user-defined functions |
| Complex Pivot | Pivot tables with custom aggregations |
| Stack/Unstack | Complex reshaping operations |
| Interpolate | Time series interpolation methods |
These functions automatically use the pandas engine regardless of configuration.
Dtype Correction
Configure how DataStore corrects data types between engines.
Correction Levels
from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config
# No correction
config.set_correction_level(CorrectionLevel.NONE)
# Critical types only (NULL handling, boolean)
config.set_correction_level(CorrectionLevel.CRITICAL)
# High priority (default) - common type mismatches
config.set_correction_level(CorrectionLevel.HIGH)
# Medium - more aggressive correction
config.set_correction_level(CorrectionLevel.MEDIUM)
# All - correct all possible types
config.set_correction_level(CorrectionLevel.ALL)
Correction Level Details
| Level | Description | Types Corrected |
|---|---|---|
| NONE | No automatic correction | None |
| CRITICAL | Essential corrections | NULL handling, boolean conversion |
| HIGH (default) | Common corrections | Integer/float precision, datetime, string encoding |
| MEDIUM | More corrections | Decimal precision, timezone handling |
| ALL | Maximum correction | All type differences |
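One way to read the table is that each level adds corrections on top of the level before it. The sketch below models that reading with a hypothetical `Level` enum and `corrected_at` helper. Neither is part of chdb, and the cumulative behavior is an assumption rather than something the table states outright.

```python
from enum import IntEnum
from typing import Dict, List


class Level(IntEnum):
    """Hypothetical mirror of the correction levels, ordered by aggressiveness."""
    NONE = 0
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    ALL = 4


# Categories taken from the "Types Corrected" column, keyed by the level
# that introduces them.
ADDED_AT: Dict[Level, List[str]] = {
    Level.CRITICAL: ["NULL handling", "boolean conversion"],
    Level.HIGH: ["integer/float precision", "datetime", "string encoding"],
    Level.MEDIUM: ["decimal precision", "timezone handling"],
    Level.ALL: ["all remaining type differences"],
}


def corrected_at(level: Level) -> List[str]:
    """Return every category corrected at `level`, assuming levels are cumulative."""
    categories: List[str] = []
    for step in Level:
        if step is Level.NONE or step > level:
            continue
        categories.extend(ADDED_AT[step])
    return categories


print(corrected_at(Level.HIGH))
```

Under this reading, moving from `HIGH` to `MEDIUM` only adds decimal and timezone handling and never removes a correction already applied at a lower level.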
When Types Need Correction
Type differences can occur when:
- ClickHouse → pandas: Different integer types (Int64 vs int64)
- pandas → ClickHouse: Python objects to SQL types
- NULL handling: pandas NA vs ClickHouse NULL
- Boolean: Different boolean representations
- DateTime: Timezone differences
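The integer and NULL cases are easy to reproduce with pandas alone. The snippet below does not use DataStore. It only shows how the same column of values can end up with three different dtypes depending on how missing values are represented, which is the kind of mismatch dtype correction is meant to smooth over.

```python
import pandas as pd

# NumPy-backed integers: no room for missing values.
plain = pd.Series([1, 2, 3], dtype="int64")

# pandas nullable integers: missing values are stored as pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Without an explicit dtype, a missing value forces a cast to float (NaN).
coerced = pd.Series([1, 2, None])

for label, series in [("plain", plain), ("nullable", nullable), ("coerced", coerced)]:
    missing = int(series.isna().sum())
    print(f"{label}: dtype={series.dtype}, missing={missing}")
# plain: dtype=int64, missing=0
# nullable: dtype=Int64, missing=1
# coerced: dtype=float64, missing=1
```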
Example
from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config
# Strict mode - expect exact type matches
config.set_correction_level(CorrectionLevel.NONE)
# Relaxed mode - auto-fix type issues
config.set_correction_level(CorrectionLevel.ALL)
Function Configuration API
function_config Object
from chdb.datastore.config import function_config
# Force engine for functions
function_config.use_chdb(*function_names)
function_config.use_pandas(*function_names)
# Set default preference
function_config.prefer_chdb()
function_config.prefer_pandas()
# Reset to default (auto)
function_config.reset()
# Check configuration
function_config.get_engine('length') # Returns 'chdb', 'pandas', or 'auto'
Per-Call Override
Some methods support per-call engine override:
# Using engine parameter (where supported)
ds['result'] = ds['col'].str.upper(engine='pandas')
Best Practices
- Start with Defaults
from chdb.datastore.config import config, function_config
# Use auto mode, let DataStore decide
config.use_auto()
# For ClickHouse-optimized string processing
function_config.use_chdb('length', 'substring', 'concat')
# For pandas-compatible string behavior
function_config.use_pandas('upper', 'lower')
- Use Appropriate Correction Level
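from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config
# Development: more permissive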
config.set_correction_level(CorrectionLevel.ALL)
# Production: stricter
config.set_correction_level(CorrectionLevel.HIGH)
- Test Both Engines
# Test with chdb (process_data() stands in for your own pipeline)
config.use_chdb()
result_chdb = process_data()
# Test with pandas
config.use_pandas()
result_pandas = process_data()
# Compare results
assert result_chdb.equals(result_pandas)
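A strict `equals` check also fails when the two results hold the same values under different dtypes. If the results are, or can be converted to, pandas DataFrames, a values-only comparison can be sketched with `pandas.testing.assert_frame_equal`. The two frames below are hand-built stand-ins for engine output, and `same_values` is a hypothetical helper.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Stand-ins for the two engine results: same values, different integer widths.
result_a = pd.DataFrame({"name_len": pd.Series([3, 5, 4], dtype="int64")})
result_b = pd.DataFrame({"name_len": pd.Series([3, 5, 4], dtype="int32")})


def same_values(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Compare two frames by value and labels, ignoring dtype differences."""
    try:
        assert_frame_equal(left, right, check_dtype=False)
    except AssertionError:
        return False
    return True


# equals() is dtype-sensitive, so it reports a mismatch here.
strict_match = result_a.equals(result_b)

print(strict_match)                     # False
print(same_values(result_a, result_b))  # True
```

Whether a dtype difference should count as a failure depends on the correction level in use, so it can be worth running both checks and treating them as separate signals.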