# DataStore Execution Model
Understanding DataStore’s lazy evaluation model is key to using it effectively and achieving optimal performance.

## Lazy Evaluation
DataStore uses lazy evaluation: operations are not executed immediately but are recorded and compiled into optimized SQL queries. Execution happens only when results are actually needed.

### Example: Lazy vs Eager
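A minimal sketch of the distinction, using a hypothetical `LazyFrame` class (illustrative only, not DataStore's actual API):

```python
class LazyFrame:
    """Records operations instead of running them (illustrative sketch)."""
    def __init__(self, rows, ops=None):
        self.rows = rows
        self.ops = ops or []          # pending (op, arg) pairs

    def filter(self, pred):
        # Lazy: nothing runs yet, the predicate is only recorded
        return LazyFrame(self.rows, self.ops + [("filter", pred)])

    def collect(self):
        # Eager point: all recorded operations run in one pass
        result = self.rows
        for op, arg in self.ops:
            if op == "filter":
                result = [r for r in result if arg(r)]
        return result

lf = LazyFrame([1, 2, 3, 4]).filter(lambda x: x > 1).filter(lambda x: x < 4)
print(len(lf.ops))   # 2 recorded filters, nothing executed yet
print(lf.collect())  # execution happens here -> [2, 3]
```

Chained `filter()` calls only record work; `collect()` is the single point where everything runs, mirroring how DataStore compiles pending operations into one query.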
## Benefits of Lazy Evaluation
- Query Optimization: Multiple operations compile to a single optimized SQL query
- Filter Pushdown: Filters are applied at the data source level
- Column Pruning: Only needed columns are read
- Deferred Decisions: Execution engine can be chosen at runtime
- Plan Inspection: You can view/debug the query before executing
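The first three benefits can be sketched with a toy query builder; `QueryPlan`, its methods, and the table name are hypothetical stand-ins for DataStore's internals:

```python
class QueryPlan:
    """Sketch: chained calls accumulate into a single SQL statement."""
    def __init__(self, table, wheres=(), cols=("*",)):
        self.table, self.wheres, self.cols = table, tuple(wheres), tuple(cols)

    def filter(self, cond):
        # Filter pushdown: conditions end up in the WHERE clause at the source
        return QueryPlan(self.table, self.wheres + (cond,), self.cols)

    def select(self, *cols):
        # Column pruning: only the requested columns are read
        return QueryPlan(self.table, self.wheres, cols)

    def to_sql(self):
        sql = f"SELECT {', '.join(self.cols)} FROM {self.table}"
        if self.wheres:
            sql += " WHERE " + " AND ".join(self.wheres)
        return sql

plan = QueryPlan("events").filter("ts >= '2024-01-01'").select("user_id", "ts")
print(plan.to_sql())
# SELECT user_id, ts FROM events WHERE ts >= '2024-01-01'
```

Two chained operations compile to one query, which also makes the plan inspectable before anything executes.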
## Execution Triggers
Execution is triggered automatically when you need actual values.

### Automatic Triggers
| Trigger | Example | Description |
|---|---|---|
| `print()` / `repr()` | `print(ds)` | Display results |
| `len()` | `len(ds)` | Get row count |
| `.columns` | `ds.columns` | Get column names |
| `.dtypes` | `ds.dtypes` | Get column types |
| `.shape` | `ds.shape` | Get dimensions |
| `.index` | `ds.index` | Get row index |
| `.values` | `ds.values` | Get NumPy array |
| Iteration | `for row in ds` | Iterate over rows |
| `to_df()` | `ds.to_df()` | Convert to pandas |
| `to_pandas()` | `ds.to_pandas()` | Alias for `to_df()` |
| `to_dict()` | `ds.to_dict()` | Convert to dict |
| `to_numpy()` | `ds.to_numpy()` | Convert to array |
| `.equals()` | `ds.equals(other)` | Compare DataStores |
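How such triggers work can be sketched with Python's data-model hooks; the `LazyResult` class is an illustrative stand-in, not DataStore's implementation:

```python
class LazyResult:
    """Sketch: common Python protocols become execution triggers."""
    def __init__(self, compute):
        self._compute = compute
        self.executions = 0           # how many times the query actually ran

    def _run(self):
        self.executions += 1
        return self._compute()

    def __len__(self):                # len(ds) forces execution
        return len(self._run())

    def __repr__(self):               # print(ds) forces execution
        return repr(self._run())

ds = LazyResult(lambda: [10, 20, 30])
n = len(ds)               # first trigger
print(ds)                 # second trigger
print(ds.executions)      # 2 -> without caching, each trigger runs the query
```

This is why accidental triggers (a stray `print`, a `len` check) can cost a full execution each.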
## Operations That Stay Lazy
| Operation | Returns | Description |
|---|---|---|
| `filter()` | DataStore | Adds WHERE clause |
| `select()` | DataStore | Adds column selection |
| `sort()` | DataStore | Adds ORDER BY |
| `groupby()` | LazyGroupBy | Prepares GROUP BY |
| `join()` | DataStore | Adds JOIN |
| `ds['col']` | ColumnExpr | Column reference |
| `ds[['col1', 'col2']]` | DataStore | Column selection |
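A sketch of why indexing stays lazy: `ds['col']` can return an expression object that builds SQL fragments instead of touching data. The `ColumnExpr` and `Frame` classes here are illustrative, not DataStore's actual classes:

```python
class ColumnExpr:
    """ds['col'] yields an expression, not data (illustrative sketch)."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Comparison builds a SQL fragment lazily instead of comparing values
        return f"{self.name} > {other!r}"

class Frame:
    def __getitem__(self, key):
        if isinstance(key, list):
            # ds[['a', 'b']] -> still a lazy selection; here, just its SQL
            return f"SELECT {', '.join(key)}"
        return ColumnExpr(key)        # ds['a'] -> a column reference

ds = Frame()
print(ds['age'] > 30)        # age > 30  (a fragment, no execution)
print(ds[['age', 'name']])   # SELECT age, name
```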
## Three-Phase Execution
DataStore operations follow a three-phase execution model.

### Phase 1: SQL Query Building (Lazy)
Operations that can be expressed in SQL are accumulated into a pending query.

### Phase 2: Execution Point
When a trigger occurs, the accumulated SQL is executed.

### Phase 3: DataFrame Operations (if any)
If you chain pandas-only operations after execution, they run in memory on the executed DataFrame.

## Viewing Execution Plans
Use `explain()` to see what will be executed.
Pass `verbose=True` for more detail.
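A sketch of what plan inspection might look like; the `Plan` class and its output format are invented for illustration, and only the `explain()` / `verbose=True` names come from DataStore's API:

```python
class Plan:
    """Sketch of plan inspection: view pending SQL without executing it."""
    def __init__(self, sql, steps=()):
        self.sql = sql
        self.steps = tuple(steps)

    def filter(self, cond):
        # Still lazy: extends the pending SQL and records the step
        return Plan(f"{self.sql} WHERE {cond}", self.steps + ("filter",))

    def explain(self, verbose=False):
        out = [f"SQL: {self.sql}"]
        if verbose:
            out.append(f"steps recorded: {len(self.steps)}")
            out.append("engine: auto (undecided until execution)")
        return "\n".join(out)

p = Plan("SELECT name FROM users").filter("age > 30")
print(p.explain())
print(p.explain(verbose=True))
```

Inspecting the plan never triggers execution, so it is safe to call while debugging a pipeline.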
## Caching
DataStore caches execution results to avoid redundant queries.

### How Caching Works
### Cache Invalidation
The cache is invalidated when operations modify the DataStore.

### Manual Cache Control
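How caching, invalidation, and manual clearing can interact, as a pure-Python sketch; the `CachedQuery` class and the `clear_cache()` name are illustrative assumptions, not DataStore's documented API:

```python
class CachedQuery:
    """Sketch: cache the first result, invalidate on plan changes,
    allow manual clearing."""
    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self.runs = 0                  # how many real executions happened

    def result(self):
        if self._cache is None:        # cache miss -> execute and remember
            self.runs += 1
            self._cache = self._compute()
        return self._cache

    def filter(self, pred):            # modifying op -> new plan, empty cache
        return CachedQuery(lambda: [x for x in self.result() if pred(x)])

    def clear_cache(self):             # manual control (hypothetical name)
        self._cache = None

q = CachedQuery(lambda: [1, 2, 3])
q.result(); q.result()
print(q.runs)                  # 1 -> second call served from cache
q.clear_cache()
q.result()
print(q.runs)                  # 2 -> cleared, so it executed again
print(q.filter(lambda x: x > 1).result())   # [2, 3] on a fresh plan
```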
## Mixing SQL and Pandas Operations
DataStore intelligently handles operations that mix SQL and pandas.

### SQL-Compatible Operations
These compile to SQL:

- `filter()`, `where()`
- `select()`
- `groupby()`, `agg()`
- `sort()`, `orderby()`
- `limit()`, `offset()`
- `join()`, `union()`
- `distinct()`
- Column operations (math, comparison, string methods)
### Pandas-Only Operations
These trigger execution and use pandas:

- `apply()` with custom functions
- `pivot_table()` with complex aggregations
- `stack()`, `unstack()`
- Operations on executed DataFrames
### Hybrid Pipelines
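A hybrid pipeline can be sketched with `sqlite3` standing in for the SQL engine: the SQL-compatible prefix runs as one query, and the pandas-only style suffix runs in Python on the fetched rows. Table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("west", 100.0), ("west", 50.0), ("east", 30.0)])

# SQL-compatible prefix: filter + groupby compile to a single query
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE amount > 20 GROUP BY region ORDER BY region"
).fetchall()

# Pandas-only style suffix: a custom labeling function applied
# to the already-executed rows
labeled = [(region, total, "big" if total > 60 else "small")
           for region, total in rows]
print(labeled)   # [('east', 30.0, 'small'), ('west', 150.0, 'big')]
```

The split point matters: everything before the fetch benefits from the SQL engine, everything after runs row by row in memory.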
## Execution Engine Selection
DataStore can execute operations using different engines.

### Auto Mode (Default)
### Force chDB Engine

### Force pandas Engine
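Dispatch across the three modes might look like this sketch; the function, its parameters, and the returned strings are illustrative, not DataStore's real interface:

```python
def execute(plan, engine="auto", sql_compatible=True):
    """Toy engine dispatcher (names are illustrative)."""
    if engine == "auto":
        # Auto mode: prefer the SQL engine when every step compiles to SQL
        engine = "chdb" if sql_compatible else "pandas"
    if engine == "chdb":
        return f"chdb executed: {plan}"
    return f"pandas executed: {plan}"

print(execute("SELECT 1"))                        # auto picks chdb
print(execute("SELECT 1", sql_compatible=False))  # auto falls back to pandas
print(execute("SELECT 1", engine="pandas"))       # forced pandas
```

Forcing an engine trades the optimizer's judgment for predictability, which mainly helps when debugging engine-specific behavior.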
## Performance Implications
### Good: Filter Early

### Bad: Filter Late
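The contrast can be sketched by counting how many rows an expensive transform touches; the pipeline is a toy, but the principle is the same as filter pushdown:

```python
def pipeline(rows, filter_first):
    """Compare filtering before vs after an expensive per-row transform."""
    scanned = 0

    def expensive(r):
        nonlocal scanned
        scanned += 1            # count how many rows the transform touches
        return r * 2

    if filter_first:
        kept = [r for r in rows if r % 2 == 0]      # reduce at the source
        out = [expensive(r) for r in kept]
    else:
        out = [expensive(r) for r in rows]          # transform everything...
        out = [r for r in out if r % 4 == 0]        # ...then discard half
    return out, scanned

early = pipeline(range(100), filter_first=True)
late = pipeline(range(100), filter_first=False)
print(early[0] == late[0])   # True: same result either way
print(early[1], late[1])     # 50 100: early filtering halves the work
```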
### Good: Select Columns Early

### Good: Let SQL Do the Work
## Best Practices Summary
- Chain operations before executing - Build the full query, then trigger once
- Filter early - Reduce data at the source
- Select only needed columns - Column pruning improves performance
- Use `explain()` to understand execution - Debug before running
- Let SQL handle aggregations - ClickHouse is optimized for this
- Be aware of execution triggers - Avoid accidental early execution
- Use caching wisely - Understand when cache is invalidated