Key Differences from pandas
While DataStore is highly compatible with pandas, there are important differences to understand.
Summary Table
| Aspect | pandas | DataStore |
|---|---|---|
| Execution | Eager (immediate) | Lazy (deferred) |
| Return types | DataFrame/Series | DataStore/ColumnExpr |
| Row order | Preserved | Preserved (automatic); not guaranteed in performance mode |
| inplace | Supported | Not supported |
| Index | Full support | Simplified |
| Memory | All data in memory | Data at source |
Lazy vs Eager Execution
pandas (Eager)
Operations execute immediately:
import pandas as pd
df = pd.read_csv("data.csv") # Loads entire file NOW
result = df[df['age'] > 25] # Filters NOW
grouped = result.groupby('city')['salary'].mean() # Aggregates NOW
DataStore (Lazy)
Operations are deferred until results are needed:
from chdb import datastore as pd
ds = pd.read_csv("data.csv") # Just records the source
result = ds[ds['age'] > 25] # Just records the filter
grouped = result.groupby('city')['salary'].mean() # Just records
# Execution happens here:
print(grouped) # Executes when displaying
df = grouped.to_df() # Or when converting to pandas
Why It Matters
Lazy execution enables:
- Query optimization: Multiple operations compile to one SQL query
- Column pruning: Only needed columns are read
- Filter pushdown: Filters apply at the source
- Memory efficiency: Don’t load data you don’t need
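The deferral behind these benefits can be sketched in plain Python. The `LazyQuery` class below is purely illustrative (it is not chdb's internals): each operation only records itself, and everything compiles into a single SQL string when the result is requested.

```python
# Toy sketch of lazy query building (illustrative only, not chdb's internals).
class LazyQuery:
    def __init__(self, source):
        self.source = source
        self.filters = []        # recorded, not executed
        self.columns = "*"

    def where(self, condition):
        q = LazyQuery(self.source)            # immutable: return a new object
        q.filters = self.filters + [condition]
        q.columns = self.columns
        return q

    def select(self, *cols):
        q = LazyQuery(self.source)
        q.filters = list(self.filters)
        q.columns = ", ".join(cols)           # column pruning: only these are read
        return q

    def to_sql(self):
        # All recorded steps compile to ONE query; filters land in WHERE
        # (filter pushdown), selected columns in SELECT (column pruning).
        sql = f"SELECT {self.columns} FROM {self.source}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

q = LazyQuery("file('data.csv')").where("age > 25").select("city", "salary")
print(q.to_sql())
# SELECT city, salary FROM file('data.csv') WHERE age > 25
```

No data moves until `to_sql()`'s output is actually executed; the chain of method calls is just bookkeeping.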
Return Types
pandas
df['col'] # Returns pd.Series
df[['a', 'b']] # Returns pd.DataFrame
df[df['x'] > 10] # Returns pd.DataFrame
df.groupby('x') # Returns DataFrameGroupBy
DataStore
ds['col'] # Returns ColumnExpr (lazy)
ds[['a', 'b']] # Returns DataStore (lazy)
ds[ds['x'] > 10] # Returns DataStore (lazy)
ds.groupby('x') # Returns LazyGroupBy
Converting to pandas Types
# Get pandas DataFrame
df = ds.to_df()
df = ds.to_pandas()
# Get pandas Series from column
series = ds['col'].to_pandas()
# Or trigger execution
print(ds) # Automatically converts for display
Execution Triggers
DataStore executes when you need actual values:
| Trigger | Example | Notes |
|---|---|---|
| print() / repr() | print(ds) | Display needs data |
| len() | len(ds) | Need row count |
| .columns | ds.columns | Need column names |
| .dtypes | ds.dtypes | Need type info |
| .shape | ds.shape | Need dimensions |
| .values | ds.values | Need actual data |
| .index | ds.index | Need index |
| to_df() | ds.to_df() | Explicit conversion |
| Iteration | for row in ds | Need to iterate |
| equals() | ds.equals(other) | Need comparison |
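The trigger mechanism can be pictured with a toy wrapper (illustrative, not the DataStore implementation): execution lives inside the dunder methods that need real values, so building the object costs nothing.

```python
# Toy illustration of execution triggers: the query runs only inside
# methods that need actual values (here __repr__ and __len__).
class LazyResult:
    def __init__(self, compute):
        self._compute = compute   # zero-arg function producing the rows
        self.executed = False

    def _run(self):
        self.executed = True
        return self._compute()

    def __len__(self):            # len(ds) -> needs the row count
        return len(self._run())

    def __repr__(self):           # print(ds) -> needs the data
        return repr(self._run())

r = LazyResult(lambda: [n for n in range(100) if n % 2 == 0])
print(r.executed)   # False — nothing has run yet
print(len(r))       # 50  — len() triggers execution
print(r.executed)   # True
```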
Operations That Stay Lazy
| Operation | Returns |
|---|---|
| filter() | DataStore |
| select() | DataStore |
| sort() | DataStore |
| groupby() | LazyGroupBy |
| join() | DataStore |
| ds['col'] | ColumnExpr |
| ds[['a', 'b']] | DataStore |
| ds[condition] | DataStore |
Row Order
pandas
Row order is always preserved:
df = pd.read_csv("data.csv")
print(df.head()) # Always same order as file
DataStore
Row order is automatically preserved for most operations:
ds = pd.read_csv("data.csv")
print(ds.head()) # Matches file order
# Filter preserves order
ds_filtered = ds[ds['age'] > 25] # Same order as pandas
DataStore automatically tracks original row positions internally (using rowNumberInAllBlocks()) to ensure order consistency with pandas.
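The idea can be illustrated in plain Python: tag each row with its source position (the role `rowNumberInAllBlocks()` plays in ClickHouse), filter, then sort by that tag so the output matches file order even if the engine returned rows out of order. Everything besides the function name is an illustrative sketch.

```python
rows = [{"name": "a", "age": 30}, {"name": "b", "age": 20}, {"name": "c", "age": 40}]

# Tag each row with its original position (what rowNumberInAllBlocks() provides).
tagged = list(enumerate(rows))

# A parallel engine may hand back filtered rows in any order; simulate that
# by iterating in reverse.
filtered = [(pos, row) for pos, row in reversed(tagged) if row["age"] > 25]

# Restore source order by sorting on the recorded position.
ordered = [row for pos, row in sorted(filtered)]
print([r["name"] for r in ordered])   # ['a', 'c'] — same order as the file
```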
When Order Is Preserved
- File sources (CSV, Parquet, JSON, etc.)
- pandas DataFrame sources
- Filter operations
- Column selection
- After explicit sort() or sort_values()
- Operations that define order (nlargest(), nsmallest(), head(), tail())
When Order May Differ
- After groupby() aggregations (use sort_values() to ensure consistent order)
- After merge() / join() with certain join types
- In performance mode (config.use_performance_mode()): row order is not guaranteed for any operation. See Performance Mode.
No inplace Parameter
pandas
df.drop(columns=['col'], inplace=True) # Modifies df
df.fillna(0, inplace=True) # Modifies df
df.rename(columns={'old': 'new'}, inplace=True)
DataStore
inplace=True is not supported. Always assign the result:
ds = ds.drop(columns=['col']) # Returns new DataStore
ds = ds.fillna(0) # Returns new DataStore
ds = ds.rename(columns={'old': 'new'}) # Returns new DataStore
Why No inplace?
DataStore uses immutable operations to enable:
- Query building (lazy evaluation)
- Thread safety
- Easier debugging
- Cleaner code
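The immutable pattern is easy to see with a toy table class (illustrative, not the DataStore implementation): every operation returns a fresh object, so the original is never mutated and each intermediate result can safely keep building its own query.

```python
# Toy immutable table: every operation returns a NEW object, so the
# original is never mutated — the reason inplace=True has no place here.
class Table:
    def __init__(self, data):
        self.data = dict(data)

    def drop(self, col):
        new = {k: v for k, v in self.data.items() if k != col}
        return Table(new)                 # self.data is left untouched

    def rename(self, mapping):
        new = {mapping.get(k, k): v for k, v in self.data.items()}
        return Table(new)

t1 = Table({"old": [1, 2], "extra": [3, 4]})
t2 = t1.drop("extra").rename({"old": "new"})
print(sorted(t1.data))   # ['extra', 'old'] — t1 is unchanged
print(sorted(t2.data))   # ['new']
```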
Index Support
pandas
Full index support:
df = df.set_index('id')
df.loc['user123'] # Label-based access
df.loc['a':'z'] # Label-based slicing
df.reset_index()
df.index.name = 'user_id'
DataStore
Simplified index support:
# Basic operations work
ds.loc[0:10] # Integer position
ds.iloc[0:10] # Same as loc for DataStore
# For pandas-style index operations, convert first
df = ds.to_df()
df = df.set_index('id')
df.loc['user123']
DataStore Source Matters
- DataFrame source: Preserves pandas index
- File source: Uses simple integer index
Comparison Behavior
Comparing with pandas
pandas doesn’t recognize DataStore objects:
import pandas as pd
from chdb import datastore as ds
pdf = pd.DataFrame({'a': [1, 2, 3]})
dsf = ds.DataFrame({'a': [1, 2, 3]})
# This doesn't work as expected
pdf == dsf # pandas doesn't know DataStore
# Solution: convert DataStore to pandas
pdf.equals(dsf.to_pandas()) # True
Using equals()
# DataStore.equals() also works
dsf.equals(pdf) # Compares with pandas DataFrame
Type Inference
pandas
Uses numpy/pandas types:
df['col'].dtype # int64, float64, object, datetime64, etc.
DataStore
May use ClickHouse types:
ds['col'].dtype # Int64, Float64, String, DateTime, etc.
# Types are converted when going to pandas
df = ds.to_df()
df['col'].dtype # Now pandas type
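The conversion can be pictured as a simple lookup from ClickHouse type names to pandas dtypes. This table is an illustrative sketch of the idea; the real conversion happens inside chdb.

```python
# Illustrative mapping from ClickHouse type names to pandas dtype strings.
# (The actual conversion is handled inside chdb; this is only a sketch.)
CLICKHOUSE_TO_PANDAS = {
    "Int64": "int64",
    "Float64": "float64",
    "String": "object",
    "DateTime": "datetime64[ns]",
}

def to_pandas_dtype(ch_type: str) -> str:
    # Fall back to 'object' for any type without a direct equivalent.
    return CLICKHOUSE_TO_PANDAS.get(ch_type, "object")

print(to_pandas_dtype("Int64"))     # int64
print(to_pandas_dtype("DateTime"))  # datetime64[ns]
```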
Explicit Casting
# Force specific type
ds['col'] = ds['col'].astype('int64')
Memory Model
pandas
All data lives in memory:
df = pd.read_csv("huge.csv") # 10GB in memory!
DataStore
Data stays at source until needed:
ds = pd.read_csv("huge.csv") # Just metadata
ds = ds.filter(ds['year'] == 2024) # Still just metadata
# Only filtered result is loaded
df = ds.to_df() # Maybe only 1GB now
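The "data stays at source" behavior can be sketched with a streaming reader from the standard library: rows are pulled one at a time and only matching rows are materialized, instead of loading the whole file first the way an eager read does. The CSV content here is made up for illustration.

```python
import csv
import io

# Streaming sketch: read one row at a time and keep only matches,
# rather than loading the entire file into memory up front.
raw = io.StringIO("year,value\n2023,1\n2024,2\n2024,3\n")

kept = []
for row in csv.DictReader(raw):    # iterator: one row in memory at a time
    if row["year"] == "2024":      # filter applied while reading the source
        kept.append(row)

print(len(kept))   # 2 — only the filtered rows were materialized
```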
Error Messages
Different Error Sources
- pandas errors: From pandas library
- DataStore errors: From chDB or ClickHouse
# May see ClickHouse-style errors
# "Code: 62. DB::Exception: Syntax error..."
Debugging Tips
# View the SQL to debug
print(ds.to_sql())
# See execution plan
ds.explain()
# Enable debug logging
from chdb.datastore.config import config
config.enable_debug()
Migration Checklist
When migrating from pandas:
- Replace inplace=True calls with explicit assignment (ds = ds.drop(...))
- Convert with to_df() before label-based index operations (set_index, .loc['label'])
- Add sort_values() after groupby() aggregations if row order matters
- Convert with to_pandas() before comparing against a pandas DataFrame
- Expect ClickHouse-style error messages; use to_sql() and explain() to debug
Quick Reference
| pandas | DataStore |
|---|---|
| df[condition] | Same (returns DataStore) |
| df.groupby() | Same (returns LazyGroupBy) |
| df.drop(inplace=True) | ds = ds.drop() |
| df.equals(other) | ds.to_pandas().equals(other) |
| df.loc['label'] | ds.to_df().loc['label'] |
| print(df) | Same (triggers execution) |
| len(df) | Same (triggers execution) |