Documentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Pandas Compatibility
DataStore implements 209 pandas DataFrame methods for full API compatibility. Your existing pandas code works with minimal changes.
Compatibility Approach
# Typical migration - just change the import
- import pandas as pd
+ from chdb import datastore as pd
# Your code works unchanged
df = pd.read_csv("data.csv")
result = df[df['age'] > 25].groupby('city')['salary'].mean()
Key principles:
- All 209 pandas DataFrame methods implemented
- Lazy evaluation for SQL optimization
- Automatic type wrapping (DataFrame → DataStore, Series → ColumnExpr)
- Immutable operations (no
inplace=True)
Attributes and Properties
| Property | Description | Triggers Execution |
|---|
shape | (rows, columns) tuple | Yes |
columns | Column names (Index) | Yes |
dtypes | Column data types | Yes |
values | NumPy array | Yes |
index | Row index | Yes |
size | Number of elements | Yes |
ndim | Number of dimensions | No |
empty | Is DataFrame empty | Yes |
T | Transpose | Yes |
axes | List of axes | Yes |
Examples:
from chdb import datastore as pd
ds = pd.read_csv("data.csv")
print(ds.shape) # (1000, 5)
print(ds.columns) # Index(['name', 'age', 'city', 'salary', 'dept'])
print(ds.dtypes) # name: object, age: int64, ...
print(ds.empty) # False
Indexing and Selection
| Method | Description | Example |
|---|
df['col'] | Select column | ds['age'] |
df[['col1', 'col2']] | Select columns | ds[['name', 'age']] |
df[condition] | Boolean indexing | ds[ds['age'] > 25] |
df.loc[...] | Label-based access | ds.loc[0:10, 'name'] |
df.iloc[...] | Integer-based access | ds.iloc[0:10, 0:3] |
df.at[...] | Single value by label | ds.at[0, 'name'] |
df.iat[...] | Single value by position | ds.iat[0, 0] |
df.head(n) | First n rows | ds.head(10) |
df.tail(n) | Last n rows | ds.tail(10) |
df.sample(n) | Random sample | ds.sample(100) |
df.select_dtypes() | Select by Dtype | ds.select_dtypes(include='number') |
df.query() | Query expression | ds.query('age > 25') |
df.where() | Conditional replace | ds.where(ds['age'] > 0, 0) |
df.mask() | Inverse where | ds.mask(ds['age'] < 0, 0) |
df.isin() | Value membership | ds['city'].isin(['NYC', 'LA']) |
df.get() | Safe column access | ds.get('col', default=None) |
df.xs() | Cross-section | ds.xs('key') |
df.pop() | Remove column | ds.pop('col') |
Statistical Methods
| Method | Description | SQL Equivalent |
|---|
mean() | Mean value | AVG() |
median() | Median value | MEDIAN() |
mode() | Mode value | - |
std() | Standard deviation | STDDEV() |
var() | Variance | VAR() |
min() | Minimum | MIN() |
max() | Maximum | MAX() |
sum() | Sum | SUM() |
prod() | Product | - |
count() | Non-null count | COUNT() |
nunique() | Unique count | UNIQ() |
value_counts() | Value frequencies | GROUP BY |
quantile() | Quantile | QUANTILE() |
describe() | Summary statistics | - |
corr() | Correlation matrix | CORR() |
cov() | Covariance matrix | COV() |
corrwith() | Pairwise correlation | - |
rank() | Rank values | RANK() |
abs() | Absolute values | ABS() |
round() | Round values | ROUND() |
clip() | Clip values | - |
cumsum() | Cumulative sum | Window function |
cumprod() | Cumulative product | Window function |
cummin() | Cumulative min | Window function |
cummax() | Cumulative max | Window function |
diff() | Difference | Window function |
pct_change() | Percent change | Window function |
skew() | Skewness | SKEW() |
kurt() | Kurtosis | KURT() |
sem() | Standard error | - |
all() | All true | - |
any() | Any true | - |
idxmin() | Index of min | - |
idxmax() | Index of max | - |
Examples:
ds = pd.read_csv("data.csv")
# Basic statistics
print(ds['salary'].mean())
print(ds['age'].std())
print(ds.describe())
# Group statistics
print(ds.groupby('department')['salary'].mean())
print(ds.groupby('city').agg({'salary': ['mean', 'std'], 'age': 'count'}))
Data Manipulation
| Method | Description |
|---|
drop() | Drop rows/columns |
drop_duplicates() | Remove duplicates |
duplicated() | Mark duplicates |
dropna() | Remove missing values |
fillna() | Fill missing values |
ffill() | Forward fill |
bfill() | Backward fill |
interpolate() | Interpolate values |
replace() | Replace values |
rename() | Rename columns/index |
rename_axis() | Rename axis |
assign() | Add new columns |
astype() | Convert types |
convert_dtypes() | Infer types |
copy() | Copy DataFrame |
Examples:
ds = pd.read_csv("data.csv")
# Drop operations
result = ds.drop(columns=['unused_col'])
result = ds.drop_duplicates(subset=['user_id'])
result = ds.dropna(subset=['email'])
# Fill operations
result = ds.fillna(0)
result = ds.fillna({'age': 0, 'name': 'Unknown'})
# Transform operations
result = ds.rename(columns={'old_name': 'new_name'})
result = ds.assign(
full_name=lambda x: x['first_name'] + ' ' + x['last_name'],
age_group=lambda x: pd.cut(x['age'], bins=[0, 25, 50, 100])
)
Sorting and Ranking
| Method | Description |
|---|
sort_values() | Sort by values |
sort_index() | Sort by index |
nlargest() | N largest values |
nsmallest() | N smallest values |
Examples:
# Sort by single column
result = ds.sort_values('salary', ascending=False)
# Sort by multiple columns
result = ds.sort_values(['department', 'salary'], ascending=[True, False])
# Get top/bottom N
result = ds.nlargest(10, 'salary')
result = ds.nsmallest(5, 'age')
Reshaping
| Method | Description |
|---|
pivot() | Pivot table |
pivot_table() | Pivot with aggregation |
melt() | Unpivot |
stack() | Stack columns to index |
unstack() | Unstack index to columns |
transpose() / T | Transpose |
explode() | Explode lists to rows |
squeeze() | Reduce dimensions |
droplevel() | Drop index level |
swaplevel() | Swap index levels |
reorder_levels() | Reorder levels |
Examples:
# Pivot table
result = ds.pivot_table(
values='amount',
index='region',
columns='product',
aggfunc='sum'
)
# Melt (unpivot)
result = ds.melt(
id_vars=['name'],
value_vars=['score1', 'score2', 'score3'],
var_name='test',
value_name='score'
)
# Explode arrays
result = ds.explode('tags')
Combining / Joining
| Method | Description |
|---|
merge() | SQL-style merge |
join() | Join on index |
concat() | Concatenate |
append() | Append rows |
combine() | Combine with function |
combine_first() | Combine with priority |
update() | Update values |
compare() | Show differences |
Examples:
# Merge (join)
result = pd.merge(df1, df2, on='id', how='left')
result = df1.join(df2, on='id')
# Concatenate
result = pd.concat([df1, df2, df3])
result = pd.concat([df1, df2], axis=1)
Binary Operations
| Method | Description |
|---|
add() / radd() | Addition |
sub() / rsub() | Subtraction |
mul() / rmul() | Multiplication |
div() / rdiv() | Division |
truediv() / rtruediv() | True division |
floordiv() / rfloordiv() | Floor division |
mod() / rmod() | Modulo |
pow() / rpow() | Power |
dot() | Matrix multiplication |
Examples:
# Arithmetic operations
result = ds['col1'].add(ds['col2'])
result = ds['price'].mul(ds['quantity'])
# With fill_value for missing data
result = ds['col1'].add(ds['col2'], fill_value=0)
Comparison Operations
| Method | Description |
|---|
eq() | Equal |
ne() | Not equal |
lt() | Less than |
le() | Less than or equal |
gt() | Greater than |
ge() | Greater than or equal |
equals() | Test equality |
compare() | Show differences |
Function Application
| Method | Description |
|---|
apply() | Apply function |
applymap() | Apply element-wise |
map() | Map values |
agg() / aggregate() | Aggregate |
transform() | Transform |
pipe() | Pipe functions |
groupby() | Group by |
Examples:
# Apply function
result = ds['name'].apply(lambda x: x.upper())
result = ds.apply(lambda row: row['a'] + row['b'], axis=1)
# Aggregate
result = ds.agg({'col1': 'sum', 'col2': 'mean'})
result = ds.agg(['sum', 'mean', 'std'])
# Pipe
result = (ds
.pipe(filter_active)
.pipe(calculate_metrics)
.pipe(format_output)
)
Time Series
| Method | Description |
|---|
rolling() | Rolling window |
expanding() | Expanding window |
ewm() | Exponentially weighted |
resample() | Resample time series |
shift() | Shift values |
asfreq() | Convert frequency |
asof() | Latest value as of |
at_time() | Select at time |
between_time() | Select time range |
first() / last() | First/last periods |
to_period() | Convert to period |
to_timestamp() | Convert to timestamp |
tz_convert() | Convert timezone |
tz_localize() | Localize timezone |
Examples:
# Rolling window
result = ds['value'].rolling(window=7).mean()
# Expanding window
result = ds['value'].expanding().sum()
# Shift
result = ds['value'].shift(1) # Lag
result = ds['value'].shift(-1) # Lead
Missing Data
| Method | Description |
|---|
isna() / isnull() | Detect missing |
notna() / notnull() | Detect non-missing |
dropna() | Drop missing |
fillna() | Fill missing |
ffill() | Forward fill |
bfill() | Backward fill |
interpolate() | Interpolate |
replace() | Replace values |
I/O Methods
| Method | Description |
|---|
to_csv() | Export to CSV |
to_json() | Export to JSON |
to_excel() | Export to Excel |
to_parquet() | Export to Parquet |
to_feather() | Export to Feather |
to_sql() | Export to SQL database |
to_pickle() | Pickle |
to_html() | HTML table |
to_latex() | LaTeX table |
to_markdown() | Markdown table |
to_string() | String representation |
to_dict() | Dictionary |
to_records() | Records |
to_numpy() | NumPy array |
to_clipboard() | Clipboard |
See I/O Operations for detailed documentation.
Iteration
| Method | Description |
|---|
items() | Iterate (column, Series) |
iterrows() | Iterate (index, Series) |
itertuples() | Iterate as named tuples |
Key Differences from Pandas
- Return Types
# Pandas returns Series
pdf['col'] # → pd.Series
# DataStore returns ColumnExpr (lazy)
ds['col'] # → ColumnExpr
- Lazy Execution
# DataStore operations are lazy
result = ds.filter(ds['age'] > 25) # Not executed yet
df = result.to_df() # Executed here
- No inplace Parameter
# Pandas
df.drop(columns=['col'], inplace=True)
# DataStore (always returns new object)
ds = ds.drop(columns=['col'])
- Comparing Results
# Use to_pandas() for comparison
pd.testing.assert_frame_equal(
ds.to_pandas(),
expected_df
)
See Key Differences for complete details.