Save & Versioning
Basic Save
import dfstore
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
dfstore.save(df, name="my_data")
Every call to save() with the same name creates a new version. The old data is never overwritten.
Parameters
| Parameter | Type | Description |
|---|---|---|
| df | pd.DataFrame or pl.DataFrame | The DataFrame to save |
| name | str | Unique name (letters, numbers, _, -; max 128 chars) |
| description | str | Optional human-readable description |
| tags | list | Optional list of tags (see below) |
| notes | str | Optional version-specific note |
| store_path | Path or str | Override the default store location |
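The naming rule in the table can be checked before calling save(). The following is a minimal sketch, assuming "letters, numbers" means ASCII and that the name must be non-empty; is_valid_name is a hypothetical helper, not part of dfstore, and the library's own validation may differ.

```python
import re

# Assumption: only ASCII letters and digits count, and the name is non-empty.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,128}")


def is_valid_name(name) -> bool:
    """Return True if name uses only letters, digits, _ or - and is 1-128 chars."""
    return isinstance(name, str) and _NAME_RE.fullmatch(name) is not None


print(is_valid_name("sales_2024"))  # True
print(is_valid_name("sales 2024"))  # False (contains a space)
print(is_valid_name("x" * 129))     # False (too long)
```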
Adding Metadata
dfstore.save(
df,
name="sales_2024",
description="Annual sales data by region",
tags=["finance", "region=EU"],
notes="Includes Q4 revision",
)
Tags
Tags can be plain strings or key-value pairs given as dicts, and the two forms can be mixed in one list:
# Plain string tags
dfstore.save(df, name="ds", tags=["raw", "finance"])
# Key-value dict tags
dfstore.save(df, name="ds", tags=[{"env": "production"}, {"team": "data"}])
# Mixed
dfstore.save(df, name="ds", tags=["finance", {"env": "production"}])
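To make the mixed form concrete, here is a rough sketch of how plain and key-value tags could be flattened into one uniform list of pairs. normalize_tags is a hypothetical helper written for illustration; how dfstore represents tags internally is not covered here.

```python
from __future__ import annotations

from typing import Optional, Union

Tag = Union[str, dict]


def normalize_tags(tags: list[Tag]) -> list[tuple[str, Optional[str]]]:
    """Flatten mixed tags into (key, value) pairs; plain tags get value None."""
    pairs = []
    for tag in tags:
        if isinstance(tag, str):
            pairs.append((tag, None))
        elif isinstance(tag, dict):
            # A dict tag may hold one or more key-value pairs.
            pairs.extend((str(k), str(v)) for k, v in tag.items())
        else:
            raise TypeError(f"unsupported tag type: {type(tag).__name__}")
    return pairs


print(normalize_tags(["finance", {"env": "production"}]))
# [('finance', None), ('env', 'production')]
```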
Versioning
Each save increments the version counter automatically:
dfstore.save(df_v1, name="dataset") # version 1
dfstore.save(df_v2, name="dataset") # version 2
dfstore.save(df_v3, name="dataset") # version 3
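The append-only behaviour can be pictured with a small in-memory model. This is an illustrative sketch only: InMemoryStore is a hypothetical class, and it says nothing about how dfstore actually persists data.

```python
from collections import defaultdict


class InMemoryStore:
    """Toy append-only store: each save adds a version, nothing is replaced."""

    def __init__(self):
        self._versions = defaultdict(list)  # name -> list of saved objects

    def save(self, obj, name: str) -> int:
        self._versions[name].append(obj)
        return len(self._versions[name])  # versions are numbered from 1

    def load(self, name: str, version: int):
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name!r} has no version {version}")
        return self._versions[name][version - 1]


store = InMemoryStore()
print(store.save([1, 2], name="dataset"))     # 1
print(store.save([1, 2, 3], name="dataset"))  # 2
print(store.load("dataset", version=1))       # [1, 2] (still available)
```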
The VersionRecord returned by save() contains the new version number plus diff information relative to the previous version:
vr = dfstore.save(df_v2, name="dataset")
print(vr.version) # 2
print(vr.shape) # (100, 5)
print(vr.row_diff) # 20 (20 more rows than v1)
print(vr.columns_added) # ['new_col']
print(vr.columns_removed) # []
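The diff fields can be read as simple comparisons between two versions. A minimal sketch, assuming each version is summarised by its row count and column list; diff_versions is a hypothetical function and may not match how dfstore computes these values.

```python
def diff_versions(prev_rows, prev_cols, new_rows, new_cols):
    """Compare two versions by row count and column names."""
    prev_set, new_set = set(prev_cols), set(new_cols)
    return {
        "row_diff": new_rows - prev_rows,
        # Keep the column order of the version each column comes from.
        "columns_added": [c for c in new_cols if c not in prev_set],
        "columns_removed": [c for c in prev_cols if c not in new_set],
    }


d = diff_versions(80, ["a", "b", "c", "d"], 100, ["a", "b", "c", "d", "new_col"])
print(d)
# {'row_diff': 20, 'columns_added': ['new_col'], 'columns_removed': []}
```

The numbers mirror the example above: going from 80 to 100 rows gives a row_diff of 20, and new_col is the only added column.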
Saving Polars DataFrames
dfstore works transparently with polars:
import polars as pl
df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
dfstore.save(df, name="polars_data", description="A polars DataFrame")
You can load it back as either pandas or polars — see Load Data.
Return Value
save() returns a VersionRecord:
vr = dfstore.save(df, name="data")
vr.version # int — version number
vr.saved_at # datetime — UTC timestamp
vr.shape # tuple[int, int] — (rows, cols)
vr.columns # list[str]
vr.dtypes # dict[str, str]
vr.null_counts # dict[str, int]
vr.notes # str
vr.row_diff # int — row delta vs previous version (0 for v1)
vr.columns_added # list[str]
vr.columns_removed # list[str]
vr.library # 'pandas' or 'polars'
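These fields lend themselves to lightweight checks after a save. The sketch below uses a SimpleNamespace stand-in with the same attribute names so that it runs without a store; check_record and its threshold are hypothetical and not part of dfstore.

```python
from types import SimpleNamespace


def check_record(vr, max_null_fraction=0.5):
    """Return a list of warnings based on a VersionRecord-like object."""
    warnings = []
    if vr.columns_removed:
        warnings.append(f"columns removed: {vr.columns_removed}")
    rows = vr.shape[0]
    for col, nulls in vr.null_counts.items():
        if rows and nulls / rows > max_null_fraction:
            warnings.append(f"column {col!r} is {nulls}/{rows} null")
    return warnings


# Stand-in for the object returned by save(); the values are made up.
vr = SimpleNamespace(
    version=2,
    shape=(4, 2),
    columns=["a", "b"],
    null_counts={"a": 0, "b": 3},
    columns_removed=["c"],
)
print(check_record(vr))
# ["columns removed: ['c']", "column 'b' is 3/4 null"]
```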