`federleicht` is a Python package providing a cache decorator for `pandas.DataFrame`, utilizing the lightweight and efficient pyarrow Feather file format.

`federleicht.cache_dataframe` is designed to decorate functions that return `pandas.DataFrame` objects. The decorator saves the DataFrame to a Feather file on the first call and loads it automatically on subsequent calls if the file exists.
`federleicht` caches `pandas.DataFrame` objects effortlessly using the Feather format, known for its speed and simplicity.

## Cache Expiry

To implement cache expiry, `federleicht` requires all arguments of the decorated function to be serializable. The cache will expire under the following conditions:

- The arguments (`args` / `kwargs`) of the decorated function change.
- An `os.PathLike` object is passed as an argument and the size and / or modification time of the file changes.
- The cache is older than a given `timedelta`.

The following argument types are supported:

- `os.PathLike`
- `pandas.DataFrame`
- `pandas.Series`
- `numpy.ndarray`
- `datetime.datetime`
- `types.FunctionType`

Install `federleicht` from PyPI:

```bash
pip install federleicht
```

By default, md5 is used for hashing the arguments. For even faster hashing, you can install xxhash as an optional dependency:

```bash
pip install federleicht[xxhash]
```

Here's a quick example:

```python
import pandas as pd
from federleicht import cache_dataframe


@cache_dataframe
def generate_large_dataframe():
    # Simulate a heavy computation
    return pd.DataFrame({"col1": range(10000), "col2": range(10000)})


df = generate_large_dataframe()
```

The following functions are used to benchmark the performance of the `cache_dataframe` decorator.

```python
import pandas as pd


def read_data(file: str, **kwargs) -> pd.DataFrame:
    """
    Read the earthquake dataset from a CSV file to benchmark the cache.
    Perform some data type conversions and return the DataFrame.
    """
    df = pd.read_csv(
        file,
        header=0,
        dtype={
            "status": "category",
            "tsunami": "boolean",
            "data_type": "category",
            "state": "category",
        },
        **kwargs,
    )
    # "time" holds epoch timestamps in milliseconds
    df["time"] = pd.to_datetime(df["time"], unit="ms")
    # format="mixed" requires pandas 2.0 or newer
    df["date"] = pd.to_datetime(df["date"], format="mixed")
    return df
```

The `pandas.DataFrame` is cached, without its `attrs` dictionary, in the `.pandas_cache` directory and only expires if the file changes. For more details, see the Cache Expiry section.
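
How a change to the file can invalidate a cache entry is easiest to see in code. The following is a sketch under assumptions, not the library's code: `fingerprint` and `cache_key` are hypothetical helpers, and joining the parts with `|` before hashing with md5 is an arbitrary choice. A path argument contributes its file size and modification time to the key, so a changed file yields a different key and the old cache entry is no longer found.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def fingerprint(value):
    """Hypothetical helper: turn one argument into a string for hashing."""
    if isinstance(value, os.PathLike):
        stat = os.stat(value)
        # Size and modification time stand in for the file's content.
        return f"{os.fspath(value)}:{stat.st_size}:{stat.st_mtime_ns}"
    return repr(value)


def cache_key(*args, **kwargs):
    """Combine all argument fingerprints into a single md5 hex digest."""
    parts = [fingerprint(a) for a in args]
    parts += [f"{name}={fingerprint(val)}" for name, val in sorted(kwargs.items())]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    csv = Path(tmp) / "data.csv"
    csv.write_text("a,b\n1,2\n")
    key_before = cache_key(csv, nrows=10)
    key_same = cache_key(csv, nrows=10)        # unchanged file, same arguments
    csv.write_text("a,b\n1,2\n3,4\n")          # the file grows, so its size changes
    key_after = cache_key(csv, nrows=10)
    key_other_args = cache_key(csv, nrows=20)  # same file, different keyword argument
```

In the benchmark, `cache_dataframe` performs this kind of check itself; the `read_cache` wrapper that follows only has to receive the file as a `pathlib.Path`.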
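
```python
import pathlib


# Passing the file as a pathlib.Path (an os.PathLike) ties the cache to the file's state
@cache_dataframe
def read_cache(file: pathlib.Path, **kwargs) -> pd.DataFrame:
    return read_data(file, **kwargs)
```

Results strongly depend on the system configuration and the file system. The following results were obtained on:
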
| nrows | read_data [s] | build_cache [s] | read_cache [s] |
|---|---|---|---|
| 10000 | 0.060 | 0.076 | 0.037 |
| 32170 | 0.172 | 0.193 | 0.033 |
| 103493 | 0.536 | 0.569 | 0.067 |
| 332943 | 1.658 | 1.791 | 0.143 |
| 1071093 | 5.383 | 5.465 | 0.366 |
| 3445752 | 16.750 | 17.720 | 1.141 |
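
To put the table into perspective, the speedup of a cached read and the one-time cost of building the cache can be derived from the measured values. The snippet below is a small sketch that only reuses the numbers from the table; the names `speedups` and `overheads` are introduced here for illustration.

```python
# (nrows, read_data [s], build_cache [s], read_cache [s]) copied from the table
rows = [
    (10000, 0.060, 0.076, 0.037),
    (32170, 0.172, 0.193, 0.033),
    (103493, 0.536, 0.569, 0.067),
    (332943, 1.658, 1.791, 0.143),
    (1071093, 5.383, 5.465, 0.366),
    (3445752, 16.750, 17.720, 1.141),
]

speedups = []   # how many times faster read_cache is than read_data
overheads = []  # extra time of build_cache relative to read_data, in percent

for nrows, read_s, build_s, cached_s in rows:
    speedups.append(read_s / cached_s)
    overheads.append((build_s / read_s - 1) * 100)
    print(f"{nrows:>8} rows: {speedups[-1]:5.1f}x faster, "
          f"{overheads[-1]:4.1f} % build overhead")
```

With these measurements the cached read is about 1.6 times faster for 10,000 rows and close to 15 times faster for the two largest files, while building the cache costs somewhat more than a plain `read_data` call.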