tsfresh
tsfresh extracts relevant characteristics from time series data, enabling automated feature engineering for machine learning tasks. It supports a wide range of feature calculators, parallel processing, and integrated feature selection. The current version is 0.21.1. New versions typically appear every few months and often include bug fixes and dependency updates, with occasional breaking changes.
Common errors
- ModuleNotFoundError: No module named 'matrixprofile'
  cause: Attempting to extract features that depend on the `matrixprofile` library (e.g., `matrix_profile`) without having it installed.
  fix: Install the optional `matrixprofile` dependency: `pip install tsfresh[matrix_profile]` or `pip install matrixprofile`.
- RuntimeError: Please install dask and distributed for parallel processing.
  cause: A Dask-backed code path was requested (for example, one of the Dask distributors) but the Dask and Distributed libraries are not installed. Note that, per the tsfresh documentation, `n_jobs=0` disables parallelization, and positive `n_jobs` values use local multiprocessing by default.
  fix: Install the optional Dask/Distributed dependencies: `pip install tsfresh[dask]` or `pip install dask distributed`.
- ValueError: column_id not found in dataframe
  cause: The DataFrame passed to `extract_features` does not contain a column with the name specified by `column_id`.
  fix: Ensure your DataFrame has a column whose name matches the value you pass to `column_id` (for example 'id') and that it correctly identifies the individual time series.
- TypeError: Cannot convert float NaN to integer
  cause: This often occurs when feature calculators expect integer inputs but encounter NaN values in the time series data. While `tsfresh` tries to handle NaNs, some specific cases or older versions might not.
  fix: Pass `impute_function=impute` to `extract_features` so that NaN and infinite values in the extracted feature matrix are replaced. If the issue persists, handle NaNs in the input data explicitly before passing it to `tsfresh`.
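The `column_id` and NaN errors above can often be caught before calling `extract_features`. The following is a minimal sketch, assuming a long-format pandas DataFrame like the one in the Quickstart. The helper name `check_timeseries_frame` is hypothetical and is not part of tsfresh.

```python
import pandas as pd


def check_timeseries_frame(df, column_id="id", column_sort=None):
    """Hypothetical pre-flight check (not a tsfresh function).

    Returns a list of human-readable problems; an empty list means the
    frame passed these basic checks.
    """
    problems = []
    # Verify that the columns to be named in extract_features exist.
    for name in (column_id, column_sort):
        if name is not None and name not in df.columns:
            problems.append(f"missing column: {name!r}")
    # Report NaN counts per column so they can be handled explicitly.
    nan_counts = df.isna().sum()
    for name, count in nan_counts[nan_counts > 0].items():
        problems.append(f"column {name!r} has {count} NaN value(s)")
    return problems


# Example frame with a misnamed id column and one missing measurement.
df = pd.DataFrame({
    "series": [1, 1, 2, 2],
    "time": [1, 2, 1, 2],
    "value": [10.0, None, 5.0, 6.0],
})
problems = check_timeseries_frame(df, column_id="id", column_sort="time")
print(problems)
```

Returning a list of problems instead of raising on the first one is a design choice for this sketch: it surfaces the missing column and the NaN values in a single pass.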
Warnings
- breaking tsfresh v0.21.0 dropped support for Python 3.7 and 3.8. v0.19.0 dropped Python 3.6. Ensure your Python environment is 3.9 or newer.
- breaking The `matrixprofile` package became an optional dependency in v0.20.0. If you use features relying on matrix profile without installing it, you will encounter `ModuleNotFoundError`.
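- gotcha According to the tsfresh documentation, `n_jobs=0` disables parallelization rather than using all cores, and positive `n_jobs` values run local multiprocessing by default. Dask and Distributed are needed only for Dask-based execution; without them, those code paths fail with the error listed under Common errors.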
- gotcha Compatibility issues with `scipy` versions 1.15 and higher were fixed in `tsfresh v0.21.0` by relying on the `pywavelets` package for CWT. Older `tsfresh` versions or environments without `pywavelets` might fail.
- gotcha `tsfresh v0.20.1` added compatibility with NumPy 1.24 and Pandas 2.0. Using older `tsfresh` versions with newer NumPy/Pandas might lead to unexpected errors or warnings related to API changes.
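Several of the warnings above come down to the state of the local environment. The sketch below uses only the standard library to check the Python version and to see which optional packages can be imported. The function names are hypothetical, and the module list assumes the usual import names (`pywt` is the import name of the `pywavelets` package).

```python
import importlib.util
import sys

# Import names of the optional packages mentioned above (assumed to match
# the module names that are imported at runtime).
OPTIONAL_MODULES = ("matrixprofile", "dask", "distributed", "pywt")


def missing_optional_modules(names=OPTIONAL_MODULES):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 9)):
    """Check the interpreter against the minimum version stated above."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print("Python 3.9 or newer is needed for tsfresh v0.21.0 and later.")
for name in missing_optional_modules():
    print(f"optional module not installed: {name}")
```

A missing module here is not necessarily a problem; it only matters if the features or execution mode you plan to use depend on it.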
Install
- pip install tsfresh
- pip install tsfresh[dask,matrix_profile,pywavelets]
Imports
- extract_features
from tsfresh import extract_features
- select_features
from tsfresh import select_features
- impute
from tsfresh.utilities.dataframe_functions import impute
- MinimalFCParameters
from tsfresh.feature_extraction import MinimalFCParameters
- EfficientFCParameters
from tsfresh.feature_extraction import EfficientFCParameters
- ComprehensiveFCParameters
from tsfresh.feature_extraction import ComprehensiveFCParameters
Quickstart
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import MinimalFCParameters
# Create a sample time series DataFrame
# 'id' identifies different time series
# 'time' is the time index within each series (can be datetime or int)
# 'value' is the measurement
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'time': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'value': [10, 12, 11, 5, 6, 7, 8, 8, 9]
})
# Define feature extraction settings (e.g., Minimal for speed)
settings = MinimalFCParameters()
# Extract features
# impute_function is recommended to handle NaN values gracefully
features = extract_features(df,
                            column_id='id',
                            column_sort='time',
                            impute_function=impute,
                            default_fc_parameters=settings,
                            n_jobs=0)  # 0 disables parallelization, which is fine for this tiny example
print("Extracted Features:")
print(features.head())