datasieve: Flexible Data Pipeline
The `datasieve` package provides a flexible data pipeline inspired by scikit-learn's Pipeline, but with enhanced capabilities to manipulate `y` (target) and `sample_weight` arrays alongside `X` (features). This is particularly useful for tasks such as removing outliers across all associated data, removing feature columns based on arbitrary criteria, and handling dynamic feature renaming within the pipeline. The current version is 0.1.9, with releases occurring on an irregular, as-needed basis.
Common errors
- `No module named 'datasieve'`
  - cause: The datasieve library is not installed in the current Python environment.
  - fix: Run `pip install datasieve` to install the package. If using a virtual environment, ensure it is activated before installation.
- `Exception: Pipeline expected Index(...) but got Index(...)`
  - cause: The column names (features) of the input data to a pipeline step do not match the features the step was fitted on. This often happens when columns are dropped or renamed between `fit` and `transform` calls, or when the test set has different columns than the training set.
  - fix: Verify that the feature names (column names, if using pandas DataFrames) of the data passed through the pipeline match what the pipeline saw during `fit`. Apply any manual feature engineering or selection done outside the `datasieve` pipeline consistently to both training and test data.
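When this error appears, diffing the train and test schemas usually pinpoints the culprit immediately. A minimal sketch using plain pandas (hypothetical column names, not datasieve itself):

```python
import numpy as np
import pandas as pd

# Hypothetical frames: 'feature_2' was accidentally dropped from the test set
X_train = pd.DataFrame(np.random.rand(10, 3),
                       columns=["feature_0", "feature_1", "feature_2"])
X_test = pd.DataFrame(np.random.rand(5, 2),
                      columns=["feature_0", "feature_1"])

# Diff the schemas before calling transform() to pinpoint the mismatch
missing = X_train.columns.difference(X_test.columns)
extra = X_test.columns.difference(X_train.columns)
print("missing from test:", list(missing))   # ['feature_2']
print("unexpected in test:", list(extra))    # []

# Reindex to the training schema: reorders columns and exposes gaps as NaN
X_test_aligned = X_test.reindex(columns=X_train.columns)
```

`reindex` surfaces the missing column as NaN rather than silently proceeding, which makes the root cause visible before the pipeline raises.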
Warnings
- gotcha: Input data (X, y, sample_weight) must maintain consistent row order and indices if DataFrames/Series are used. `datasieve` transforms operate on the assumption of this consistency, especially when removing rows (e.g., outliers) to propagate changes correctly across all inputs.
- gotcha: A mismatch in the number or names of features between `fit` and `transform` calls can lead to errors, particularly after feature selection or dimensionality reduction steps within the pipeline.
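The row-consistency assumption is what makes outlier removal work: one boolean mask must be applied to `X`, `y`, and `sample_weight` together so all three stay aligned. A minimal pandas/numpy sketch of that idea (illustrative only, not datasieve internals):

```python
import numpy as np
import pandas as pd

# Toy inputs sharing one row order; row 2 holds an obvious outlier
X = pd.DataFrame({"f0": [1.0, 2.0, 1000.0, 3.0]})
y = pd.Series([0, 1, 1, 0])
sample_weight = np.array([0.5, 1.0, 1.0, 0.8])

# Build a single boolean mask and apply it to every array
keep = (X["f0"] < 100.0).to_numpy()
X_f = X[keep].reset_index(drop=True)
y_f = y[keep].reset_index(drop=True)
w_f = sample_weight[keep]

# All three outputs shrink together, so downstream steps stay consistent
assert len(X_f) == len(y_f) == len(w_f) == 3
```

Dropping rows from `X` alone (e.g., via `DataFrame.drop`) without touching `y` and `sample_weight` is exactly the misalignment this gotcha warns about.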
Install
- `pip install datasieve`
Imports
- Pipeline: `from datasieve.pipeline import Pipeline`
- transforms: `import datasieve.transforms as dst`
- SKlearnWrapper: `from datasieve.transforms import SKlearnWrapper`
Quickstart
```python
import pandas as pd
import numpy as np
from datasieve.pipeline import Pipeline
import datasieve.transforms as dst
from sklearn.preprocessing import MinMaxScaler

# Create sample data
X_data = pd.DataFrame(np.random.rand(100, 5), columns=[f'feature_{i}' for i in range(5)])
y_data = pd.Series(np.random.randint(0, 2, 100))
sample_weights = np.random.rand(100)

# Introduce a zero-variance feature and some outliers for demonstration
X_data['feature_0'] = 1.0        # Zero variance
X_data.iloc[0:5, 1] = 1000.0     # Outliers

# Build the datasieve pipeline
feature_pipeline = Pipeline([
    ("detect_constants", dst.VarianceThreshold(threshold=0)),  # Removes zero-variance features
    ("pre_svm_scaler", dst.SKlearnWrapper(MinMaxScaler(feature_range=(-1, 1)))),
    ("svm_outlier_extractor", dst.SVMOutlierExtractor(nu=0.1, kernel='rbf')),
    ("pca", dst.PCA(n_components=0.95))  # Dimensionality reduction
])

# Fit and transform the data; rows removed from X are removed from y and weights too
X_transformed, y_transformed, sample_weights_transformed = \
    feature_pipeline.fit_transform(X_data.copy(), y_data.copy(), sample_weights.copy())

print("Original X shape:", X_data.shape)
print("Transformed X shape:", X_transformed.shape)
print("Original y length:", len(y_data))
print("Transformed y length:", len(y_transformed))
print("Transformed X (first 5 rows):\n", X_transformed.head() if isinstance(X_transformed, pd.DataFrame) else X_transformed[:5])
```