datasieve: Flexible Data Pipeline

0.1.9 · active · verified Thu Apr 16

The `datasieve` package provides a flexible data pipeline inspired by scikit-learn's Pipeline, but with enhanced capabilities to manipulate `y` (target) and `sample_weight` arrays alongside `X` (features). This is particularly useful for tasks such as removing outliers across all associated data, removing feature columns based on arbitrary criteria, and handling dynamic feature renaming within the pipeline. The current version is 0.1.9, with releases occurring on an irregular, as-needed basis.
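The key difference from scikit-learn's `Pipeline` is that row removals are propagated to every associated array, not just `X`. A minimal NumPy sketch of that idea (illustrating the concept, not datasieve's actual internals):

```python
import numpy as np

# Toy data: 6 rows of features, targets, and per-row weights.
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
sample_weight = np.full(6, 0.5)

# Suppose an outlier detector flags rows 0 and 3. In scikit-learn's
# Pipeline only X flows between steps, so y and sample_weight would
# silently fall out of sync; datasieve applies the same mask to all three.
inlier_mask = np.array([False, True, True, False, True, True])

X, y, sample_weight = X[inlier_mask], y[inlier_mask], sample_weight[inlier_mask]
print(X.shape, y.shape, sample_weight.shape)  # (4, 2) (4,) (4,)
```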

Install

pip install datasieve

Imports

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from datasieve.pipeline import Pipeline
import datasieve.transforms as dst

Quickstart

This example builds a `datasieve` pipeline from `Pipeline` and the bundled `transforms`: `VarianceThreshold` removes constant features, `SKLearnWrapper` wraps scikit-learn's `MinMaxScaler`, `SVMOutlierExtractor` flags and removes outliers (propagating the removal to `y` and `sample_weight`), and `PCA` reduces dimensionality. The synthetic data includes a constant feature and a block of outliers to exercise each step.

import pandas as pd
from datasieve.pipeline import Pipeline
import datasieve.transforms as dst
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Create sample data
X_data = pd.DataFrame(np.random.rand(100, 5), columns=[f'feature_{i}' for i in range(5)])
y_data = pd.Series(np.random.randint(0, 2, 100))
sample_weights = np.random.rand(100)

# Introduce some zero-variance feature and outliers for demonstration
X_data['feature_0'] = 1.0 # Zero variance
X_data.iloc[0:5, 1] = 1000.0 # Outliers

# Build the datasieve pipeline
feature_pipeline = Pipeline([
    ("detect_constants", dst.VarianceThreshold(threshold=0)), # Removes zero-variance features
    ("pre_svm_scaler", dst.SKLearnWrapper(MinMaxScaler(feature_range=(-1, 1)))),
    ("svm_outlier_extractor", dst.SVMOutlierExtractor(nu=0.1)),  # nu bounds the fraction of rows flagged as outliers
    ("pca", dst.PCA(n_components=0.95)) # Dimensionality reduction
])

# Fit and transform the data
X_transformed, y_transformed, sample_weights_transformed = \
    feature_pipeline.fit_transform(X_data.copy(), y_data.copy(), sample_weights.copy())

print("Original X shape:", X_data.shape)
print("Transformed X shape:", X_transformed.shape)
print("Original y length:", len(y_data))
print("Transformed y length:", len(y_transformed))
print("Transformed X (first 5 rows):\n", X_transformed.head() if isinstance(X_transformed, pd.DataFrame) else X_transformed[:5])

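The `detect_constants` step drops `feature_0` because its variance is zero. Conceptually (this is a plain-NumPy sketch, not datasieve's implementation), a threshold of 0 keeps only columns whose variance is strictly greater than zero:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))
X[:, 0] = 1.0  # constant column, variance exactly 0

threshold = 0
keep = X.var(axis=0) > threshold  # strictly greater: zero-variance columns are dropped
X_reduced = X[:, keep]
print(X.shape, "->", X_reduced.shape)  # (100, 5) -> (100, 4)
```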