{"id":8070,"library":"datasieve","title":"datasieve: Flexible Data Pipeline","description":"The `datasieve` package provides a flexible data pipeline inspired by scikit-learn's Pipeline, but with enhanced capabilities to manipulate `y` (target) and `sample_weight` arrays alongside `X` (features). This is particularly useful for tasks such as removing outliers across all associated data, removing feature columns based on arbitrary criteria, and handling dynamic feature renaming within the pipeline. The current version is 0.1.9, with releases occurring on an irregular, as-needed basis.","status":"active","version":"0.1.9","language":"en","source_language":"en","source_url":"https://github.com/emergentmethods/datasieve","tags":["data-pipeline","outlier-detection","feature-engineering","sklearn-compatible","data-preprocessing"],"install":[{"cmd":"pip install datasieve","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core functionality is built on and extends scikit-learn's API and transformers.","package":"scikit-learn","optional":false},{"reason":"Recommended for convenient handling of DataFrame inputs and maintaining column names throughout the pipeline.","package":"pandas","optional":true}],"imports":[{"symbol":"Pipeline","correct":"from datasieve.pipeline import Pipeline"},{"symbol":"transforms","correct":"import datasieve.transforms as dst"},{"symbol":"SKLearnWrapper","correct":"from datasieve.transforms import SKLearnWrapper"}],"quickstart":{"code":"import pandas as pd\nfrom datasieve.pipeline import Pipeline\nimport datasieve.transforms as dst\nfrom sklearn.preprocessing import MinMaxScaler\nimport numpy as np\n\n# Create sample data\nX_data = pd.DataFrame(np.random.rand(100, 5), columns=[f'feature_{i}' for i in range(5)])\ny_data = pd.Series(np.random.randint(0, 2, 100))\nsample_weights = np.random.rand(100)\n\n# Introduce a zero-variance feature and outliers for demonstration\nX_data['feature_0'] = 1.0  # Zero variance\nX_data.iloc[0:5, 1] = 1000.0  # Outliers\n\n# Build the datasieve pipeline\nfeature_pipeline = Pipeline([\n    (\"detect_constants\", dst.VarianceThreshold(threshold=0)),  # Removes zero-variance features\n    (\"pre_svm_scaler\", dst.SKLearnWrapper(MinMaxScaler(feature_range=(-1, 1)))),\n    (\"svm_outlier_extractor\", dst.SVMOutlierExtractor(nu=0.1, kernel='rbf')),\n    (\"pca\", dst.PCA(n_components=0.95))  # Dimensionality reduction\n])\n\n# Fit and transform the data; row removals (outliers) propagate to y and sample_weight\nX_transformed, y_transformed, sample_weights_transformed = \\\n    feature_pipeline.fit_transform(X_data.copy(), y_data.copy(), sample_weights.copy())\n\nprint(\"Original X shape:\", X_data.shape)\nprint(\"Transformed X shape:\", X_transformed.shape)\nprint(\"Original y length:\", len(y_data))\nprint(\"Transformed y length:\", len(y_transformed))\nprint(\"Transformed X (first 5 rows):\\n\", X_transformed.head() if isinstance(X_transformed, pd.DataFrame) else X_transformed[:5])\n","lang":"python","description":"This example demonstrates how to build a `datasieve` pipeline using `Pipeline` and custom `transforms`. It includes `VarianceThreshold` to remove constant features, `SKLearnWrapper` to incorporate a `MinMaxScaler` from scikit-learn, `SVMOutlierExtractor` to identify and remove outliers (propagating the removal to `y` and `sample_weight`), and `PCA` for dimensionality reduction. The example generates synthetic data with a constant feature and outliers to showcase the pipeline's capabilities."},"warnings":[{"fix":"Ensure that `X`, `y`, and `sample_weight` (if applicable) are aligned by index before passing them to `fit_transform` or `transform`. If using NumPy arrays, ensure the row order is consistent.","message":"Input data (X, y, sample_weight) must maintain a consistent row order, and consistent indices if DataFrames/Series are used. `datasieve` transforms assume this consistency, especially when removing rows (e.g., outliers), so that the removals propagate correctly across all inputs.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure that the input data for `transform` (e.g., the test set) has the same feature set and order as the data used for `fit` (e.g., the training set). `datasieve` handles feature renaming internally (e.g., after PCA), but initial input consistency is crucial. Pre-processing steps outside the pipeline should be applied consistently to both sets.","message":"A mismatch in the number or names of features between `fit` and `transform` calls can lead to errors, particularly after feature selection or dimensionality reduction steps within the pipeline.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `pip install datasieve` to install the package. If using a virtual environment, ensure it is activated before installation.","cause":"The datasieve library is not installed in the current Python environment.","error":"No module named 'datasieve'"},{"fix":"Verify that the feature names (column names if using pandas DataFrames) of the data being passed through the pipeline are consistent with what the pipeline expects based on its `fit` operation. Ensure that any manual feature engineering or selection outside the `datasieve` pipeline is applied consistently to both training and test data.","cause":"This error typically occurs when the column names (features) of the input data to a pipeline step do not match the expected features that the step was fitted on. This often happens if columns are dropped or renamed unexpectedly between `fit` and `transform` calls, or if the test set has different columns than the training set.","error":"Exception: Pipeline expected Index(...) but got Index(...)"}]}