Kedro-Datasets
Kedro-Datasets provides data connectors for Kedro projects, covering formats and backends such as CSV, Parquet, Spark, and cloud storage. The library is actively maintained, with new features and updates typically released every one to two months.
Warnings
- breaking The `MatplotlibWriter` dataset was removed in `kedro-datasets` version 9.0.0. Its functionality has been absorbed and replaced by `MatplotlibDataset`.
- deprecated The `overwrite` argument for `ibis.TableDataset` was deprecated in `kedro-datasets` version 9.0.0. It is mapped to the new `mode` argument for backward compatibility but will be removed in a future release.
- gotcha Many datasets in `kedro-datasets` rely on optional dependencies (extras). If you install `kedro-datasets` without the extras a dataset needs (e.g., `[pandas]`, `[spark]`, `[s3]`), using that dataset raises `ModuleNotFoundError` or `ImportError`.
- gotcha `kedro-datasets` version 9.3.0 introduced compatibility with pandas 3.0. Users on older `kedro-datasets` versions combined with pandas 3.0 might experience unexpected behavior or errors.
- gotcha New "experimental" datasets are introduced frequently (e.g., in 9.2.0 and 9.3.0). Their APIs may change, and they may even be removed, in minor versions without being flagged as breaking changes.
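The pandas 3.0 warning above can be turned into a startup check. The following is a minimal sketch, assuming only the version numbers stated in the warning (pandas 3.0 and `kedro-datasets` 9.3.0); the helper names `parse_release`, `installed_version`, and `pandas3_needs_newer_datasets` are hypothetical and not part of either library.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Reduce a version string to (major, minor, patch), ignoring suffixes like 'rc1'."""
    numbers = []
    for piece in version_string.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # Pad so that "9.3" compares equal to "9.3.0".
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def pandas3_needs_newer_datasets(datasets_version: str, pandas_version: str) -> bool:
    """True when pandas is 3.0 or later but kedro-datasets predates 9.3.0."""
    return (
        parse_release(pandas_version) >= (3, 0, 0)
        and parse_release(datasets_version) < (9, 3, 0)
    )


# Hypothetical usage: warn early instead of failing later inside a pipeline run.
datasets_version = installed_version("kedro-datasets")
pandas_version = installed_version("pandas")
if (
    datasets_version
    and pandas_version
    and pandas3_needs_newer_datasets(datasets_version, pandas_version)
):
    print(
        f"pandas {pandas_version} with kedro-datasets {datasets_version}: "
        "consider upgrading kedro-datasets to 9.3.0 or later."
    )
```

The comparison is deliberately coarse: it looks only at the first three numeric components, which is enough for the two thresholds named in the warning.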
Install
- pip install kedro-datasets
- pip install kedro-datasets[all]
- pip install kedro-datasets[pandas,spark,s3]
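After installing, it can help to confirm that the third-party modules behind the chosen extras are actually importable. The sketch below uses only the standard library. The mapping from extra name to module name (`pandas`, `pyspark`, `s3fs`) is an assumption made for illustration, not something read from the `kedro-datasets` packaging metadata, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from the extras named in this document to the top-level
# module each one is expected to provide. These module names are illustrative
# assumptions, not taken from the kedro-datasets packaging metadata.
ASSUMED_EXTRA_MODULES = {"pandas": "pandas", "spark": "pyspark", "s3": "s3fs"}


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_modules(ASSUMED_EXTRA_MODULES.values())
if missing:
    print(f"Not importable: {', '.join(missing)}")
else:
    print("All checked modules are importable.")
```

`find_spec` locates a top-level module without importing it, so the check stays cheap even for heavy packages such as Spark.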
Imports
- CSVDataset
from kedro_datasets.pandas import CSVDataset
- SparkDataset
from kedro_datasets.spark import SparkDataset
- MatplotlibDataset
from kedro_datasets.matplotlib import MatplotlibDataset
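Because each of these imports can fail when the matching extra is missing, one option is to wrap the import and attach an install hint to the error. This is a minimal sketch under that assumption; `import_dataset` and its `extra_hint` parameter are hypothetical names, not part of `kedro-datasets`.

```python
from importlib import import_module


def import_dataset(module_path: str, class_name: str, extra_hint: str):
    """Import class_name from module_path, adding an install hint on failure."""
    try:
        module = import_module(module_path)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so both are covered.
        raise ImportError(
            f"Could not import {module_path!r} ({exc}). "
            f'Try: pip install "kedro-datasets[{extra_hint}]"'
        ) from exc
    return getattr(module, class_name)


# Hypothetical usage, left commented out because it needs kedro-datasets
# installed with its pandas extra:
# CSVDataset = import_dataset("kedro_datasets.pandas", "CSVDataset", "pandas")
```

Chaining with `from exc` keeps the original traceback, so the underlying missing module is still visible next to the hint.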
Quickstart
import pandas as pd
import os
from kedro_datasets.pandas import CSVDataset
# 1. Create a dummy CSV file
data = {"col1": [1, 2, 3], "col2": ["A", "B", "C"]}
df = pd.DataFrame(data)
filepath = "my_dummy_data.csv"
df.to_csv(filepath, index=False)
print(f"Created dummy data at: {filepath}\n")
# 2. Initialize the CSVDataset
csv_dataset = CSVDataset(filepath=filepath, save_args={"index": False})
# 3. Load data
loaded_df = csv_dataset.load()
print("Loaded DataFrame from CSVDataset:\n")
print(loaded_df)
# 4. Save new data using the dataset
new_data = pd.DataFrame({"col1": [4, 5], "col2": ["D", "E"]})
csv_dataset.save(new_data)
print("\nSaved new data to the CSV file.\n")
# 5. Verify by loading again
reloaded_df = csv_dataset.load()
print("Reloaded DataFrame after saving new data:\n")
print(reloaded_df)
# 6. Clean up the dummy file
os.remove(filepath)
print(f"\nCleaned up dummy data file: {filepath}")
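A variant of the quickstart writes into a temporary directory, so the file is cleaned up even if a step fails partway through. This is a sketch rather than a recommended pattern: it uses the same `CSVDataset` calls as above (`filepath`, `save_args`, `save`, `load`), and the function name `round_trip_rows` and the file name `example.csv` are placeholders chosen for illustration. It returns `None` when pandas or `kedro-datasets` is not installed.

```python
import tempfile
from pathlib import Path


def round_trip_rows():
    """Save a small frame with CSVDataset, reload it, and return its rows.

    Returns None when pandas or kedro-datasets (with its pandas extra)
    is not installed.
    """
    try:
        import pandas as pd
        from kedro_datasets.pandas import CSVDataset
    except ImportError:
        return None

    # The directory and everything in it are deleted when the block exits.
    with tempfile.TemporaryDirectory() as tmp_dir:
        filepath = Path(tmp_dir) / "example.csv"
        dataset = CSVDataset(filepath=str(filepath), save_args={"index": False})
        dataset.save(pd.DataFrame({"col1": [4, 5], "col2": ["D", "E"]}))
        # Convert to plain records before the temporary directory disappears.
        return dataset.load().to_dict(orient="records")


rows = round_trip_rows()
print(rows if rows is not None else "pandas or kedro-datasets is not installed")
```

Returning plain records instead of the DataFrame keeps the result independent of the temporary file once the directory has been removed.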