Kedro-Datasets

9.3.0 · active · verified Sat Apr 11

Kedro-Datasets provides data connectors for Kedro projects, covering data sources and formats such as CSV, Parquet, Spark, and cloud storage. The library is actively maintained, with new features and updates typically released monthly or bi-monthly.

Quickstart

This quickstart shows how to initialize a common dataset type (`CSVDataset`) from `kedro-datasets` in Python, then load and save data with it. Although `kedro-datasets` is often used through Kedro project configuration (e.g., `catalog.yml`), direct programmatic usage is also fully supported.

import pandas as pd
import os
from kedro_datasets.pandas import CSVDataset

# 1. Create a dummy CSV file
data = {"col1": [1, 2, 3], "col2": ["A", "B", "C"]}
df = pd.DataFrame(data)
filepath = "my_dummy_data.csv"
df.to_csv(filepath, index=False)

print(f"Created dummy data at: {filepath}\n")

# 2. Initialize the CSVDataset
csv_dataset = CSVDataset(filepath=filepath, save_args={"index": False})

# 3. Load data
loaded_df = csv_dataset.load()
print("Loaded DataFrame from CSVDataset:\n")
print(loaded_df)

# 4. Save new data using the dataset
new_data = pd.DataFrame({"col1": [4, 5], "col2": ["D", "E"]})
csv_dataset.save(new_data)
print("\nSaved new data to the CSV file.\n")

# 5. Verify by loading again
reloaded_df = csv_dataset.load()
print("Reloaded DataFrame after saving new data:\n")
print(reloaded_df)

# 6. Clean up the dummy file
os.remove(filepath)
print(f"\nCleaned up dummy data file: {filepath}")
