{"id":4070,"library":"kedro-datasets","title":"Kedro-Datasets","description":"Kedro-Datasets provides a collection of data connectors for Kedro projects that read from and write to data sources and formats such as CSV, Parquet, Spark, and cloud storage. It is actively maintained, with new features and updates typically released monthly or bi-monthly.","status":"active","version":"9.3.0","language":"en","source_language":"en","source_url":"https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets","tags":["data engineering","etl","data catalog","mlops","data connector","data pipeline","pandas","spark"],"install":[{"cmd":"pip install kedro-datasets","lang":"bash","label":"Base installation"},{"cmd":"pip install kedro-datasets[all]","lang":"bash","label":"All optional dependencies"},{"cmd":"pip install kedro-datasets[pandas,spark,s3]","lang":"bash","label":"Common optional dependencies"}],"dependencies":[{"reason":"Required for pandas-related datasets such as CSVDataset and ParquetDataset. Compatibility with pandas 3.0 was added in 9.3.0.","package":"pandas","optional":true},{"reason":"Required for Spark-related datasets such as SparkDataset and DeltaLakeDataset.","package":"pyspark","optional":true},{"reason":"Required for XMLDataset, especially for Python 3.13+.","package":"lxml","optional":true},{"reason":"Required for Ibis-related datasets such as TableDataset.","package":"ibis-framework","optional":true}],"imports":[{"note":"Old import path from pre-0.18 Kedro versions or incorrect casing. 
Datasets are now in `kedro_datasets`.","wrong":"from kedro.io import CSVDataSet","symbol":"CSVDataset","correct":"from kedro_datasets.pandas import CSVDataset"},{"symbol":"SparkDataset","correct":"from kedro_datasets.spark import SparkDataset"},{"note":"MatplotlibWriter was removed in 9.0.0; use MatplotlibDataset instead.","wrong":"from kedro_datasets.matplotlib import MatplotlibWriter","symbol":"MatplotlibDataset","correct":"from kedro_datasets.matplotlib import MatplotlibDataset"}],"quickstart":{"code":"import pandas as pd\nimport os\nfrom kedro_datasets.pandas import CSVDataset\n\n# 1. Create a dummy CSV file\ndata = {\"col1\": [1, 2, 3], \"col2\": [\"A\", \"B\", \"C\"]}\ndf = pd.DataFrame(data)\nfilepath = \"my_dummy_data.csv\"\ndf.to_csv(filepath, index=False)\n\nprint(f\"Created dummy data at: {filepath}\\n\")\n\n# 2. Initialize the CSVDataset\ncsv_dataset = CSVDataset(filepath=filepath, save_args={\"index\": False})\n\n# 3. Load data\nloaded_df = csv_dataset.load()\nprint(\"Loaded DataFrame from CSVDataset:\\n\")\nprint(loaded_df)\n\n# 4. Save new data using the dataset\nnew_data = pd.DataFrame({\"col1\": [4, 5], \"col2\": [\"D\", \"E\"]})\ncsv_dataset.save(new_data)\nprint(\"\\nSaved new data to the CSV file.\\n\")\n\n# 5. Verify by loading again\nreloaded_df = csv_dataset.load()\nprint(\"Reloaded DataFrame after saving new data:\\n\")\nprint(reloaded_df)\n\n# 6. Clean up the dummy file\nos.remove(filepath)\nprint(f\"\\nCleaned up dummy data file: {filepath}\")","lang":"python","description":"This quickstart demonstrates how to programmatically initialize, load, and save data using a common dataset type (CSVDataset) from `kedro-datasets`. While `kedro-datasets` is often used within Kedro project configuration (e.g., `catalog.yml`), direct programmatic usage is also fully supported."},"warnings":[{"fix":"Migrate any usage of `MatplotlibWriter` to `MatplotlibDataset`.","message":"The `MatplotlibWriter` dataset was removed in `kedro-datasets` version 9.0.0. 
Its functionality is now provided by `MatplotlibDataset`.","severity":"breaking","affected_versions":">=9.0.0"},{"fix":"Replace `overwrite=True` with `mode='overwrite'` and `overwrite=False` with `mode='append'`. Supported modes include 'append', 'overwrite', 'error'/'errorifexists', and 'ignore'.","message":"The `overwrite` argument for `ibis.TableDataset` was deprecated in `kedro-datasets` version 9.0.0. It is mapped to the new `mode` argument for backward compatibility but will be removed in a future release.","severity":"deprecated","affected_versions":">=9.0.0"},{"fix":"Install `kedro-datasets` with the specific extras needed for your datasets, e.g., `pip install kedro-datasets[pandas,spark]`, or `pip install kedro-datasets[all]` for comprehensive coverage.","message":"Many datasets in `kedro-datasets` rely on optional dependencies (extras). If you install `kedro-datasets` without specifying the necessary extras (e.g., `[pandas]`, `[spark]`, `[s3]`), you will encounter `ModuleNotFoundError` or `ImportError` when trying to use datasets that require them.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure `kedro-datasets` is updated to version 9.3.0 or newer if using pandas 3.0. For older `kedro-datasets` versions, stick to pandas <3.0.","message":"`kedro-datasets` version 9.3.0 introduced compatibility with pandas 3.0. Users who combine older `kedro-datasets` versions with pandas 3.0 might experience unexpected behavior or errors.","severity":"gotcha","affected_versions":"<9.3.0"},{"fix":"When using experimental datasets, monitor release notes for potential changes. For production systems, prefer stable, non-experimental datasets or ensure thorough testing with specific experimental versions.","message":"New \"experimental\" datasets are frequently introduced (e.g., in 9.2.0 and 9.3.0). 
These datasets are subject to change, including API modifications or even removal, without necessarily being flagged as 'breaking changes' in minor versions.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}