Kedro
Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It applies software engineering best practices to data and analytics pipelines. The current version is 1.3.1; patch and minor releases ship roughly monthly, with major versions less often.
Warnings
- breaking Kedro dropped support for Python 3.9 in version 1.1.0. Projects using Kedro 1.1.0 or newer must use Python 3.10 or later.
- breaking The `KedroDataCatalog` class was renamed to `DataCatalog` and became the default catalog implementation in Kedro 1.0.0. While most standard workflows were unaffected, programmatic interactions with the catalog, especially direct instantiation or accessing missing datasets (`__getitem__`), might require updates.
- gotcha Kedro relies heavily on its project structure (created by `kedro new`) and configuration files in the `conf/` directory. Deviations or manually created projects without the correct structure can lead to `KedroContextError` or `ConfigLoaderError`.
- gotcha Confusion between `params:` (individual parameters injected as node inputs, e.g. `params:my_key`) and `parameters` (the full parameters dictionary injected as a single node input) is a common pitfall. Both are loaded from `conf/base/parameters.yml`, not from the catalog. The new parameter validation (Kedro >=1.3.0) specifically targets `params:` inputs.
- gotcha Prior to `kedro==1.1.1`, the `project_version` specified in `src/<project_name>/settings.py` had to *exactly match* the installed Kedro package version (including minor and patch versions) to avoid a `ProjectVersionError`.
- deprecated The `--namespace` CLI flag for `kedro run` was deprecated in version 0.19.15. Kedro now promotes proper modular pipelines and explicit dataset prefixing for organization instead.
- gotcha Public APIs marked with the `@experimental` decorator (introduced in 1.2.0) are unstable and may change without backward compatibility guarantees. Use them with caution.
Install
- pip install kedro
- pip install "kedro-datasets[pandas,spark]" (dataset connectors live in the separate kedro-datasets package)
Imports
- Pipeline
from kedro.pipeline import Pipeline
- node
from kedro.pipeline import node
- DataCatalog
from kedro.io import DataCatalog
- MemoryDataset
from kedro.io import MemoryDataset
- SequentialRunner
from kedro.runner import SequentialRunner
- KedroSession
from kedro.framework.session import KedroSession
Quickstart
from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

# 1. Define node functions (plain Python functions)
def greet(name: str) -> str:
    """A node that greets a given name."""
    return f"Hello, {name}!"

def capitalize(text: str) -> str:
    """A node that upper-cases a string."""
    return text.upper()

# 2. Assemble nodes into a pipeline
def create_example_pipeline() -> Pipeline:
    return Pipeline([
        node(
            func=greet,
            inputs="input_name",        # Input dataset key
            outputs="greeting_message", # Output dataset key
            name="greet_user_node",
        ),
        node(
            func=capitalize,
            inputs="greeting_message",
            outputs="final_output",     # Final output dataset key
            name="capitalize_message_node",
        ),
    ])

# 3. Create a DataCatalog with input data
# In a real Kedro project, this is usually defined in conf/base/catalog.yml
catalog = DataCatalog({
    "input_name": MemoryDataset(data="World"),
    "final_output": MemoryDataset(),  # Registered so the result is kept in the catalog
})

# 4. Instantiate the pipeline and a runner
my_pipeline = create_example_pipeline()
runner = SequentialRunner()

# 5. Run the pipeline
# In a real Kedro project, `kedro run` via `KedroSession` orchestrates this.
print("Running Kedro pipeline...")
runner.run(my_pipeline, catalog)

# 6. Retrieve results from the catalog. Because "final_output" is registered
# above, the runner saves it there rather than returning it from run().
final_message = catalog.load("final_output")
print(f"Pipeline finished. Final message: {final_message}")
# Expected output: Pipeline finished. Final message: HELLO, WORLD!