Kedro: Reproducible, Maintainable Data Science Pipelines
Most data science projects start in a single notebook and slowly turn into a tangle of cells, hardcoded file paths, and code that only runs on one laptop. Kedro is an open-source Python framework that applies software-engineering discipline to data science work, giving your project a standard structure, a declarative way to manage data, and a clean separation between code, configuration, and data. This tutorial walks through the concepts and shows a complete, runnable example.
What Kedro Is and the Problems It Solves
Kedro is not a scheduler and not a notebook tool. It is a project framework: an opinionated way to organise a data science codebase so that it is reproducible, testable, and ready to hand off to other engineers. It was created at QuantumBlack (part of McKinsey) and is now maintained under the LF AI & Data Foundation.
Anyone who has maintained data science code recognises the recurring problems:
- Notebook chaos. Logic lives in cells that must be run in the right order. Hidden state makes results impossible to reproduce.
- Hardcoded paths. pd.read_csv("/Users/ruby/Downloads/datav3final.csv") works on exactly one machine and breaks the moment a colleague clones the repository.
- Tangled concerns. Connection strings, model hyperparameters, and business logic are interleaved in the same file, so changing an environment means editing source code.
- No clear lineage. It is unclear which function produces which artifact and what depends on what.
Kedro addresses these through a few core principles:
- Separation of code, configuration, and data. Code lives in src/, configuration in conf/, data in data/. Each can change independently.
- A declarative Data Catalog. Datasets are named entities described in YAML, never raw paths buried in code.
- Modularity. Work is broken into small pure functions (nodes) composed into pipelines, which makes testing and reuse straightforward.
- Reproducibility. Anyone can clone the repository, install dependencies, and execute kedro run to get the same result.
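The modularity principle can be sketched without Kedro installed. The following is a minimal illustration using only the standard library: two small pure functions of the kind that would become nodes, composed by hand. The function names, the field names (price, area_sqm), and the sample rows are hypothetical; in a real project Kedro would wire the functions into a pipeline and handle loading and saving.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def drop_missing_prices(rows: List[Row]) -> List[Row]:
    """Return only the rows that have a price; the input list is not mutated."""
    return [row for row in rows if row.get("price") is not None]


def add_price_per_sqm(rows: List[Row]) -> List[Row]:
    """Return new rows with a derived price_per_sqm field added."""
    return [
        {**row, "price_per_sqm": row["price"] / row["area_sqm"]}
        for row in rows
    ]


def run_steps(rows: List[Row]) -> List[Row]:
    """Compose the two steps by hand, in the order a pipeline would run them."""
    return add_price_per_sqm(drop_missing_prices(rows))


# Hypothetical sample data, used only to exercise the functions.
sample = [
    {"price": 300000, "area_sqm": 100},
    {"price": None, "area_sqm": 80},
]
result = run_steps(sample)
print(result)
```

Because each function depends only on its arguments and returns a new value, it can be unit-tested with a handful of in-memory rows, with no files or notebook state involved.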
Importantly, Kedro provides structure, not scheduling. When you need cron-like orchestration, retries, or a production scheduler, you deploy a Kedro pipeline onto Airflow, Dagster, Argo, Databricks, or similar. Kedro gives shape to the project; those tools run it on a schedule.
Installation and Creating a Project
Kedro requires Python 3.9 or newer. Always install into a virtual environment.
python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
pip install kedro
kedro info
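Before installing, it can help to confirm that the interpreter meets the version requirement and that a virtual environment is active. The check below is a small sketch using only the standard library; the helper names python_is_supported and in_virtualenv are hypothetical, and the 3.9 minimum is taken from the requirement stated above.

```python
import sys

# Minimum version taken from the requirement stated in this tutorial.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Compare the (major, minor) part of a version against the minimum."""
    return tuple(version_info[:2]) >= minimum


def in_virtualenv() -> bool:
    """Inside a venv, sys.prefix points at the environment, not the base install."""
    return sys.prefix != sys.base_prefix


print("Python version supported:", python_is_supported())
print("Running inside a virtual environment:", in_virtualenv())
```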
Create a new project. Kedro uses starters (templates); the example below uses the standard tooling prompts.
kedro new --name price-prediction
You will be asked which tooling to include (linting, testing, logging, documentation, data structure). For this tutorial we assume a project named price-prediction for a house price regression model.
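The project name contains a hyphen, which is not valid in a Python package name, so the package under src/ uses underscores instead (price_prediction). The helper below is a rough sketch of that kind of normalisation; it is an assumption for illustration, not Kedro's actual implementation, and to_package_name is a hypothetical name.

```python
def to_package_name(project_name: str) -> str:
    """Approximate a Python package name from a human-readable project name.

    Illustrative only: lowercases the name and replaces hyphens and spaces
    with underscores so the result is a valid Python identifier.
    """
    return project_name.strip().lower().replace("-", "_").replace(" ", "_")


package = to_package_name("price-prediction")
print(package)
```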
The Standard Directory Structure
price-prediction/
├── conf/
│   ├── base/
│   │   ├── catalog.yml        # dataset definitions
│   │   ├── parameters.yml     # pipeline parameters
│   │   └── logging.yml
│   └── local/
│       └── credentials.yml    # secrets, never committed
├── data/
│   ├── 01_raw/                # immutable source data
│   ├── 02_intermediate/       # typed, cleaned data
│   ├── 03_primary/
│   ├── 04_feature/
│   ├── 05_model_input/
│   ├── 06_models/
│   ├── 07_model_output/
│   └── 08_reporting/
├── notebooks/
├── src/
│   └── price_prediction/
│       ├── pipelines/
│       ├── pipeline_registry.py
│       └── settings.py
├── pyproject.toml
└── requirements.txt
The layered data/ folders express a data-engineering convention: raw data is immutable, and each subsequent layer is derived from the previous one. You are free to use these layers as labels; Kedro does not force a fixed meaning, but the convention makes lineage obvious at a glance.
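The layering convention can be made concrete with a short sketch. The code below, using only pathlib, lists the standard layer folders in order, builds a path inside a layer, and reports which layers sit upstream of a given one. The helper names layer_path and upstream_layers and the file name houses.csv are hypothetical; Kedro itself does not require or provide these helpers.

```python
from pathlib import Path
from typing import List

# Layer folder names in derivation order, as in the standard project layout.
LAYERS: List[str] = [
    "01_raw",
    "02_intermediate",
    "03_primary",
    "04_feature",
    "05_model_input",
    "06_models",
    "07_model_output",
    "08_reporting",
]


def layer_path(data_root: Path, layer: str, filename: str) -> Path:
    """Build the path of a file inside one of the known layers."""
    if layer not in LAYERS:
        raise ValueError(f"Unknown data layer: {layer!r}")
    return data_root / layer / filename


def upstream_layers(layer: str) -> List[str]:
    """Return the layers that precede the given one in the convention."""
    return LAYERS[: LAYERS.index(layer)]


raw_file = layer_path(Path("data"), "01_raw", "houses.csv")  # hypothetical file
print(raw_file.as_posix())
print(upstream_layers("03_primary"))
```

Reading the numeric prefixes left to right gives the expected direction of derivation, which is what makes lineage visible from the folder listing alone.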