Kedro: Reproducible, Maintainable Data Science Pipelines

Most data science projects start in a single notebook and slowly turn into a tangle of cells, hardcoded file paths, and code that only runs on one laptop. Kedro is an open-source Python framework that applies software-engineering discipline to data science work, giving your project a standard structure, a declarative way to manage data, and a clean separation between code, configuration, and data. This tutorial walks through the concepts and shows a complete, runnable example.

What Kedro Is and the Problems It Solves

Kedro is not a scheduler and not a notebook tool. It is a project framework: an opinionated way to organise a data science codebase so that it is reproducible, testable, and ready to hand off to other engineers. It was created at QuantumBlack (part of McKinsey) and is now maintained under the LF AI & Data Foundation.

Anyone who has maintained data science code recognises the recurring problems:

Notebook chaos. Logic lives in cells that must be run in the right order. Hidden state makes results impossible to reproduce.

Hardcoded paths. pd.read_csv("/Users/ruby/Downloads/data_v3_final.csv") works on exactly one machine and breaks the moment a colleague clones the repository.

Tangled concerns. Connection strings, model hyperparameters, and business logic are interleaved in the same file, so changing an environment means editing source code.

No clear lineage. It is unclear which function produces which artifact and what depends on what.
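
To make the hardcoded-path problem concrete, the following minimal sketch reads a file by a path relative to the project root rather than to one user's home directory. It uses only the standard library; the helper load_rows and the file houses.csv are hypothetical, and the temporary directory merely stands in for a freshly cloned repository.

```python
import csv
import tempfile
from pathlib import Path


def load_rows(project_root: Path, relative_path: str) -> list:
    """Read a CSV located relative to the project root, not to one user's home."""
    with (project_root / relative_path).open(newline="") as handle:
        return list(csv.DictReader(handle))


# A throwaway directory stands in for a cloned repository (assumption).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw_dir = root / "data" / "01_raw"
    raw_dir.mkdir(parents=True)
    (raw_dir / "houses.csv").write_text("area,price\n120,350000\n80,210000\n")

    # The same relative path works wherever the repository is checked out.
    rows = load_rows(root, "data/01_raw/houses.csv")
    print(rows[0]["price"])
```

Because only the project root varies between machines, the loading code itself never needs editing when a colleague clones the repository.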

Kedro addresses these through a few core principles:

Separation of code, configuration, and data. Code lives in src/, configuration in conf/, data in data/. Each can change independently.

A declarative Data Catalog. Datasets are named entities described in YAML, never raw paths buried in code.

Modularity. Work is broken into small pure functions (nodes) composed into pipelines, which makes testing and reuse straightforward.

Reproducibility. Anyone can clone the repository, install dependencies, and run kedro run to get the same result.
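
The modularity principle can be sketched in plain Python. This is an illustration of the idea only: it does not use Kedro's own node or pipeline classes, and every function and dataset name in it (add_price_per_sqm, mean_price_per_sqm, houses, features, mean_ppsqm) is hypothetical. Each step is a pure function, and steps are wired together by dataset name rather than by file path.

```python
from typing import Any, Callable, Dict, List, Tuple


def add_price_per_sqm(houses: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return new rows with a derived feature; the input rows are not mutated."""
    return [{**h, "price_per_sqm": h["price"] / h["area"]} for h in houses]


def mean_price_per_sqm(features: List[Dict[str, float]]) -> float:
    """Average the derived feature over all rows."""
    return sum(f["price_per_sqm"] for f in features) / len(features)


# (function, input dataset name, output dataset name): a stand-in for node wiring.
STEPS: List[Tuple[Callable[[Any], Any], str, str]] = [
    (add_price_per_sqm, "houses", "features"),
    (mean_price_per_sqm, "features", "mean_ppsqm"),
]


def run(steps, datasets: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order, looking inputs up by name and storing outputs by name."""
    store = dict(datasets)
    for func, input_name, output_name in steps:
        store[output_name] = func(store[input_name])
    return store


result = run(STEPS, {"houses": [{"area": 100.0, "price": 300000.0},
                                {"area": 50.0, "price": 200000.0}]})
print(result["mean_ppsqm"])  # 3500.0
```

Since neither function touches the file system or global state, each can be unit-tested by passing in a small list and checking the return value.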

Importantly, Kedro provides structure, not scheduling. When you need cron-like orchestration, retries, or a production scheduler, you deploy a Kedro pipeline onto Airflow, Dagster, Argo, Databricks, or similar. Kedro gives shape to the project; those tools run it on a schedule.

Installation and Creating a Project

Kedro requires Python 3.9 or newer. Always install into a virtual environment.

python -m venv .venv
source .venv/bin/activate    # on Windows: .venv\Scripts\activate
pip install kedro
kedro info
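
If you are unsure whether the interpreter inside the virtual environment meets the version requirement stated above, a quick check such as the following can help. It is a small convenience sketch, not part of Kedro; the helper name meets_minimum is hypothetical, and the minimum is copied from the requirement given in this tutorial.

```python
from __future__ import annotations

import sys

MINIMUM = (3, 9)  # minimum Python version as stated in this tutorial


def meets_minimum(version: tuple, minimum: tuple = MINIMUM) -> bool:
    """Compare the (major, minor) part of a version tuple against the minimum."""
    return tuple(version[:2]) >= minimum


if meets_minimum(sys.version_info):
    print("Python version is new enough")
else:
    print("Python version is too old; create the virtual environment with a newer interpreter")
```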

Create a new project. Kedro uses starters (templates); the example below uses the standard tooling prompts.

kedro new --name price-prediction

You will be asked which tooling to include (linting, testing, logging, documentation, data structure). This tutorial assumes a project named price-prediction that builds a house price regression model.
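
The project name contains a hyphen, which is not legal in a Python import, so the generated package under src/ is named with an underscore instead. The sketch below approximates that normalisation; the helper name to_package_name is hypothetical, and the exact rule Kedro applies may differ between versions.

```python
def to_package_name(project_name: str) -> str:
    """Turn a human-readable project name into an importable package name."""
    return project_name.strip().replace(" ", "_").replace("-", "_").lower()


package = to_package_name("price-prediction")
print(package)                 # price_prediction
print(package.isidentifier())  # True
```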

The Standard Directory Structure

price-prediction/
├── conf/
│   ├── base/
│   │   ├── catalog.yml          # dataset definitions
│   │   ├── parameters.yml       # pipeline parameters
│   │   └── logging.yml
│   └── local/
│       └── credentials.yml      # secrets, never committed
├── data/
│   ├── 01_raw/                  # immutable source data
│   ├── 02_intermediate/         # typed, cleaned data
│   ├── 03_primary/
│   ├── 04_feature/
│   ├── 05_model_input/
│   ├── 06_models/
│   ├── 07_model_output/
│   └── 08_reporting/
├── notebooks/
├── src/
│   └── price_prediction/
│       ├── pipelines/
│       ├── pipeline_registry.py
│       └── settings.py
├── pyproject.toml
└── requirements.txt

The layered data/ folders express a data-engineering convention: raw data is immutable, and each subsequent layer is derived from the previous one. You are free to use these layers as labels; Kedro does not force a fixed meaning, but the convention makes lineage obvious at a glance.
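
Because the layers are a convention rather than something Kedro enforces, a team may choose to check it themselves. The sketch below assumes the folders are named 01_raw through 08_reporting and tests whether a step writes to the same or a later layer than it reads from. The helpers layer_index and flows_forward, and the example file names, are hypothetical.

```python
from pathlib import PurePosixPath

# Layer folder names in derivation order (assumed naming).
LAYERS = [
    "01_raw", "02_intermediate", "03_primary", "04_feature",
    "05_model_input", "06_models", "07_model_output", "08_reporting",
]


def layer_index(path: str) -> int:
    """Return the position of the first layer folder found in a data/ path."""
    for part in PurePosixPath(path).parts:
        if part in LAYERS:
            return LAYERS.index(part)
    raise ValueError(f"no known layer in {path!r}")


def flows_forward(source: str, target: str) -> bool:
    """True when target sits in the same or a later layer than source."""
    return layer_index(source) <= layer_index(target)


print(flows_forward("data/01_raw/houses.csv", "data/02_intermediate/houses.parquet"))  # True
print(flows_forward("data/05_model_input/table.parquet", "data/01_raw/houses.csv"))    # False
```

A check like this would flag any step that writes back into 01_raw, which would break the rule that raw data is immutable.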
