Kedro¶

Both Kedro and Hamilton are Python tools that help define directed acyclic graphs (DAGs) of data transformations. While there’s overlap between the two in terms of features, we note two main differences:

  • Kedro is imperative and focuses on tasks; Hamilton is declarative and focuses on assets.

  • Kedro is heavier and comes with a project structure, YAML configs, and dataset definitions to manage; Hamilton is lighter to adopt and lets you progressively opt into the features you find valuable.

On this page, we’ll dive into these differences, compare features, and present some code snippets from both tools.

Note

See this GitHub repository to compare a full project using Kedro or Hamilton.

Imperative vs. Declarative¶

There are three steps to building and running a dataflow (a DAG, a data pipeline, etc.):

  1. Define transformation steps

  2. Assemble steps into a dataflow

  3. Execute the dataflow to produce data artifacts (tables, ML models, etc.)

1. Define steps¶

The imperative (Kedro) vs. declarative (Hamilton) distinction leads to significant differences in Steps 2 and 3 that will shape how you work with each tool. However, Step 1 remains similar; in fact, both tools use the term nodes to refer to steps.

Kedro (imperative)

# nodes.py
import pandas as pd

def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for companies."""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies

def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for shuttles."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles

def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame,
) -> pd.DataFrame:
    """Combines all data to create a model input table."""
    shuttles = shuttles.drop("id", axis=1)
    model_input_table = shuttles.merge(
        companies, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table

Hamilton (declarative)

# dataflow.py
import pandas as pd

def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"

def companies_preprocessed(companies: pd.DataFrame) -> pd.DataFrame:
    """Companies with added column `iata_approved`"""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies

def shuttles_preprocessed(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Shuttles with added columns `d_check_complete`
    and `moon_clearance_complete`."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles

def model_input_table(
    shuttles_preprocessed: pd.DataFrame,
    companies_preprocessed: pd.DataFrame,
) -> pd.DataFrame:
    """Table containing shuttles and companies data."""
    shuttles_preprocessed = shuttles_preprocessed.drop("id", axis=1)
    model_input_table = shuttles_preprocessed.merge(
        companies_preprocessed, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table

The function implementations are exactly the same. Yet, notice that the function names and docstrings were edited slightly. Imperative approaches like Kedro typically refer to steps as tasks and prefer verbs that describe “the action of the function”. Meanwhile, declarative approaches such as Hamilton describe steps as assets and use nouns that refer to “the value returned by the function”. This might appear superficial, but it relates to the differences in Steps 2 and 3.

2. Assemble dataflow¶

With Kedro, you need to take your functions from Step 1 and create node objects, specifying the node’s name, inputs, and outputs. Then, you create a pipeline from a set of nodes and Kedro assembles the nodes into a DAG. Imperative approaches need to specify how tasks (Kedro nodes) relate to each other.

With Hamilton, you pass the module containing all functions from Step 1 and let Hamilton create the nodes and the dataflow. This is possible because in declarative approaches like Hamilton, each function defines a transform and its dependencies on other functions. Notice how in Step 1, model_input_table() has the parameters shuttles_preprocessed and companies_preprocessed, which refer to other functions in the module. The function signatures contain all the information required to build the DAG.

Kedro (imperative)

# pipeline.py
from kedro.pipeline import Pipeline, node, pipeline
from nodes import (
    create_model_input_table,
    preprocess_companies,
    preprocess_shuttles
)

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=[
                    "preprocessed_shuttles",
                    "preprocessed_companies"
                ],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )

Hamilton (declarative)

# run.py
from hamilton import driver
import dataflow  # module containing node definitions

# pass the module to the `Builder` to create a `Driver`
dr = driver.Builder().with_modules(dataflow).build()

Benefits of adopting a declarative approach

  • Fewer errors, since you skip manual node creation (string references are prone to typos).

  • Handles complexity, since assembling a dataflow stays the same whether it has 10 or 1000 nodes (see the sketch below).

  • Maintainability improves since editing your functions (Step 1) modifies the structure of your DAG, removing the pipeline definition as a failure point.

  • Readability improves because you can understand how functions relate to each other without jumping between files.

These benefits of Hamilton encourage developers to write smaller functions that are easier to debug and maintain, leading to major code quality gains. By contrast, the burden of node and pipeline creation grows with project size, which leads users to stuff more and more logic into a single node, making it increasingly hard to maintain.
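
To make the complexity point concrete, here is a minimal sketch (the module names features, training, and evaluation are hypothetical) showing that assembling the dataflow stays a one-liner regardless of how many functions the modules contain:

# run.py
from hamilton import driver

import features    # hypothetical module of feature transforms
import training    # hypothetical module of model training nodes
import evaluation  # hypothetical module of evaluation nodes

# the Builder call is identical whether these modules define 10 or 1000 nodes
dr = driver.Builder().with_modules(features, training, evaluation).build()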

3. Execute dataflow¶

The primary way to execute Kedro pipelines is the command line tool: kedro run --pipeline=my_pipeline. Pipelines are typically designed to execute all nodes, reading data and writing results as they go. In spirit, this is closer to macro-orchestration frameworks like Airflow.

By contrast, Hamilton dataflows are primarily meant to be executed programmatically (i.e., via Python code) and return results in-memory. This makes it easy to use Hamilton within a FastAPI service or to power an LLM application.
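
For example, here is a minimal sketch of serving the dataflow from Step 1 behind a FastAPI endpoint (the route name and parquet paths are hypothetical):

# server.py
import pandas as pd
from fastapi import FastAPI
from hamilton import driver

import dataflow  # the module from Step 1

app = FastAPI()
# build the Driver once at startup and reuse it across requests
dr = driver.Builder().with_modules(dataflow).build()

@app.get("/model-input-table")
def get_model_input_table() -> list:
    inputs = dict(
        companies=pd.read_parquet("path/to/companies.parquet"),
        shuttles=pd.read_parquet("path/to/shuttles.parquet"),
    )
    results = dr.execute(["model_input_table"], inputs=inputs)
    # return a JSON-serializable payload
    return results["model_input_table"].to_dict(orient="records")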

For comparable side-by-side code, we can dig into Kedro’s internals and use the SequentialRunner programmatically. To return pipeline results in-memory, we would need to hack further with kedro.io.MemoryDataset.

Note

Hamilton also has rich support for I/O operations (see the Feature comparison below).

Kedro (imperative)

# run.py
from kedro.runner import SequentialRunner
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from pipeline import create_pipeline
# ^ from Step 2

bootstrap_project(".")
with KedroSession.create() as session:
    context = session.load_context()
    catalog = context.catalog

pipeline = create_pipeline().to_nodes("create_model_input_table_node")
SequentialRunner().run(pipeline, catalog)
# doesn't return values in-memory

Hamilton (declarative)

# run.py
import pandas as pd
from hamilton import driver
import dataflow

dr = driver.Builder().with_modules(dataflow).build()
# ^ from Step 2
inputs = dict(
    companies=pd.read_parquet("path/to/companies.parquet"),
    shuttles=pd.read_parquet("path/to/shuttles.parquet"),
)
results = dr.execute(["model_input_table"], inputs=inputs)
# results is a dict {"model_input_table": VALUE}

An imperative pipeline like Kedro’s is a series of steps, just like a recipe. The user can specify “from nodes” or “to nodes” to slice the pipeline and avoid executing it in full.

With declarative dataflows like Hamilton’s, you request assets/nodes by name (here "model_input_table") and the tool determines the required nodes to execute, avoiding wasteful compute.
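
For instance, a small sketch reusing the Driver built above: requesting only shuttles_preprocessed executes just that branch of the DAG, so the companies data isn’t even needed as an input.

# only the shuttles branch runs; `companies_preprocessed` and
# `model_input_table` are never computed
results = dr.execute(
    ["shuttles_preprocessed"],
    inputs=dict(shuttles=pd.read_parquet("path/to/shuttles.parquet")),
)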

The simple Python interface provided by Hamilton lets you define and execute your dataflow from a single file, which is great for kickstarting an analysis or project. Just run python dataflow.py to execute it!

# dataflow.py
import pandas as pd

def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for companies."""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies

def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for shuttles."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles

def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame,
) -> pd.DataFrame:
    """Combines all data to create a model input table."""
    shuttles = shuttles.drop("id", axis=1)
    model_input_table = shuttles.merge(
        companies, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table

if __name__ == "__main__":
    from hamilton import driver
    import dataflow  # import itself as a module

    dr = driver.Builder().with_modules(dataflow).build()
    inputs = dict(
        companies=pd.read_parquet("path/to/companies.parquet"),
        shuttles=pd.read_parquet("path/to/shuttles.parquet"),
    )
    results = dr.execute(["model_input_table"], inputs=inputs)

Framework weight¶

After imperative vs. declarative, the next largest difference is the type of user experience each tool provides. Kedro is a more opinionated and heavier framework; Hamilton sits at the opposite end of the spectrum and tries to be as light a library as possible. This changes the learning curve, the adoption effort, and how each tool will integrate with your stack.

Kedro¶

Kedro is opinionated and provides clear guardrails on how to do things. To begin using it, you’ll need to learn how to:

  • Define nodes and register pipelines

  • Register datasets using the data catalog construct

  • Pass parameters to data runs

  • Configure environment variables and credentials

  • Navigate the project structure

This provides guidance when building your first data pipeline, but it’s also a lot to take in at once. As you’ll see in the project comparison on GitHub, Kedro involves more files, making projects harder to navigate. It also relies on YAML, which is generally seen as an unreliable format. And if you have an existing data stack or favorite library, it might clash with Kedro’s way of doing things (e.g., you already have a credentials management tool, or you prefer Hydra for configs).

Hamilton¶

Hamilton attempts to get you started quickly. In fact, this page has pretty much covered what you need to know:

  • Define nodes and a dataflow using regular Python functions (no need to even import hamilton!)

  • Build a Driver with your dataflow module and call .execute() to get results

Hamilton allows you to start light and opt into features as your project’s requirements evolve (data validation, scaling compute, testing, etc.). Python is a powerful language with rich editor support and tooling, which is why Hamilton advocates for “everything in Python” instead of external configs in YAML or JSON. For example, parameters, data assets, and configurations can very well live as dataclasses within a .py file. Hamilton was also built with an extensive plugin system: there are many extensions, some contributed by users, to adapt Hamilton to your project, and it’s easy to write your own for further customization.
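
As a small illustrative sketch of the “everything in Python” idea (the class and field names are hypothetical), a config can simply be a dataclass whose values are passed to the dataflow like any other input:

# config.py
from dataclasses import dataclass

@dataclass
class ModelInputConfig:
    """Parameters that would otherwise live in a YAML file."""
    drop_na: bool = True
    join_key: str = "company_id"

# plain Python values get autocomplete, type checks, and refactoring support
config = ModelInputConfig()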

In fact, Hamilton is so lightweight, you could even run it inside Kedro!
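
Here is a hedged sketch of what that could look like: a regular Kedro node whose body delegates the transformations to a Hamilton Driver (the wrapping function is hypothetical).

# nodes.py
import pandas as pd
from hamilton import driver

import dataflow  # the Hamilton module from Step 1

def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame
) -> pd.DataFrame:
    """Kedro node that runs a Hamilton dataflow internally."""
    dr = driver.Builder().with_modules(dataflow).build()
    results = dr.execute(
        ["model_input_table"],
        inputs=dict(shuttles=shuttles, companies=companies),
    )
    return results["model_input_table"]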

Feature comparison¶

| Trait | Kedro | Hamilton |
| --- | --- | --- |
| Focuses on | Tasks (imperative) | Assets (declarative) |
| Code structure | Opinionated. Makes assumptions about pipeline creation & registration and configuration. | Unopinionated. |
| In-memory execution | Execute using a KedroSession, but returning values in-memory is hacky. | Default |
| I/O execution | Datasets and Data Catalog | Data Savers & Loaders |
| Expressive DAG definition | ⛔ | Function modifiers |
| Column-level transformations | ⛔ | ✅ |
| LLM applications | ⛔ Limited by in-memory execution and return values. | ✅ Declarative in-memory API makes it easy (RAG app). |
| Static DAG visualizations | Need Kedro Viz installed to export static visualizations. | Visualize the entire dataflow, the execution path, what’s upstream of a node, etc., directly in a notebook or output to a file (.png, .svg, etc.). Single dependency is graphviz. |
| Interactive DAG viewer | Kedro Viz | Hamilton UI |
| Data validation | Community Pandera plugin | Native and Pandera plugin |
| Executors | Sequential, multiprocessing, multi-threading | Sequential, async, multiprocessing, multi-threading |
| Executor extension | Spark integration | PySpark, Dask, Ray, Modal |
| Dynamic branching | ⛔ | Parallelizable/Collect for easy parallelization |
| Command line tool (CLI) | ✅ | ✅ |
| Node and pipeline testing | ✅ | ✅ |
| Jupyter notebook extensions | ✅ | ✅ |
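
To illustrate the static visualization row, here is a small sketch (assuming the graphviz package is installed) using the Driver built earlier:

# render the whole dataflow, or only the path needed for a given output
dr.display_all_functions("full_dataflow.png")
dr.visualize_execution(["model_input_table"], "execution_path.png", {})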

Both Kedro and Hamilton provide applications to view dataflows/pipelines and interact with their results. Here, Kedro provides a lighter webserver and UI, while Hamilton offers a production-ready containerized application.

| Trait | Kedro Viz | Hamilton UI |
| --- | --- | --- |
| Interactive dataflow viewer | ✅ | ✅ |
| View code definition of nodes | ✅ | ✅ |
| Code versioning | Git SHA (may be out of sync with actual code) | Node-level versioning at runtime |
| Collapsible view | ✅ | ✅ |
| Tag nodes | ✅ | ✅ |
| Execution observability | ⛔ | ✅ |
| Artifact lineage and versioning | ⛔ | ✅ |
| Column-level lineage | ⛔ | ✅ |
| Compare run results | ✅ | ✅ |
| Rich artifact view | Preview of 5 dataframe rows. Metadata about the artifact (column count, row count, size). | Automatic statistical profiling of various dataframe libraries. |

More information¶

For a full side-by-side example of Kedro and Hamilton, visit this GitHub repository.

For more questions, join our Slack Channel.