Kedro¶
Both Kedro and Hamilton are Python tools that help define directed acyclic graphs (DAGs) of data transformations. While there’s overlap between the two in terms of features, we note two main differences:

- Kedro is imperative and focuses on tasks; Hamilton is declarative and focuses on assets.
- Kedro is heavier and comes with a project structure, YAML configs, and dataset definitions to manage; Hamilton is lighter to adopt, and you can progressively opt into the features you find valuable.
On this page, we’ll dive into these differences, compare features, and present some code snippets from both tools.
Note
See this GitHub repository to compare a full project using Kedro or Hamilton.
Imperative vs. Declarative¶
There are three steps to building and running a dataflow (a DAG, a data pipeline, etc.):

1. Define transformation steps
2. Assemble steps into a dataflow
3. Execute the dataflow to produce data artifacts (tables, ML models, etc.)
1. Define steps¶
Imperative (Kedro) vs. declarative (Hamilton) leads to significant differences in Step 2 and Step 3 that will shape how you work with the tool. However, Step 1 remains similar. In fact, both tools use the term nodes to refer to steps.
**Kedro (imperative)**

```python
# nodes.py
import pandas as pd


def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for companies."""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies


def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for shuttles."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles


def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame,
) -> pd.DataFrame:
    """Combines all data to create a model input table."""
    shuttles = shuttles.drop("id", axis=1)
    model_input_table = shuttles.merge(
        companies, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table
```

**Hamilton (declarative)**

```python
# dataflow.py
import pandas as pd


def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"


def companies_preprocessed(companies: pd.DataFrame) -> pd.DataFrame:
    """Companies with added column `iata_approved`."""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies


def shuttles_preprocessed(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Shuttles with added columns `d_check_complete`
    and `moon_clearance_complete`."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles


def model_input_table(
    shuttles_preprocessed: pd.DataFrame,
    companies_preprocessed: pd.DataFrame,
) -> pd.DataFrame:
    """Table containing shuttles and companies data."""
    shuttles_preprocessed = shuttles_preprocessed.drop("id", axis=1)
    model_input_table = shuttles_preprocessed.merge(
        companies_preprocessed, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table
```
The function implementations are exactly the same. Yet, notice that the function names and docstrings were edited slightly. Imperative approaches like Kedro
typically refer to steps as tasks and prefer verbs to describe “the action of the function”. Meanwhile, declarative approaches such as Hamilton
describe steps as assets and use nouns to refer to “the value returned by the function”. This might appear superficial, but it relates to the differences in Steps 2 and 3.
2. Assemble dataflow¶
With Kedro, you need to take your functions from Step 1 and create node objects, specifying the node’s name, inputs, and outputs. Then, you create a pipeline from a set of nodes and Kedro assembles the nodes into a DAG. Imperative approaches need to specify how tasks (Kedro nodes) relate to each other.

With Hamilton, you pass the module containing all functions from Step 1 and let Hamilton create the nodes and the dataflow. This is possible because in declarative approaches like Hamilton, each function defines a transform and its dependencies on other functions. Notice how in Step 1, model_input_table() has parameters shuttles_preprocessed and companies_preprocessed, which refer to other functions in the module. This contains all the information required to build the DAG.
**Kedro (imperative)**

```python
# pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from nodes import (
    create_model_input_table,
    preprocess_companies,
    preprocess_shuttles,
)


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=[
                    "preprocessed_shuttles",
                    "preprocessed_companies",
                ],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```

**Hamilton (declarative)**

```python
# run.py
from hamilton import driver

import dataflow  # module containing node definitions

# pass the module to the `Builder` to create a `Driver`
dr = driver.Builder().with_modules(dataflow).build()
```
Benefits of adopting a declarative approach
- Fewer errors, since you skip manual node creation (string references invite typos).
- Better handling of complexity, since assembling a dataflow works the same way for 10 or 1,000 nodes.
- Better maintainability, since editing your functions (Step 1) directly updates the structure of your DAG, removing the pipeline definition as a failure point (see the sketch below).
- Better readability, because you can understand how functions relate to each other without jumping between files.
These benefits of Hamilton encourage developers to write smaller functions that are easier to debug and maintain, leading to major code quality gains. By contrast, as projects grow, the burden of node and pipeline creation leads users to stuff more and more logic into a single node, making it increasingly hard to maintain.
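For example, here is a minimal sketch of the maintainability benefit: adding a function to the module is all it takes to add a node to the DAG, with no pipeline definition to update (`companies_count` is a hypothetical, illustrative addition, not part of the original project).

```python
# dataflow.py (appended; `companies_count` is a hypothetical, illustrative node)
import pandas as pd


def companies_count(companies_preprocessed: pd.DataFrame) -> int:
    """Number of companies remaining after preprocessing."""
    return len(companies_preprocessed)
```

The next time the module is passed to the Builder, companies_count is simply available to request by name.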
3. Execute dataflow¶
The primary way to execute Kedro pipelines is the command line tool, with kedro run --pipeline=my_pipeline. Pipelines are typically designed to execute all of their nodes, reading data and writing results along the way. In spirit, this is closer to macro-orchestration frameworks like Airflow.
By contrast, Hamilton dataflows are primarily meant to be executed programmatically (i.e., via Python code) and return results in memory. This makes it easy to use Hamilton within a FastAPI service or to power an LLM application.
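For instance, here is a minimal sketch of serving the dataflow from a FastAPI endpoint; the route, file paths, and response shape are illustrative assumptions, not from the original project.

```python
# app.py -- a sketch, assuming FastAPI is installed and dataflow.py is importable
import pandas as pd
from fastapi import FastAPI
from hamilton import driver

import dataflow

app = FastAPI()
dr = driver.Builder().with_modules(dataflow).build()  # build the Driver once


@app.post("/model-input-table")
def build_model_input_table() -> dict:
    # load inputs, execute the dataflow in memory, and return a small summary
    inputs = dict(
        companies=pd.read_parquet("path/to/companies.parquet"),
        shuttles=pd.read_parquet("path/to/shuttles.parquet"),
    )
    results = dr.execute(["model_input_table"], inputs=inputs)
    return {"rows": len(results["model_input_table"])}
```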
For comparable side-by-side code, we can dig into Kedro
and use the SequentialRunner
programmatically. To return pipeline results in memory, we would need to hack further with kedro.io.MemoryDataset.
Note
Hamilton also has rich support for I/O operations (see Feature comparison below)
**Kedro (imperative)**

```python
# run.py
from kedro.runner import SequentialRunner
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

from pipeline import create_pipeline  # from Step 2

bootstrap_project(".")
with KedroSession.create() as session:
    context = session.load_context()
    catalog = context.catalog
    pipeline = create_pipeline().to_nodes("create_model_input_table_node")
    SequentialRunner().run(pipeline, catalog)
    # doesn't return values in-memory
```

**Hamilton (declarative)**

```python
# run.py
import pandas as pd
from hamilton import driver

import dataflow

dr = driver.Builder().with_modules(dataflow).build()  # from Step 2

inputs = dict(
    companies=pd.read_parquet("path/to/companies.parquet"),
    shuttles=pd.read_parquet("path/to/shuttles.parquet"),
)
results = dr.execute(["model_input_table"], inputs=inputs)
# results is a dict {"model_input_table": VALUE}
```
An imperative pipeline, like Kedro’s, is a series of steps, just like a recipe. The user can specify “from nodes” or “to nodes” to slice the pipeline and avoid executing it in full.

For declarative dataflows, like Hamilton’s, you request assets/nodes by name and the tool determines the nodes required to produce them (here, "model_input_table"), avoiding wasteful compute.
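As a small sketch of this behavior (reusing the dataflow module from above), requesting only shuttles_preprocessed executes just that node and the input it depends on; model_input_table is never computed.

```python
import pandas as pd
from hamilton import driver

import dataflow

dr = driver.Builder().with_modules(dataflow).build()
# only `shuttles_preprocessed` (and the `shuttles` input it depends on) will run;
# `model_input_table` is never computed
results = dr.execute(
    ["shuttles_preprocessed"],
    inputs=dict(shuttles=pd.read_parquet("path/to/shuttles.parquet")),
)
```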
The simple Python interface provided by Hamilton allows you to define and execute your dataflow from a single file, which is great for kickstarting an analysis or project. Just run python dataflow.py to execute it!
```python
# dataflow.py
import pandas as pd


def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"


def companies_preprocessed(companies: pd.DataFrame) -> pd.DataFrame:
    """Companies with added column `iata_approved`."""
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    return companies


def shuttles_preprocessed(shuttles: pd.DataFrame) -> pd.DataFrame:
    """Shuttles with added columns `d_check_complete`
    and `moon_clearance_complete`."""
    shuttles["d_check_complete"] = _is_true(
        shuttles["d_check_complete"]
    )
    shuttles["moon_clearance_complete"] = _is_true(
        shuttles["moon_clearance_complete"]
    )
    return shuttles


def model_input_table(
    shuttles_preprocessed: pd.DataFrame,
    companies_preprocessed: pd.DataFrame,
) -> pd.DataFrame:
    """Table containing shuttles and companies data."""
    shuttles_preprocessed = shuttles_preprocessed.drop("id", axis=1)
    model_input_table = shuttles_preprocessed.merge(
        companies_preprocessed, left_on="company_id", right_on="id"
    )
    model_input_table = model_input_table.dropna()
    return model_input_table


if __name__ == "__main__":
    from hamilton import driver

    import dataflow  # import itself as a module

    dr = driver.Builder().with_modules(dataflow).build()
    inputs = dict(
        companies=pd.read_parquet("path/to/companies.parquet"),
        shuttles=pd.read_parquet("path/to/shuttles.parquet"),
    )
    results = dr.execute(["model_input_table"], inputs=inputs)
```
Framework weight¶
After imperative vs. declarative, the next biggest difference is the type of user experience each tool provides. Kedro is a more opinionated and heavier framework; Hamilton sits at the opposite end of the spectrum and tries to be the lightest library possible. This changes the learning curve, adoption, and how each tool will integrate with your stack.
Kedro¶
Kedro is opinionated and provides clear guardrails on how to do things. To begin using it, you’ll need to learn to:

- Define nodes and register pipelines
- Register datasets using the data catalog construct
- Pass parameters to data runs
- Configure environment variables and credentials
- Navigate the project structure
This provides guidance when building your first data pipeline, but it’s also a lot to take in at once. As you’ll see in the project comparison on GitHub, Kedro involves more files, making it harder to navigate. It also relies heavily on YAML, which is generally seen as an unreliable format. If you have an existing data stack or favorite library, it might clash with Kedro’s way of doing things (e.g., you already have a credentials management tool, or you prefer Hydra for configs).
Hamilton¶
Hamilton
attempts to get you started quickly. In fact, this page pretty much covers what you need to know:

- Define nodes and a dataflow using regular Python functions (no need to even import hamilton!)
- Build a Driver with your dataflow module and call .execute() to get results
Hamilton
allows you to start light and opt into features as your project’s requirements evolve (data validation, scaling compute, testing, etc.). Python is a powerful language with rich editor support and tooling, which is why Hamilton advocates for “everything in Python” instead of external configs in YAML or JSON. For example, parameters, data assets, and configurations can very much live as dataclasses within a .py file. Hamilton was also built with an extensive plugin system: there are many extensions, some contributed by users, to adapt Hamilton to your project, and it’s easy to write your own for further customization.
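As a minimal sketch of the “everything in Python” idea, configuration can live in a dataclass and feed the Driver directly; the Settings class and its fields below are illustrative assumptions, not part of the original project.

```python
# settings_run.py -- a hypothetical single-file sketch; the Settings dataclass
# and its fields are illustrative, not part of the original project
from dataclasses import dataclass

import pandas as pd
from hamilton import driver

import dataflow


@dataclass
class Settings:
    """Configuration kept in plain Python instead of YAML."""
    companies_path: str = "path/to/companies.parquet"
    shuttles_path: str = "path/to/shuttles.parquet"


settings = Settings()
dr = driver.Builder().with_modules(dataflow).build()
results = dr.execute(
    ["model_input_table"],
    inputs=dict(
        companies=pd.read_parquet(settings.companies_path),
        shuttles=pd.read_parquet(settings.shuttles_path),
    ),
)
```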
In fact, Hamilton is so lightweight, you could even run it inside Kedro!
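As a minimal illustration (a sketch only; the node function below is hypothetical, not from the original project), a Kedro node’s body can simply delegate to a Hamilton dataflow:

```python
# a hypothetical Kedro node whose implementation delegates to Hamilton
import pandas as pd
from hamilton import driver

import dataflow  # the Hamilton module from Step 1


def create_model_input_table(
    companies: pd.DataFrame, shuttles: pd.DataFrame
) -> pd.DataFrame:
    """Kedro node: builds the model input table by executing the Hamilton dataflow."""
    dr = driver.Builder().with_modules(dataflow).build()
    results = dr.execute(
        ["model_input_table"],
        inputs=dict(companies=companies, shuttles=shuttles),
    )
    return results["model_input_table"]
```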
Feature comparison¶
| Trait | Kedro | Hamilton |
|---|---|---|
| Focuses on | Tasks (imperative) | Assets (declarative) |
| Code structure | Opinionated. Makes assumptions about pipeline creation & registration and configuration. | Unopinionated. |
| In-memory execution | Execute using a KedroSession, but returning values in-memory is hacky. | Default |
| I/O execution | ✅ Data Catalog | ✅ Data savers & loaders (materializers) |
| Expressive DAG definition | ⛔ | ✅ Function modifiers (decorators) |
| Column-level transformations | ⛔ | ✅ |
| LLM applications | ⛔ Limited by in-memory execution and return values. | ✅ Declarative, in-memory API makes it easy (e.g., a RAG app). |
| Static DAG visualizations | Needs Kedro-Viz | Visualize the entire dataflow, an execution path, what’s upstream of a node, etc., directly in a notebook or output to a file. |
| Interactive DAG viewer | ✅ Kedro-Viz | ✅ Hamilton UI |
| Data validation | ✅ | ✅ |
| Executors | Sequential, async, multiprocessing, multi-threading | Sequential, async, multiprocessing, multi-threading |
| Executor extension | PySpark, Dask, Ray, Modal | Ray, Dask, PySpark |
| Dynamic branching | ⛔ | ✅ Parallelizable/Collect for easy parallelization. |
| Command line tool (CLI) | ✅ | ✅ |
| Node and pipeline testing | ✅ | ✅ |
| Jupyter notebook extensions | ✅ | ✅ |
Both Kedro
and Hamilton
provide applications to view dataflows/pipelines and interact with their results. Here, Kedro
provides a lighter webserver and UI, while Hamilton
offers a production-ready containerized application.
| Trait | Kedro Viz | Hamilton UI |
|---|---|---|
| Interactive dataflow viewer | ✅ | ✅ |
| View code definition of nodes | ✅ | ✅ |
| Code versioning | Git SHA (may be out of sync with actual code) | Node-level versioning at runtime |
| Collapsible view | ✅ | ✅ |
| Tag nodes | ✅ | ✅ |
| Execution observability | ⛔ | ✅ |
| Artifact lineage and versioning | ⛔ | ✅ |
| Column-level lineage | ⛔ | ✅ |
| Compare run results | ✅ | ✅ |
| Rich artifact view | Preview 5 dataframe rows. Metadata about artifact (column count, row count, size). | Automatic statistical profiling of various dataframe libraries. |
More information¶
For a full side-by-side example of Kedro and Hamilton, visit this GitHub repository
For more questions, join our Slack Channel