Functions, nodes & dataflow#

On this page, you’ll learn how Hamilton converts your Python functions into nodes and then creates a dataflow.

Functions#

Hamilton requires you to write your code using functions. To get started, you simply need to:

  • Annotate the type of the function parameters and return value.

  • Specify the function’s dependencies through its parameter names.

  • Store your code in Python modules (.py files).

Since your code doesn’t depend on special “Hamilton code”, it can be reused any other way you want!

Specifying dependencies#

In Hamilton, you define dependencies by matching parameter names with the names of other functions. Below, the function name and return type in A() -> int match the parameter A: int found in functions B() and C().

def A() -> int:
    """Constant value 35"""
    return 35

def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3

def C(A: int, B: float) -> float:
    """Square A and multiply by B"""
    return A**2 * B

../../_images/abc_basic.png

The figure shows how Hamilton automatically assembled the functions A(), B(), and C().
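To make the name-matching rule concrete, here is a minimal sketch, using only the standard library, of how parameter names could be resolved to functions of the same name. The dependencies() and compute() helpers are hypothetical and written for illustration only; they are not Hamilton's API, and Hamilton's actual implementation may work differently.

```python
import inspect


def A() -> int:
    """Constant value 35"""
    return 35


def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3


def C(A: int, B: float) -> float:
    """Square A and multiply by B"""
    return A**2 * B


# Index the functions by name, since names are what parameters are matched against.
functions = {fn.__name__: fn for fn in (A, B, C)}


def dependencies(fn):
    """Hypothetical helper: a function's dependencies are its parameter names."""
    return list(inspect.signature(fn).parameters)


def compute(name, cache=None):
    """Hypothetical helper: compute a value by first computing its dependencies."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = functions[name]
        kwargs = {dep: compute(dep, cache) for dep in dependencies(fn)}
        cache[name] = fn(**kwargs)
    return cache[name]


result = compute("C")
```

Calling compute("C") first computes A, then B, and finally passes both values to C().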

Helper function#

You can prefix a function name with an underscore (_) to prevent it from being included in a dataflow. Below, A() and B() are part of the dataflow, but _round_three_decimals() isn’t.

def _round_three_decimals(value: float) -> float:
    """Round value to 3 decimal places"""
    return round(value, 3)

def A(external_input: int) -> int:
    """Modulo 3 of input value"""
    return external_input % 3

def B(A: int) -> float:
    """Divide A by 3"""
    b = A / 3
    return _round_three_decimals(b)
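
The underscore convention can be pictured as a simple filter over the functions found in a module. The sketch below is an illustration under that assumption: node_candidates() is a hypothetical helper, not part of Hamilton, and a plain dictionary stands in for a module's namespace.

```python
import inspect
from typing import Callable, Dict, List


def _round_three_decimals(value: float) -> float:
    """Round value to 3 decimal places"""
    return round(value, 3)


def A(external_input: int) -> int:
    """Modulo 3 of input value"""
    return external_input % 3


def B(A: int) -> float:
    """Divide A by 3"""
    b = A / 3
    return _round_three_decimals(b)


def node_candidates(namespace: Dict[str, Callable]) -> List[str]:
    """Hypothetical helper: keep functions whose name has no leading underscore."""
    return sorted(
        name
        for name, obj in namespace.items()
        if inspect.isfunction(obj) and not name.startswith("_")
    )


# A plain dict stands in for the contents of a Python module.
example_namespace = {fn.__name__: fn for fn in (_round_three_decimals, A, B)}
candidates = node_candidates(example_namespace)

# external_input matches no function here, so the sketch supplies a value by hand.
b_value = B(A(7))
```

Here candidates contains only A and B, while _round_three_decimals() remains an ordinary Python function that B() calls internally.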

Function naming tips#

Hamilton strongly agrees with the Zen of Python #2: “Explicit is better than implicit”. Meaningful function names help document what functions do, so don’t shy away from longer names. If you were to come across a function named life_time_value versus ltv versus l_t_v, which one is most obvious? Remember your code usually lives a lot longer than you ever think it will.

Unlike the common practice of including meaningful verbs in function names (e.g., get_credentials(), statistical_test()), with Hamilton, the function name should more closely align with nouns. That’s because the function name determines the node name and how data will be queried. Therefore, names that describe the node result rather than its action may be more readable (e.g., credentials(), statistical_results()).
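
As a small illustration of noun-style naming, the sketch below uses hypothetical functions (spend, signups, spend_per_signup) with made-up constant values. The results are stored in a plain dictionary keyed by function name, which only mimics the idea that a node's name is what you ask for.

```python
def spend() -> float:
    """Total marketing spend (made-up constant for illustration)"""
    return 120.0


def signups() -> int:
    """Number of signups (made-up constant for illustration)"""
    return 8


def spend_per_signup(spend: float, signups: int) -> float:
    """Marketing spend divided by number of signups"""
    return spend / signups


# Keys are function names, so noun-style names read naturally as results.
results = {
    "spend_per_signup": spend_per_signup(spend=spend(), signups=signups()),
}
```

A key such as "spend_per_signup" describes the value it holds, whereas a verb-style name like "compute_spend_per_signup" would describe an action.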

Nodes#

A node is a single “step” in a dataflow. Hamilton users write Python functions, and Hamilton converts them into nodes. Users never create nodes directly.

Anatomy of a node#

The following figure and table detail how a Python function maps to a Hamilton node.

../../_images/function_anatomy.png

id  Function components                       Node components
1   Function name and return type annotation  Node name and type
2   Parameter(s) name and type annotation     Node dependencies
3   Docstring                                 Description of the node return value
4   Function body                             Implementation of the node

Since functions almost always map 1-to-1 to nodes, the two terms are used interchangeably. However, there are exceptions that we’ll discuss later in this guide.
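
The mapping in the table can be sketched with the standard library's introspection tools. NodeSketch and node_from_function() below are hypothetical names chosen for this illustration; they are not Hamilton classes, and the sketch assumes every parameter and the return value are annotated.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Callable, Dict, get_type_hints


@dataclass
class NodeSketch:
    """Hypothetical container mirroring the four rows of the table."""
    name: str                      # 1. function name
    return_type: Any               # 1. return type annotation
    dependencies: Dict[str, Any]   # 2. parameter names and type annotations
    description: str               # 3. docstring
    implementation: Callable       # 4. function body


def node_from_function(fn: Callable) -> NodeSketch:
    """Read the node components off a fully annotated function."""
    hints = get_type_hints(fn)
    return_type = hints.pop("return")  # raises KeyError if the return annotation is missing
    return NodeSketch(
        name=fn.__name__,
        return_type=return_type,
        dependencies=hints,
        description=inspect.getdoc(fn) or "",
        implementation=fn,
    )


def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3


node = node_from_function(B)
```

For B(), this yields a node named "B" of type float, with a single dependency A of type int.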

Dataflow#

From a collection of nodes, Hamilton automatically assembles the dataflow. For each node, Hamilton creates an edge between the node and each of its dependencies, resulting in a dataflow (or a graph, in more mathematical terms).
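
The assembly step described above can be sketched as follows. This is an illustration, not Hamilton's implementation: it derives edges from parameter names and uses graphlib.TopologicalSorter from the standard library (Python 3.9+) to find a valid execution order.

```python
import inspect
from graphlib import TopologicalSorter


def A() -> int:
    """Constant value 35"""
    return 35


def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3


def C(A: int, B: float) -> float:
    """Square A and multiply by B"""
    return A**2 * B


functions = {fn.__name__: fn for fn in (A, B, C)}

# Map each node to the set of nodes it depends on (its parameter names).
graph = {
    name: set(inspect.signature(fn).parameters)
    for name, fn in functions.items()
}

# Each edge points from a dependency to the node that consumes it.
edges = sorted((dep, name) for name, deps in graph.items() for dep in deps)

# An order in which every node comes after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
```

For these three functions, the edges are A to B, A to C, and B to C, and the only valid order is A, B, C.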

From the user’s perspective, you just have to give Hamilton a Python module containing your functions, and it generates your dataflow! This is a key difference from popular orchestration / pipeline / workflow frameworks (Airflow, Kedro, Prefect, VertexAI, SageMaker, etc.).

How other frameworks build graphs#

In most frameworks, you first define steps / tasks / components. Then, you need to create your dataflow by explicitly specifying the relationship between each node.
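
The explicit style can be sketched in plain Python. The example below is deliberately generic: step_a(), step_b(), explicit_edges, and run_pipeline() are hypothetical and do not reproduce the API of any framework named above.

```python
from typing import Callable, Dict, List, Tuple


# Step definitions: nothing here says how the steps connect.
def step_a() -> int:
    return 35


def step_b(value: int) -> float:
    return value / 3


steps: Dict[str, Callable] = {"step_a": step_a, "step_b": step_b}

# Wiring, often kept in a separate file: each relationship is spelled out by hand.
explicit_edges: List[Tuple[str, str]] = [("step_a", "step_b")]


def run_pipeline(steps, edges, order):
    """Hypothetical runner: feed each step the result of its single upstream step."""
    upstream = {downstream: up for up, downstream in edges}
    results = {}
    for name in order:
        if name in upstream:
            results[name] = steps[name](results[upstream[name]])
        else:
            results[name] = steps[name]()
    return results


results = run_pipeline(steps, explicit_edges, order=["step_a", "step_b"])
```

Note that the parameter name value in step_b() carries no information about where its input comes from; that knowledge lives only in explicit_edges.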

Readability#

In that case, the code for step A doesn’t tell you how it relates to step B or to the broader dataflow. Hamilton solves this problem by tying function, node, and dataflow definitions together in a single place. The ratio of reading to writing code can be as high as 10:1, especially for complex dataflows, so optimizing for readability pays off.

Maintainability#

Typically, editing a dataflow (new feature, debugging, etc.) alters both what a node does and how the dataflow is structured. Consequently, changes to step A require you to manually make consistent edits to the dataflow definition, which likely lives in another file. In enterprise settings, it can become difficult to discover and track every place step A is used (potentially tens or hundreds of pipelines), increasing the likelihood of breaking changes. Hamilton avoids this problem entirely because changes to the node definitions, and thus to the dataflow, propagate to every place the code is used. This makes code changes easier, which improves maintainability and development speed.

Recap#

  • Users write Python functions in modules, with proper naming and typing

  • Helper functions use an underscore prefix (e.g., _helper())

  • Hamilton converts functions into nodes

  • Hamilton automatically assembles nodes into a dataflow

Next step#

So far, we learned how to write Hamilton code for our dataflow. Next, we’ll explore how we can effectively:

  1. Convert a Python module into a dataflow

  2. Visualize a dataflow

  3. Execute a dataflow

  4. Gather and store results of a dataflow