Pipelines

A pipeline is an acyclic directed graph, a data flow graph describing how a set of inputs is processed to produce outputs. Processing occurs at nodes in the graph, data flows along the edges. The simplest nodes are atomic steps: they take input, perform some operation, and produce output. From the perspective of the pipeline, these are black boxes.

JAFPL also provides several mechanisms for constructing compound steps from other steps. These “containers” then become nodes in the graph that contains them.

There are only a few inviolable constraints on the graph:

No loops. It’s a DAG.
Edges can’t cross from inside a container to outside the container (the children of a container are invisible to the siblings of the container). Crossing the other direction is allowed, the children can see their “aunts” and “uncles”.
If steps declare required inputs, those must have edges connected to them.
Edges can only connect nodes in the same graph.

Every output port in JAFPL is read by exactly one step and exactly one step writes to every input port. If the graph described by your edges multiplexes access to any port, a “splitter” or “joiner” step will be inserted automatically as appropriate.

In other words, this graph never happens:

If you attempt to construct that graph, the library will automatically restructure it so that this graph is constructed:

You must provide an implementation class for every atomic node that you construct. That’s the class that will be called upon to perform the work of the step.

The graph is executed with the Akka framework. Actors are created to manage each node. With a few exceptions (conditionals and “catch” containers, for example), all steps “start” at the same time and execute in parallel (to the extent that the graph allows for parallelism, of course). When the last step reports that it is finished, execution finishes.

The actors know the topology of the graph. Any step that has no inputs can run immediately. Steps that have inputs can’t run until all of their inputs have been provided. Inputs pass between the steps as messages. Consider a simple, linear pipeline:

When execution begins, “Do Something” can’t run because it’s waiting for input from the “outside”. “Do Something Else” can’t run because it’s waiting for input from “Do Something.” The process that is running the pipeline pours an arbitrary number of documents into the input port. Each document is received immediately by “Do Something”, which is solely responsible for buffering it or doing whatever is necessary. Eventually, the process running the pipeline closes the input port.

At this point, the actor framework tells “Do Something” to run. It does whatever it does, dropping documents into its output port. The framework turns those documents into messages that are sent immediately to “Do Something Else.”

When “Do Something” finishes, the framework closes each of its output ports and consequently the input port that is reading it. When the last input port on “Do Something Else” is closed, the framework tells it to run. And thus it continues until every step has run. When every step has run, the pipeline is finished. The process running the pipeline can collect its results from the output port(s).

In practice, there are a number of subtleties here. Loops have to buffer inputs so that they’re available to each iteration. A choose must make sure that none of the steps in any of its branches run before a specific branch is selected. Etc.

This library amends the graph and introduces additional steps as necessary to hide all of this complexity. Let’s look at a larger example.

To illustrate how things work, a small repository of demonstration pipelines is provided at https://github.com/ndw/jafpl-demo. One of the larger examples is the “calc” demo. The calc demo application that uses the library to evaluate simple arithmetic expressions. Using a pipeline language for this purpose is a bit crazy, but assuming you understand how arithmetic works, it demonstrates some of the features.

Consider the expression “(1+2)*(3+4+5)”. We don’t often think of arithmetic as a pipeline, but the expressions (1+2) and (3+4+5) are independent and could be executed in parallel. In other words, it’s possible to construct a pipeline to evaluate this expression:

The “Literal” steps are atomic steps that provide numbers. For convenience, they’re labled “number-n” for the number “n”. These are simple atomic steps with no inputs and a single output. When they run, they produce the corresponding number on their output port.

The expression has been decomposed into a series of binary operations (3+4+5 = (3+4)+5 because addition is associative). The “BinaryOp” steps perform simple binary arithetic operations: addition, subtraction, multiplication, and division on two inputs.

There are no external inputs. When this pipeline starts, the integer literal steps all execute in parallel, producing one output each. The additions can also proceed in parallel. Then the multiplication computes the final sum, “36”, which is output and printed by the demo.

To see what happens when inputs are required, consider this expression: “(1+2)*$foo”. Here $foo is an additional input to the expression. Steps have inputs and outputs and (almost) nothing else.

All atomic steps work basically the same way. There’s also a provision for declaring a dependency edge between two nodes that aren’t connected by data flow. The semantics of the dependency are that the dependent step will not be run before the step it depends on has finished.

JAF PL

Pipelines