Skip to content

On the Nature of Pipes

August 3, 2011

Pipes is a data flow framework developed by TinkerPop. The graph traversal language Gremlin is a Groovy-based domain-specific language for processing Blueprints-enabled graph databases with Pipes. Since the release of Pipes 0.7 on August 1, 2011, much of the functionality in Gremlin has been generalized and made available through Pipes. This has opened up the door for other JVM languages (e.g. JRuby, Jython, Clojure, etc.) to serve as host languages for graph traversal DSLs. In order to promote this direction, this post will explain Pipes from the vantage point of Gremlin.

An Example Graph Dataset

A graph is a data structure made up of vertices and edges. A property graph is a popular term for a graph data model that has labeled edges and arbitrary key/value pairs associated with both vertices and edges (read more here). In graph theoretic language, a property graph is known as a directed, attributed, multi-relational graph.

All the of the graph traversal examples presented in this post make use of the above diagrammed graph. A minimally extended version of this dataset is distributed with Gremlin and is available at this location (represented in GraphML). For readers that have Gremlin 1.2 downloaded, it’s possible to load up this graph and go through each of the following examples using the Gremlin console. The in-memory graph database TinkerGraph is used throughout. However, note that Gremlin, Pipes, and Blueprints currently work for the following graph database vendors: Neo4j, OrientDB, DEX, and Sail-based RDF stores (e.g. Stardog).

gremlin$ ./gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> g.loadGraphML('data/graph-example-1.xml')
==>null
gremlin> 

Three Types of Pipes

The Pipes API is a collection of atomic “pipes.” A pipe is a simple function that maps an input to an output. The input to a pipe is called a “start” and the output is called an “end.” The general signature of a pipe is Pipe<S,E>. By chaining pipes together, complex computations can be effected. There are 3 general types of pipes.

  1. transform: Given an object of type S, emit an object of type E (see JavaDoc).
  2. filter: Given an object of type S, emit it or not (see JavaDoc).
  3. sideEffect: Given an object of type S, emit it, but yield some computational side-effect in the process (see JavaDoc).

The sections to follow will explain these general forms with specific examples over the example dataset diagrammed previous.

Transform Pipes

During a transformation, an input of type S is mapped/transformed to an output of type E. Below is a basic example where all the steps/pipes used are transformational. The term “step” is used in Gremlin (see documentation). The term “pipe” is used in Pipes. The two terms are nearly analogous.

gremlin> g.v(1).out.name
==>vadas
==>lop
==>josh

In Gremlin, g.v(1) is a method call that returns vertex 1 from the graph. That vertex is passed to the out step (OutPipe) which yields vertices 2, 3, and 4. Finally, the name step (PropertyPipe(“name”)) yields the string name properties of those 2, 3, and 4 vertices.

gremlin> g.v(1).out
==>v[2]
==>v[3]
==>v[4]
gremlin> g.v(1).out.name
==>vadas
==>lop
==>josh

It is important to note that Pipes is not “set-based” in the sense that vertex 2, 3, and 4 are generated before passing those vertices off to the name step (the next stage in the pipeline). On the contrary, vertex 2 is generated, then vadas is generated. Next vertex 3 is generated and then lop. Finally, vertex 4 is generated and then josh. Pipes is a lazy evaluation framework in that as the pipeline is iterated for results, the mappings are computing as needed.

One of the core methods of a pipe is List Pipe.getPath(). This method returns the transformations that an object underwent through a pipeline. This method is demonstrated using the Gremlin step paths (PathPipe).

gremlin> g.v(1).out.name.paths
==>[v[1], v[2], vadas]
==>[v[1], v[3], lop]
==>[v[1], v[4], josh]

Note that any Gremlin expression has a string representation that exposes the underlying pipes being used to represent the expression.

gremlin> g.v(1).out.name.paths.toString()
==>[OutPipe, PropertyPipe(name), PathPipe] 

Filter Pipes

Filter steps/pipes are used to remove particular objects from the flow of a computation. The example below demonstrates one of the significant new features of Pipes 0.7. In Pipes 0.7, there is a PipeClosure (see JavaDoc) interface which can be used with closure-based pipes to provide dynamic functionality. In the examples below filter and transform are the most generic type of filter- and transformation-based pipes, where the filtering and transformation algorithm are specified with a Groovy/Java code snippet. In essence, the provided closure “closes” the meaning/function of the pipe.

Closures typically appear in languages in which functions are first-class values—in other words, such languages allow functions to be passed as arguments, returned from function calls, bound to variable names, etc., just like simpler types such as strings and integers. — Wikipedia

gremlin> g.v(1).out('knows')
==>v[2]
==>v[4]
gremlin> g.v(1).out('knows').filter{it.age < 30} 
==>v[2]
gremlin> g.v(1).out('knows').filter{it.age < 30}.name
==>vadas
gremlin> g.v(1).out('knows').filter{it.age < 30}.name.transform{it.length()}
==>5
gremlin> g.v(1).out('knows').filter{it.age < 30}.name.transform{it.length()}.toString()
==>[OutPipe([knows]), FilterClosurePipe, PropertyPipe(name), TransformClosurePipe]

This example demonstrates how other JVM languages like JRuby (which support closures) and Clojure (which supports anonymous functions) can take advantage of the Pipes framework. In Gremlin, a Groovy closure is wrapped in a GroovyPipeClosure which makes the Groovy closure behave according to the PipeClosure interface and thus, pluggable into Pipes. Any other JVM language with similar features can capitalize on this same pattern.

Side-Effect Pipes

The third type of pipe is a side-effect pipe. A side-effect pipe implements SideEffectPipe<S,T> and takes an object of type S and emits that object. However, in the process, it yields some computational side-effect that can, for instance, be used later in the traversal computation. That side-effect object is of type T. The example below makes use of the most generic side-effect pipe: SideEffectClosurePipe. The example below calculates vertex 1’s co-developers — that is, those vertices that have worked with vertex 1 on a project. Note that a vertex can not be their own co-developer and thus, sideEffect and filter are used appropriately to express this abstract co-developer relationship.

gremlin> g.v(1).sideEffect{x=it}
==>v[1]
gremlin> g.v(1).sideEffect{x=it}.out('created')
==>v[3]
gremlin> g.v(1).sideEffect{x=it}.out('created').in('created')
==>v[1]
==>v[4]
==>v[6]
gremlin> g.v(1).sideEffect{x=it}.out('created').in('created').filter{it != x}
==>v[4]
==>v[6]
gremlin> g.v(1).sideEffect{x=it}.out('created').in('created').filter{it != x}.toString()
==>[SideEffectClosurePipe, OutPipe([created]), InPipe([created]), FilterClosurePipe]

In the example above, the variable x stores the current vertex being pushed through the SideEffectClosurePipe. This variable is later accessed by the FilterClosurePipe to filter out vertices that match the source of the traversal. In other words, path traversals that make use of history can be effected using such a pattern (i.e. non-regular graph traversals).

Branch and Meta Pipes

There are two pipe sub-types that should be discussed: branch and meta pipes. A branch pipe is any pipe that alters the flow of a computation based on some decision criteria (see JavaDoc). The ifThenElse step of Gremlin demonstrates the behavior of the closure-based IfThenElsePipe.

gremlin> g.v(1).out('knows').ifThenElse{it.age < 30}{it.name}{it.out('created').name}           
==>vadas
==>ripple
==>lop

The ifThenElse step takes 3 closures: a condition-closure, a then-closure, and finally, an else-closure. Other branch pipes include copySplit and the respective merging pipes which take multiple branches and yield a single flow: fairMerge and exhaustMerge.

The meta-pipes are perhaps the most bewildering yet most powerful of the pipes in Pipes (see JavaDoc). A meta-pipe is a pipe that maintains pipes internal to it. As such, the meta-pipe can reason on the results of its internal pipes and then effect an appropriate behavior at the meta-pipe level. To better explain the concept of a MetaPipe, backtracking and looping are demonstrated.

gremlin> g.v(1).out('knows').name
==>vadas
==>josh
gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}
==>vadas
gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}.back(2)
==>v[2]

The above Gremlin code snippet and diagram demonstrate the behavior of back (BackFilterPipe). If an object reaches back, emit the object that it was n steps ago. In the example above, n is 2 and thus, two steps ago from the string vadas was vertex 2. While this linear sequence articulates backtracking, the diagram below better explains how this computation is implemented in Pipes.

gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}.back(2).toString()
==>[OutPipe([knows]), BackFilterPipe[[PropertyPipe(name), FilterClosurePipe]]]

While not diagrammed, it is possible to use named-steps instead of integers (e.g. 2) to denote how many steps to back up to. As an expression becomes more complex, using named-steps makes the expression more readable and manageable.

gremlin> g.v(1).out('knows').as('here').name.filter{it[0]=='v'}.back('here')
==>v[2]

Finally, to conclude this section, the meta-pipe LoopPipe is explained. LoopPipe is both a meta-pipe and a branch pipe in that is can control the flow of objects and, like BackPipe makes use of internal pipes during its operation.

gremlin> g.v(1).out.loop(1){it.loops < 3}
==>v[5]
==>v[3]

In the example above, the loop step passes the current object to the provided “while”-based condition-closure. If the closure yields true, the object is inserted into the pipe from n steps ago. Moreoever, metadata is appended to the object. In the example above the loops metadata denotes how many times the object path has gone through the loop pipe. Thus, when the object path has gone through 3 times, the condition-closure fails and the object is passed on to the next stage of the computation/pipeline.

gremlin> g.v(1).out.loop(1){it.loops < 3}.name
==>ripple
==>lop

While a linear interpretation is possible, underneath the hood, loop is a meta-pipe. The diagram below better represents this.

gremlin> g.v(1).out.loop(1){it.loops < 3}.name.toString()
==>[LoopPipe[[OutPipe]], PropertyPipe(name)]

Conclusion

There are numerous pipes that come with the Pipes distribution. These pipes can be mixed and matched to yield lazy graph traversals. A description of all the pipes in Pipes 0.7 is provided in the associated JavaDoc. Note that it is easy to create a pipe by simply extending AbstractPipe and implementing the protected method AbstractPipe.processNextStart(). This is handy for developers wanting to create domain-specific pipes and use them in Pipes-based DSLs like Gremlin. Two trivial examples are provided below.

public class NumCharsPipe extends AbstractPipe<String,Integer> {
  public Integer processNextStart() {
    return this.starts.next().length();
  }
}

public class MoreThanFourCharsPipe extends AbstractPipe<String,String> implements FilterPipe<String> {
  public String processNextStart() {
    while(true) {
      String s = this.starts.next();
      if(s.length() > 4)
        return s;
    }
  }
}
gremlin> Gremlin.defineStep('numChars', [Pipe], {new NumCharsPipe()})
==>null
gremlin> Gremlin.defineStep('moreThan4Chars', [Pipe], {new MoreThanFourCharsPipe()})
==>null
gremlin> g.v(1).out.name.numChars                                                
==>5
==>3
==>4
gremlin> g.v(1).out.name.moreThan4Chars
==>vadas

NOTE: Gremlin provides an easier way to implement user-defined steps. The example above is provided to demonstrate the boundary between the Java-based Pipes and the Groovy-based Gremlin.

The TinkerPop crew plans to further open up the Pipes data flow model to other JVM languages (for example, see Pacer). Pipes 0.7 introduced the flexible PipeClosure concept which will make it easy for DSL designers to make use of Pipes as the backend engine of their graph traversal language. Moreover, with this model, there is a central point for optimization and feature development which will hopefully ensure a bright future for lazy graph traversal languages implemented on the JVM.

From → Gremlin, TinkerPop

Follow

Get every new post delivered to your Inbox.

Join 138 other followers

%d bloggers like this: