Skip to content

Graph Pattern Matching with Gremlin 1.1

June 15, 2011

Gremlin 1.1 was released on June 15, 2011. A major aspect of this release includes traversal-based graph pattern matching. This post provides a review of this functionality.

NOTE: For those that are more than casually interested, the GraphML representation of the graph diagrammed above and used throughout the examples is provided here. To evaluate the examples, download Gremlin 1.1 and load the graph with the following commands.

g = new TinkerGraph() // in-memory graph database
g.loadGraphML('traversal-pattern-example.xml')

Traversing vs. Pattern Matching

There are two schools of thought regarding graph querying: graph traversing and graph pattern matching. A graph traversal is usually accomplished by

  1. determining a set of root/start vertices,
  2. emanating from those vertices according to some an abstract path description,
  3. and finally, yielding some side-effect in the process (e.g. the end of the path, intermediate computations, etc.).

The diagram above provides a simple example of a graph traversal in Gremlin. The root is vertex 1 and the abstract path description describes what the root’s friends have purchased. The result of this traversal yields the vertices 3, 4, and 9 (i.e. the side-effect is the end of the path).

A graph pattern match, on the other hand, is accomplished by providing an abstract (sub)graph pattern with any number of grounded instances (vertices, edges in the graph) and variables (called bindings). The query engine will then determine all subgraphs of the larger graph that fit that pattern. In the diagram above, the popular pattern match language SPARQL is used, where 1, knows, and bought serve as grounds and ?x and ?y serve as variables. Given the query in the diagram, what is returned is a table of results.

--------------
| ?x   | ?y   |
|-------------|
| v[8] | v[9] |
| v[2] | v[4] |
| v[2] | v[3] |
 -------------

Beyond the declarative/imperative aspects of the two models, what differentiates graph traversing and graph pattern matching is how state is saved (variables are bound) through the course of an evaluation. In order to bridge these two approaches, Gremlin 1.1 now provides a table data structure and the ability to name steps in a path. These two features conveniently provide traversal-based graph pattern matching. Now it is possible to emulate the behavior of SPARQL in property graphs while simultaneously, leveraging the traversal aspects of Gremlin.

SELECT ?x ?y WHERE {
  1 knows ?x .
  ?x bought ?y 
}

--------------
| ?x   | ?y   |
|-------------|
| v[8] | v[9] |
| v[2] | v[4] |
| v[2] | v[3] |
 -------------
gremlin> t = new Table()                                              
gremlin> g.v(1).out('knows').as('x').out('bought').as('y').table(t)
==>v[9]
==>v[4]
==>v[3]
gremlin> t
==>[x:v[8], y:v[9]]
==>[x:v[2], y:v[4]]
==>[x:v[2], y:v[3]]

In the Gremlin example above, the path expression yields the end of the path (vertices 3, 4, and 9). However, the table t is lazily filled with the intermediate step data as demarcated by the as steps. In this way, t is the table of variable bindings.

NOTE: For those familiar with Gremlin 0.7–1.0, the paths step allows for similar functionality (see Paths Pattern). The difference being that with table, steps can be named and a convenient table data structure can be constructed that provides methods for slicing and dicing a table and for post-processing table entries. Moreover, as will be demonstrated, there are other ramifications afforded that are difficult to accomplish simply with paths.

Graph Pattern Matching Examples

What has been presented thus far are only the basics. By making use of other steps in Gremlin and the methods afforded by Table, sophisticated pattern matching behavior can be effected. There are many possibilities now that traversing and pattern matching are combined. A “grab bag” of useful and far-reaching examples is provided below.


Range Filter as Limit and Offset

In SPARQL there is the LIMIT and OFFSET clause which can be used to reduce the number of results returned.

SELECT ?x ?y WHERE {
  1 knows ?x .
  ?x bought ?y 
}
LIMIT   5
OFFSET  10

In Gremlin, use a range filter to limit the number of rows added to the table.

g.v(1).out('knows').as('x').out('bought').as('y')[10..15].table(t)

Of course, for the simple “knows/bought”-graph diagrammed, there are not 15 results.


Filter Steps as Pattern Match Filters

Graph pattern matching languages usually allow for the application of filters. A filter will match all patterns where some literal of the patten is checked to make sure some predicate holds (e.g. age is greaten than 21, name starts with “M”, etc.). Below, a SPARQL query is provided where the name of vertex 1’s friends who bought products must start with “pe” in order to match the pattern.

SELECT ?x ?y WHERE {
  1 knows ?x .
  ?x name ?name .
  ?x bought ?y .
  FILTER regex(?name, "pe.*")
}

By using standard Gremlin filters within a path description, the same result is returned.

g.v(1).out('knows').filter{it.name.matches('pe.*')}.as('x').out('bought').as('y').table(t)

   [x:v[8], y:v[9]]

Unique to Return Distinct Results

In SPARQL, its possible to remove duplicates using DISTINCT.

SELECT DISTINCT ?x ?y WHERE {
  1 knows ?x .
  ?x bought ?y .
}

With Gremlin, there is the method Table.unique(). Once a table of results has been generated, evaluate Table.unique(). The example below is not based on the diagrammed graph as there are no duplicates in that example.

gremlin> t
==>[b, 2]
==>[a, 1]
==>[a, 1]
gremlin> t.unique()
==>[b, 2]
==>[a, 1]

Sort to Order Results

In SPARQL, its possible to order results using ORDER.

SELECT DISTINCT ?x ?y WHERE {
  1 knows ?x .
  ?x bought ?y .
}
ORDER BY ?x

With Gremlin, there is the method Table.sort(). Once a table of results has been generated, evaluate Table.sort(). Again, the example below is not based on the diagrammed graph.

gremlin> t
==>[b, 2]
==>[a, 1]
==>[a, 1]
gremlin> t.sort()
==>[a, 1]
==>[a, 1]
==>[b, 2]
gremlin> t.unique().sort()
==>[a, 1]
==>[b, 2]

NOTE: For both unique() and sort() there are overloaded methods that allow for closures and comparators to be added to accomplish behaviors like arbitrary uniqueness criteria, reverse sort, etc. For those familiar with Gremlin, a cap can be used to do inline sorting and uniquing.


Multiple Patterns within a Single Traversal

In all the examples thus far, the table step has been the last step. This need not be the case. For example, it is possible to do a pattern match in the beginning of an expression and still accomplish some other traversal computation in the latter part. Moreover, it is possible to have multiple tables within the same expression and in doing so, match multiple patterns within the same traversal. In the example below, table t1 refers to vertex 1 and his friends. Table t2 refers to vertex 1’s friends and their purchased products. Note that the second table step is provided a list of as names to index (i.e. y and z).

g.v(1).as('x').out('knows').as('y').table(t1).out('bought').as('z').table(t2, ['y','z'])

t1
   [x:v[1], y:v[8]]
   [x:v[1], y:v[2]]
t2
   [y:v[8], z:v[9]]
   [y:v[2], z:v[4]]
   [y:v[2], z:v[3]]

Using Table Column Closures

A column closure can be evaluated over an entry in order to process the result prior to insertion into the table. In the contrived example below, instead of saving vertices, the names of the friends and the lengths of the names of the purchased products are inserted into the table.

                                            
g.v(1).out('knows').as('x').out('bought').as('y').table(t){it.name}{it.name.length()}

   [x:peter, y:7]
   [x:pierre, y:6]
   [x:pierre, y:4]

This can also be effected in a post-processing manner by using the Table.apply(Closure[]) method: t.apply{it.name}{it.name.length()}.


Backtracking and Pattern Matching

A neat feature of Gremlin (inspired by XPath) is the ability to backtrack (see Backtrack Pattern). With the back step, its possible to explore a particular path and then back up to a previous part of the path before moving forward again. Visually, this is like going down one fork of a path, returning to the fork and going down another fork in the path–all within a single linear expression. In the example below, the name of the friend, the number of friends each friend has, and the products they purchased are saved to the resultant table.

g.v(1).out('knows').name.as('x').back(1).transform{it.out('knows').count()}.as('y').back(1).out('bought').as('z').table(t)

   [x:peter, y:1, z:v[9]]
   [x:pierre, y:0, z:v[4]]
   [x:pierre, y:0, z:v[3]]

This example is a bit contrived as a similar behavior can be effected using the table column closures model discussed previous. However, when back is utilized for its filtering capabilities, backtracking and pattern matching play off each other in an interesting way.


Recursive Pattern Matching

The simple “knows/bought”-graph used thus far is not complex enough to describe recursive pattern matching. Therefore, a new graph is needed. Imagine an genealogy tree like the one diagrammed below.

g.v(1).as('x').out('parent').as('y').table(t).loop('x'){it.out.count() != 0}

In the Gremlin example above, the loop step will loop over out(‘parent’).as(‘y’) until it reaches the top of the tree (see Loop Pattern). The table t will contain all the ancestor relationships between vertex 1 all the way back to the first parent (vertex infinity). For example, a subset of the table is:

[x:1, y:2]
[x:1, y:3]
[x:1, y:4]
[x:1, y:5]
...

The ability to loop through sections of an expression while, at the same time, matching patterns seen within the recursion is a major boon achieved when mixing graph traversing and graph pattern matching.


Bonus Examples

The examples presented have demonstrated functionality in isolation in order to articulate common use cases with the clearest possible syntax. However, as many know, Gremlin can get a bit “crazy” once you get the hang of it — it’s been called “Perl for graphs.” For fun, lets get crazy.

PROBLEM: Here is a pattern match example. Try and figure out what it computes.

gremlin> g.v(1).out('knows').as('x').out('bought').as('y').table(new Table()){it.name}{it.name}.cap.transform{it.unique().sort{a,b -> a.get(1) <=> b.get(1)}.apply{it.substring(0,2)}{it}} >> 1
==>[pe, arduino]
==>[pi, fpga]
==>[pi, tether]

SOLUTION: Get vertex 1’s friends and what they have purchased. Insert those into a new table with the entries being the name of the friends and purchased products. Cap the table step/pipe in order to emit the table, not the end of the path. Transform the table into a de-duplicated version sorted by the names of the products with the first column being the first two characters of the person’s name. Finally, return the single table.

Conclusion

Gremlin 1.1 boasts named steps. With the as step, it’s possible to name a particular step in the path and use it with back, optional, loop, and table. For those using Gremlin previous to 1.1, named steps is a welcome addition that makes Gremlin code easier to read. However, the main achievement of Gremlin 1.1 is the ability to do graph pattern matching using graph traversal techniques. This development unites two seemingly disparate graph query worlds. It’s possible to now evaluate a complicated traversal over a graph and at the same time, save particular aspects of that traversal to a table result set — two birds with one stone.

References

This blog post used examples from SPARQL to frame Gremlin’s 1.1 graph pattern matching constructs. Note that there are many other graph languages out there in the wonderful world of graphs — more than those mentioned below. Finally, please see the Pattern Match Pattern documentation for more on Gremlin graph pattern matching.

  • Ripple: An RDF Path Language
  • Cypher: A Property Graph Pattern Match Language
  • Versa: An RDF Path Language
  • RQL: An RDF Pattern Match Language
  • About these ads
    Follow

    Get every new post delivered to your Inbox.

    Join 137 other followers

    %d bloggers like this: