Graphs Ensure Something from Nothing
There is a reason that there is something as opposed to nothing. However, it is simple for something to be nothing if that something is but one thing. One thing could be nothing even if it is not no thing. For that one thing must be both that which is called something and nothing.
To not be one thing, something must be some things. These things sum to something and make something not nothing as now there are some things which, as it stands, are not at all no thing.

To be things, things must not simply be, but instead be things in reference to some things. For a thing that simply is is simply one thing and that one thing is some thing called nothing.

Related things is nothing but some things related. However, things related is something that can be related by some thing. As such, some related things of the related something stave the fact that it is all simply nothing.
The more there are some things, the further something is from nothing…
TinkerPop as of Spring 2011
TinkerPop is an open source software development group focused on technologies in the emerging graph database ecosystem. The group started in the Fall of 2009 and has been actively pushing out technologies since. This post highlights the current state of some of the more popular products made freely available as of Spring 2011.
Blueprints: A Property Graph Model Interface
Blueprints was one of the first TinkerPop products. There are numerous graph database providers. The usual suspects include Neo4j, OrientDB, DEX, InfiniteGraph, Sones, HyperGraphDB, and others. Each database has its own particular graph data model and tools for their respective manipulation and analysis. Blueprints is an attempt to provide a common API. In this way, any tool written to the Blueprints API can work over various graph database providers. The hope is to prevent vendor lock-in as well as make software that is generally useful to developers in the graph scene. Blueprints is relatively simple to implement and has few strict requirements of the underlying graph database. There are some conveniences provided that include: representing an RDF store as a graph database, representing a graph database as an RDF store, support for exposing a graph database as a JUNG graph, etc. The future of the project includes more implementations by more graph database vendors and the development of more utilities and tools that make use of the Blueprints API.
Pipes: A Dataflow Framework using Process Graphs
Pipes was a later development in the TinkerPop suite. The purpose of Pipes is to provide atomic operations that are commonly used when doing a graph traversal. It was realized that every property graph analysis algorithm could be represented by a chain of operations that are an instance of one of the following categories.
- Transform: Take an object and turn it into another object. For example, given a vertex, return its outgoing edges.
- Filter: Take an object and decide whether to return it or not. For example, given an edge, should it be traversed?
- Side Effect: Take an object and return it. However, yield some side effect in the process. For example, count the number of times a particular vertex has been seen.
Moreover, such operations can be lazily evaluated in many cases and have the convenient property of being able to be chained together using function composition.

Pipes has come to serve as a foundation for the development of property graph traversal algorithms and languages.
Gremlin: A Graph Traversal Language
Gremlin was one of the primary reasons why TinkerPop was formed. This is apparent when realizing that TinkerPop’s mailing list is called Gremlin Users. The beauty of a property graph (and multi-relational graphs in general) is that it can be traversed in numerous ways. There are many paths through a graph when one can filter on edges, check properties, update a counter, check a counter, jump back 2 steps, etc. As such, there is a need for a language that can concisely express such traversals. Gremlin is one such language. Gremlin is a DSL written in Groovy that compiles down to Pipes. In short, Gremlin is a language for constructing data flow pipelines to traverse a graph.
Rexster: A RESTful Graph Shell
Rexster was thought of when TinkerPop was first formed, but only came to be a mature product much later. Rexster exposes any Blueprints-enabled graph database as a REST-based server using Jersey (deployable through Tomcat). With Rexster, developers resolve semantically rich URLs and, in return, get JSON formatted responses. There is also a JavaScript-based web administration tool called “The Dog House.” It allows users to visually browse a graph and to interact with an embedded Gremlin console.
The Future of TinkerPop
There are many projects that get added and then removed from the TinkerPop suite. What sounded like a good idea at one point, loses its steam as more is learned. However, one project in the works is Frames and it just might stand the test of time. Frames provides a schema-layer to any Blueprints-enabled graph database. In contrast, RDF stores have rich schema layers via RDFS and OWL. As of today, the graph database scene relies primarily on the semantics of the high-level languages that the graph database was developed in. Frames hopes to free this constraint by allowing developers to easily create schemas with varied semantics and rules of inference.
TinkerPop is a relatively new software development group. The team has grown over the last 2 years and with each new member comes new expertise. Hopefully more developers join and help to contribute to the wonderful world of graphs.
Knowledge Representation and Reasoning with Graph Databases
A graph database and its ecosystem of technologies can yield elegant, efficient solutions to problems in knowledge representation and reasoning. To get a taste of this argument, we must first understand what a graph is. A graph is a data structure. There are numerous types of graph data structures, but for the purpose of this post, we will focus on a type that has come to be known as a property graph. A property graph denotes vertices (nodes, dots) and edges (arcs, lines). Edges in a property graph are directed and labeled/typed (e.g. “marko knows peter”). Both vertices and edges (known generally as elements) can have any number of key/value pairs associated with them. These key/value pairs are called properties. From this foundational structure, a suite of questions can be answered and problems solved.
Object Modeling

The property graph data structure is nearly identical in form to the object graphs of object oriented programming. Take a collection of objects, remove their methods, and you are left with a property graph. An object’s fields are either primitive and in which cases serve as properties or they are complex and in which case serve as references to other objects. For example, in Java:
class Person {
String name;
Integer age;
Collection<Person> knows;
}
The name and age properties are vertex properties of the particular person instance and the knows property refer to knows-labeled edges to other people. Emil Eifrem of Neo Technology espouses the view that property graphs are “whiteboard friendly” as they are aligned with the semantics of modern object oriented languages and the diagramming techniques used by developers. A testament to this idea is the jo4neo project by Taylor Cowan. With jo4neo, Java annotations are elegantly used to allow for the backing of a Java object graph by the Neo4j graph database. Beyond the technological benefits, the human mind tends to think in terms of objects and their relations. Thus, graphs may be considered “human brain friendly” as well.
Given an object graph, questions can be answered about the domain. In the graph traversal DSL known as Gremlin, we can ask questions of the object graph:
// Who does Marko know?
marko.outE('knows').inV
// What are the names of the people that Marko knows?
marko.outE('knows').inV.name
// What are the names and ages of the people that Marko knows?
marko.outE('knows').inV.emit{[it.name, it.age]}
// Who does Marko know that are 30+ years old?
marko.outE('knows').inV{it.age > 30}
Concept Modeling

From the instances that compose a model, there may exist abstract concepts. For example, while there may be book instances, there may also be categories for which those books fall–e.g. science fiction, technical, romance, etc. The graph is a flexible structure in that it allows one to express that something is related to something else in some way. These somethings may be real or ethereal. As such, ontological concepts can be represented along with their instances and queried appropriately to answer questions.
// What are the parent categories of history?
x = []; history.inE('subCategory').outV.aggregate(x).loop(3){!it.equals(literature)}; x
// How many descendant categories does fiction have?
c = 0; fiction.outE('subCategory').inV.foreach{c++}.loop(3){true}; c
// Is romance at the same depth as history?
c = 0; romance.inE('subCategory').outV.loop(2){c++; !it.equals(literature)}.outE('subCategory').inV.loop(2){c--; !it.equals(history)}; c == 0
Automated Reasoning

From the explicit objects, their relationships, and their abstract categories, reasoning processes can be enacted. A tension that exists in graph modeling is what to make explicit (structure) and what to infer through traversal (process). The trade-off is between, like much of computing, space and time. If there exists an edge from a person to their coauthors, then its a single hop to get from that person to his or her coauthors. If, on the other hand, coauthors must be inferred through shared writings, then a multi-hop step is computed to determine coauthors. Reasoning is the process of making what is implicit explicit. A couple simple reasoning examples are presented below using Gremlin.
// Two people who wrote the same book/article/etc. are coauthors
g.V{x = it}.outE('wrote').inV.inE('wrote').outV.except([x])[0].foreach{g.addEdge(null, x, it, 'hasCoauthor')}
// People who write literature are authors
author = g.addVertex(); author.type='role'; author.name='author'
g.V.foreach{it.outE('wrote').inV[0].foreach{g.addEdge(null, it, author, 'hasRole')} >> -1}
In the examples above, a full graph analysis is computed to determine all coauthors and author roles. However, nothing prevents the evaluation of local inference algorithms.
// Marko's coauthors are those people who wrote the same books/articles/etc. as him
marko.outE('wrote').inV.inE('wrote').outV.except([marko])[0].foreach{g.addEdge(null, x, it, 'hasCoauthor')}
Conclusion
Graphs are useful for modeling objects, their relationships to each other, and the conceptual structures wherein which they lie. From this explicit information, graph query and inference algorithms can be evaluated to answer questions on the graph and to increase the density of the explicit knowledge contained within the graph (i.e. increase the number of vertices and edges). This particular graph usage pattern has been exploited to a great extent in the world of RDF (knowledge representation) and RDFS/OWL (reasoning). The world of RDF/RDFS/OWL is primarily constrained to description logics (see an argument to the contrary here). Description logics are but one piece of the larger field of knowledge representation and reasoning. There are numerous logics that can be taken advantage of. In the emerging space of graph databases, the necessary building blocks exist to support the exploitation of other logics. Moreover, these logics, in some instances, may be used concurrently within the same graphical structure. To this point, the reading list below provides a collection of books that explicate different logics and ideas regarding heterogeneous reasoning. Graph databases provide a green field by which these ideas can be realized.
Further Reading
Brachman, R., Levesque, H., “Knowledge Representation and Reasoning,” Morgan Kaufmann, 2004.
Wang, P., “Rigid Flexibility: The Logic of Intelligence,” Springer, 2006.
Mueller, E.T., “Commonsense Reasoning,” Morgan Kaufmann, 2006.
Minsky, M., “The Society of Mind,” Simon & Schuster, 1988.
MySQL vs. Neo4j on a Large-Scale Graph Traversal

This post presents an analysis of MySQL (a relational database) and Neo4j (a graph database) in a side-by-side comparison on a simple graph traversal. The data set that was used was an artificially generated graph with natural statistics. The graph has 1 million vertices and 4 million edges. The degree distribution of this graph on a log-log plot is provided below. A visualization of a 1,000 vertex subset of the graph is diagrammed above.

Loading the Graph
The graph data set was loaded both into MySQL and Neo4j. In MySQL a single table was used with the following schema.
CREATE TABLE graph (
outV INT NOT NULL,
inV INT NOT NULL
);
CREATE INDEX outV_index USING BTREE ON graph (outV);
CREATE INDEX inV_index USING BTREE ON graph (inV);
After loading the data, the table appears as below. The first line reads: “vertex 0 is connected to vertex 1.”
mysql> SELECT * FROM graph LIMIT 10;
+------+-----+
| outV | inV |
+------+-----+
| 0 | 1 |
| 0 | 2 |
| 0 | 6 |
| 0 | 7 |
| 0 | 8 |
| 0 | 9 |
| 0 | 10 |
| 0 | 12 |
| 0 | 19 |
| 0 | 25 |
+------+-----+
10 rows in set (0.04 sec)
The 1 million vertex graph data set was also loaded into Neo4j. In Gremlin, the graph edges appear as below. The first line reads: “vertex 0 is connected to vertex 992915.”
gremlin> g.E[1..10]
==>e[183][0-related->992915]
==>e[182][0-related->952836]
==>e[181][0-related->910150]
==>e[180][0-related->897901]
==>e[179][0-related->871349]
==>e[178][0-related->857804]
==>e[177][0-related->798969]
==>e[176][0-related->773168]
==>e[175][0-related->725516]
==>e[174][0-related->700292]
Warming Up the Caches
Before traversing the graph data structure in both MySQL and Neo4j, each database had a “warm up” procedure run on it. In MySQL, a “SELECT * FROM graph” was evaluated and all of the results were iterated through. In Neo4j, every vertex in the graph was iterated through and the outgoing edges of each vertex were retrieved. Finally, for both MySQL and Neo4j, the experiment discussed next was run twice in a row and the results of the second run were evaluated.
Traversing the Graph
The traversal that was evaluated on each database started from some root vertex and emanated n-steps out. There was no sorting, no distinct-ing, etc. The only two variables for the experiments are the length of the traversal and the root vertex to start the traversal from. In MySQL, the following 5 queries denote traversals of length 1 through 5. Note that the “?” is a variable parameter of the query that denotes the root vertex.
SELECT a.inV FROM graph as a WHERE a.outV=?
SELECT b.inV FROM graph as a, graph as b WHERE a.inV=b.outV AND a.outV=?
SELECT c.inV FROM graph as a, graph as b, graph as c WHERE a.inV=b.outV AND b.inV=c.outV AND a.outV=?
SELECT d.inV FROM graph as a, graph as b, graph as c, graph as d WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND a.outV=?
SELECT e.inV FROM graph as a, graph as b, graph as c, graph as d, graph as e WHERE a.inV=b.outV AND b.inV=c.outV AND c.inV=d.outV AND d.inV=e.outV AND a.outV=?
For Neo4j, the Blueprints Pipes framework was used. A pipe of length n was constructed using the following static method.
public static Pipeline createPipeline(final Integer steps) {
final ArrayList pipes = new ArrayList();
for (int i = 0; i < steps; i++) {
Pipe pipe1 = new VertexEdgePipe(VertexEdgePipe.Step.OUT_EDGES);
Pipe pipe2 = new EdgeVertexPipe(EdgeVertexPipe.Step.IN_VERTEX);
pipes.add(pipe1);
pipes.add(pipe2);
}
return new Pipeline(pipes);
}
For both MySQL and Neo4j, the results of the query (SQL and Pipes) were iterated through. Thus, all results were retrieved for each query. In MySQL, this was done as follows.
while (resultSet.next()) {
resultSet.getInt(finalColumn);
}
In Neo4j, this is done as follows.
while (pipeline.hasNext()) {
pipeline.next();
}
Experimental Results
The artificial graph dataset was constructed with a “rich get richer“, preferential attachment model. Thus, the vertices created earlier are the most dense (i.e. highest number of adjacent vertices). This property was used to limit the amount of time it would take to evaluate the tests for each traversal. Only the first 250 vertices were used as roots of the traversals. Before presenting timing results, note that all of these experiments were run on a MacBook Pro with a 2.66GHz Intel Core 2 Duo and 4Gigs of RAM at 1067 MHz DDR3. The packages used were Java 1.6, MySQL JDBC 5.0.8, and Blueprints Pipes 0.1.2.
java version "1.6.0_17" Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)
The following Java Virtual Machine parameters were used:
-Xmx1000M -Xms500M
Below are the total running times for both MySQL (red) and Neo4j (blue) for traversals of length 1, 2, 3, and 4.

The raw data is presented below along with the total number of vertices returned by each traversal—which, of course, is the same for both MySQL and Neo4j given that its the same graph data set being processed. Also realize that traversals can loop and thus, many of the same vertices are returned multiple times. Finally, note that only Neo4j has the running time for a traversal of length 5. MySQL did not finish after waiting 2 hours to complete. In comparison, Neo4j took 14.37 minutes to complete a 5 step traversal.
[mysql steps-1] time(ms):124 -- vertices_returned:11360
[mysql steps-2] time(ms):922 -- vertices_returned:162640
[mysql steps-3] time(ms):8851 -- vertices_returned:2206437
[mysql steps-4] time(ms):112930 -- vertices_returned:28125623
[mysql steps-5] N/A
[neo4j steps-1] time(ms):27 -- vertices_returned:11360
[neo4j steps-2] time(ms):474 -- vertices_returned:162640
[neo4j steps-3] time(ms):3366 -- vertices_returned:2206437
[neo4j steps-4] time(ms):49312 -- vertices_returned:28125623
[neo4j steps-5] time(ms):862399 -- vertices_returned:358765631
Next, the individual data points for both MySQL and Neo4j are presented in the plot below. Each point denotes how long it took to return n number of vertices for the varying traversal lengths.

Finally, the data below provides the number of vertices returned per millisecond (on average) for each of the traversals. Again, MySQL did not finish in its 2 hour limit for a traversal of length 5.
[mysql steps-1] vertices/ms:91.6128847554668
[mysql steps-2] vertices/ms:176.399127537985
[mysql steps-3] vertices/ms:249.286746556076
[mysql steps-4] vertices/ms:249.053599519823
[mysql steps-5] N/A
[neo4j steps-1] vertices/ms:420.740351166341
[neo4j steps-2] vertices/ms:343.122344772028
[neo4j steps-3] vertices/ms:655.507125256186
[neo4j steps-4] vertices/ms:570.360621871775
[neo4j steps-5] vertices/ms:416.00886711325
Conclusion
In conclusion, given a traversal of an artificial graph with natural statistics, the graph database Neo4j is more optimal than the relational database MySQL. However, no attempts have been made to optimize the Java VM, the SQL queries, etc. These experiments were run with both Neo4j and MySQL “out of the box” and with a “natural syntax” for both types of queries.
Property Graph Algorithms

The term property graph has come to denote an attributed, multi-relational graph. That is, a graph where the edges are labeled and both vertices and edges can have any number of key/value properties associated with them. An example of a property graph with two vertices and one edge is diagrammed below.

Property graphs are more complex than the standard single-relational graphs of common knowledge. The reason for this is that there are different types of vertices (e.g. people, companies, software) and different types of edges (e.g. knows, works_for, imports). The complexities added by this data structure (and multi-relational graphs in general, e.g. RDF graphs) effect how graph algorithms are defined and evaluated.
Standard graph theory textbooks typically present common algorithms such as various centralities, geodesics, assortative mixings, etc. These algorithms usually come pre-packaged with single-relational graph toolkits and frameworks (e.g. NetworkX, iGraph).
It is common for people to desire such graph algorithms when they begin to work with property graph software. I have been asked many times:
“Does the property graph software you work on support any of the common centrality algorithms? For example, PageRank, closeness, betweenness, etc.?”
My answer to this question is always:
“What do you mean by centrality in a property graph?”
When a heterogeneous set of vertices can be related by a heterogeneous set of edges, there are numerous ways in which to calculate centrality (or any other standard graph algorithm for that matter).
- Ignore edge labels and use standard single-relational graph centrality algorithms.
- Isolate a particular “slice” of the graph (e.g. the knows subgraph) and use standard single-relational graph centrality algorithms.
- Make use of abstract adjacencies to compute centrality with higher-order semantics.
The purpose of this blog post is to stress point #3 and the power of property graph algorithms. In Gremlin, you can calculate numerous eigenvector centralities for the same property graph instance. At this point, you might ask: “How can a graph have more than one primary eigenvector?” The answer lies in seeing all the graphs that exist within the graph—i.e. seeing all the higher-order, derived, implicit, virtual, abstract adjacencies. Each line below exemplifies point #1, #2, and #3 in the list above, respectively. The code examples use the power method to calculate the vertex centrality rankings which are stored in the map m.
g.V.outE.inV.groupCount(m).loop(3){c++ < 10000} // point #1
g.V.outE[[label:'knows']].inV.groupCount(m).loop(4){c++ < 10000} // point #2
g.V.???.groupCount(m).loop(?){c++ < 10000} // point #3
The ??? on line 3 refers to the fact that ??? can be any arbitrary computation. For example, ??? can be:
outE[[label:'works_for']].inV.inE[[label:'works_for']].outV outE[[label:'works_for']].inV[[name:'ACME']].inE[[label:'works_for']].outV outE[[label:'develops']].inV.outE[[label:'imports']].inV[[name:'Blueprints']].back(7).outE[[label:'works_for']].inV.inE[[label:'works_for']].outV.outE[[label:'develops']].inV.outE[[label:'imports']].inV[[name:'Blueprints']].back(7)
The above expressions have the following meaning:
- Coworker centrality
- ACME Corporation coworker centrality
- Coworkers who import Blueprints into their software centrality
There are numerous graphs within the graph. As such, “what do you mean by centrality?”

These ideas are explored in more detail in the following article and slideshow.
Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29-41, Elsevier, doi:10.1016/j.joi.2009.06.004, 2009.
Gremlin 0.7 Released

Gremlin is a domain specific language for traversing graphs. Gremlin 0.7 (Gremopoly) is a major improvement over Gremlin 0.6 and below. Gremlin 0.7 makes use of Groovy as its host language. Given that Groovy is a superset of Java, it is possible to seamlessly integrate with Blueprints, Pipes, and the standard Java JDK. There is no more awkward disjoint between Gremlin and the engine it is built upon — namely Java and TinkerPop.
Here is a collection of Gremlin 0.7 code snippets to provide a flavor of the language.
- What are vertex 1′s friends’ names?
g.v(1).outE[[label:'friend']].inV.name
- What are vertex 1′s friends’ friends’ names who are not vertex 1′s friends?
x = [] g.v(1).outE[[label:'friend']].inV.aggregate(x) .outE[[label:'friend']].inV.except(x).name
- What are the names of vertex 1′s friends who are friends with vertex 2?
g.v(1).outE[[label:'friend']].inV .outE[[label:'friend']].inV[[id:2]].back(4).name
- How many steps away from vertex 1 is vertex 2 in the friendship subgraph?
s = 0; r = true; // s will be solution
g.v(1).outE[[label:'friend']].inV.loop(3)
{if(r & it.object==g.v(2)) {s=it.loops; r=false}; r}
These basic examples touch at what is possible when maneuvering a graph with Gremlin. As the graphical domain model becomes more complex and the way in which vertices relate to each other yield increasingly sophisticated abstract relationships, Gremlin becomes an indispensable tool for arbitrary property graph traversal, query, and ultimately, problem-solving.
For those that process RDF, Gremlin provides connectivity to the OpenRDF Sail framework (see Sail and Blueprints). Moreover, with Gremlin, you can traverse the Web of Data in real-time.
The move to Groovy as Gremlin’s host language will speed up the development and ease the maintenance of Gremlin. This is a big step for all of us that love to traverse graphs.







