
On Graph Computing

The concept of a graph has been around since the dawn of mechanical computing and for many decades prior in the domain of pure mathematics. Due in large part to the current golden age of databases, graphs are becoming increasingly popular in software engineering. Graph databases provide a way to persist and process graph data. However, the graph database is not the only way in which graphs can be stored and analyzed. Graph computing has a history prior to the use of graph databases and has a future that is not necessarily entangled with typical database concerns. There are numerous graph technologies that each have their respective benefits and drawbacks. Leveraging the right technology at the right time is required for effective graph computing.

Structure: Modeling Real-World Scenarios with Graphs

A graph (or network) is a data structure. It is composed of vertices (dots) and edges (lines). Many real-world scenarios can be modeled as a graph. This is not necessarily inherent to some objective nature of reality, but primarily predicated on the fact that humans subjectively interpret the world in terms of objects (vertices) and their respective relationships to one another (edges) (an argument against this idea). The popular data model used in graph computing is the property graph. The following examples demonstrate graph modeling via three different scenarios.

A Software Graph

Stephen is a member of a graph-oriented engineering group called TinkerPop. Stephen contributes to Rexster. Rexster is related to other projects via software dependencies. When a user finds a bug in Rexster, they issue a ticket. This description of a collaborative coding environment can be conveniently captured by a graph. The vertices (or things) are people, organizations, projects, and tickets. The edges (or relationships) are, for example, memberships, dependencies, and issues. A graph can be visualized using dots and lines and the scenario described above is diagrammed below.

A Discussion Graph

Matthias is interested in graphs. He is the CTO of Aurelius and the project lead for the graph database Titan. Aurelius has a mailing list. On this mailing list, people discuss graph theory and technology. Matthias contributes to a discussion. His contributions beget more contributions. In a recursive manner, the mailing list manifests itself as a tree. Moreover, the unstructured text of the messages makes reference to shared concepts.

A Concept Graph

A graph can be used to denote the relationships between arbitrary concepts, even the concepts related to graph. For example, note how concepts (in italics) are related in the sentences to follow. A graph can be represented as an adjacency list. The general way in which graphs are processed is via graph traversals. There are two general types of graph traversals: depth-first and breadth-first. Graphs can be persisted in a software system known as a graph database. Graph databases organize information in a manner different from the relational databases of common software knowledge. In the diagram below, the concepts related to graph are linked to one another demonstrating that concept relationships form a graph.

A Multi-Domain Graph

The three previous scenarios (software, discussion, and concept) are representations of real-world systems (e.g. GitHub, Google Groups, and Wikipedia). These seemingly disparate models can be seamlessly integrated into a single atomic graph structure by means of shared vertices. For instance, in the associated diagram, Gremlin is a Titan dependency, Titan is developed by Matthias, and Matthias writes messages on Aurelius’ mailing list (software merges with discussion). Next, Blueprints is a Titan dependency and Titan is tagged graph (software merges with concept). The dotted lines identify other such cross-domain linkages that demonstrate how a universal model is created when vertices are shared across domains. The integrated, universal model can be subjected to processes that provide richer (perhaps, more intelligent) services than what any individual model could provide alone.

Process: Solving Real-World Problems with Traversals

What has been presented thus far is a single graph model of a set of interrelated domains. A model is only useful if there are processes that can leverage it to solve problems. Much like data needs algorithms, a graph needs a traversal. A traversal is an algorithmic/directed walk over the graph such that paths are determined (called derivations) or information is gleaned (called statistics). Even the human visual system viewing a graph visualization is a traversal engine leveraging saccadic movements to identify patterns. However, as graphs grow large and problems demand precise logic, visualizations and the human’s internal calculator break down. A collection of traversal examples are presented next that solve typical problems in the previously discussed domains.

Determining Circular Dependencies

With the growth of open source software and the ease by which modules can be incorporated into projects, circular dependencies abound and can lead to problems in software engineering. A circular dependency occurs when project A depends on project B and, through some dependency path, project B depends on project A. When dependencies are represented graphically, a traversal can easily identify such circularities (e.g. in the diagram below, A->B->D->G->A is a cycle).
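
Expressed in Gremlin (the traversal language used throughout the posts that follow), such a cycle check can be sketched as below. This is a sketch, not a definitive implementation: the dependsOn edge label and the variable a (holding project A's vertex) are assumptions for illustration, and the walk is capped at six hops.

// assumes project vertices joined by 'dependsOn' edges and a = project A's vertex
a.out('dependsOn').loop(1){it.loops < 7 && it.object != a}.filter{it == a}.paths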

Ranking Discussion Contributors

Mailing lists are composed of individuals with varying levels of participation and competence. When a mailing list is focused on learning through discussion, simply writing a message is not necessarily a sign of positive contribution. If an author’s messages spawn replies, then it can be interpreted that the author is contributing discussion-worthy material. However, if an author’s messages end the conversation, then they may be contributing non-sequiturs or information that is not allowing the discussion to flourish. In the associated diagram, the beige vertices are authors and their respective number is a unique author id.

One way to rank contributors on a mailing list is to count the number of messages they have posted (the author’s out-degree to messages in the mailing list). However, if the ranking must account for fruitful contributions, then authors can be ranked by the depth of the discussion their messages spawn (the tree depth of the author’s messages). Finally, note that other techniques such as sentiment and concept analysis can be included in order to understand the intention and meaning of a message.
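
In Gremlin, the simple out-degree ranking can be sketched as follows; the Author type and the wrote edge label are assumed for illustration (the reply-depth ranking would additionally loop over reply edges between messages).

// a sketch under an assumed schema: author-[wrote]->message
m = [:]
g.V.filter{it.type == 'Author'}.sideEffect{ m[it.name] = it.out('wrote').count() }.iterate()
m.sort{a,b -> b.value <=> a.value}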

Finding Related Concepts

Stephen’s understanding of graphs was developed while working on TinkerPop’s graph technology stack. Nowadays he is interested in learning more about the theoretical aspects of graphs. Via his web browser, he visits the graph Wikipedia page. In a manual fashion, Stephen clicks links and reads articles — depth-first, graph traversals, adjacency lists, etc. He realizes that pages reference each other and that some concepts are more related to others due to Wikipedia’s link structure. The manual process of walking links can be automated using a graph traversal. Instead of clicking, a traversal can start at the graph vertex, emanate outwards, and report which concepts have been touched the most. The concept that has seen the most flow is a concept that has many ties (i.e. paths) to graph (see prior algorithms). With such a traversal, Stephen can be provided a ranked list of graph-related concepts. This traversal is analogous to a wave diffusing over a body of water — albeit real-world graph topologies are rarely as simple as a two-dimensional plane (see lattice).
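
A minimal Gremlin sketch of this diffusion follows, assuming v holds the graph article's vertex and that href edges join articles; a fuller, rank-splitting version of the same idea appears in the Wikipedia post later in this document.

m = [:]
v.out('href').groupCount(m).loop(2){it.loops < 4}.iterate()
m.sort{a,b -> b.value <=> a.value}[0..9]  // the concepts touched most often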

A Multi-Domain Traversal

The different graph models discussed previously (i.e. software, discussion, and concept) were integrated into a single world model via shared vertices. Analogously, the aforementioned graph traversals can be composed to yield a solution to a cross-domain problem. For example:

“Recommend me projects to participate in that maintain a proper dependency structure, have engaging contributors promoting the space, and are conceptually related to technologies I’ve worked on previously.”

This type of problem solving is possible when a heterogeneous network of things is linked together and effectively moved within. The means of linking and moving is the graph and the traversal, respectively. To conclude this section, other useful traversal examples are provided; the first of them is sketched in code after the list.

“Compute a ‘stability rank’ for a project based on the number of issues it has and the number of issues its dependencies have, so forth and so on in a recursive manner.”

“Cluster projects according to shared (or similar) concepts between them.”

“Recommend a team of developers for an upcoming project that will use X dependencies and is related to Y concepts.”

“Rank issues by the number of projects that each issue’s submitter has contributed to.”
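
As a sketch of the first quoted traversal (the “stability rank”), assume that issue edges point from tickets into projects, that dependsOn edges join projects, and that projectId is a hypothetical vertex id; no damping is applied to deeper dependencies.

rank = 0
g.v(projectId).sideEffect{ rank += it.in('issue').count() }.out('dependsOn').loop(2){it.loops < 4}.iterate()
rank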

Graph Computing Technologies

The practice of computing is about riding the fine line between two entangled quantities: space and time. In the world of graph computing, the same tradeoffs exist. This section will discuss various graph technologies in order to identify what is gained and sacrificed with each choice. Moreover, a few example technologies are presented. Note that many more technologies exist and the mentioned examples are by no means exhaustive.

In-Memory Graph Toolkits

In-memory graph toolkits are single-user systems that are oriented towards graph analysis and visualization. They usually provide implementations of the numerous graph algorithms defined in the graph theory and network science literature (see Wikipedia’s list of graph algorithms). The limiting factor of these tools is that they can only operate on graphs that can be stored in local, main memory. While this can be large (millions of edges), it is not always sufficient. If the source graph data set is too large to fit into main memory, then subsets are typically isolated and processed using such in-memory graph toolkits.

Examples: JUNG, NetworkX, iGraph, Fulgora (coming soon)

  • [+] Rich graph algorithm libraries
  • [+] Rich graph visualization libraries
  • [+] Different memory representations for different space/time tradeoffs
  • [-] Constrained to graphs that can fit into main memory
  • [-] Interaction is normally very code heavy
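
To give a flavor of such a toolkit, the following Groovy sketch builds a small in-memory graph with JUNG and runs its PageRank implementation. The API shown is JUNG 2.x as best recalled and should be verified against the library's JavaDoc.

import edu.uci.ics.jung.graph.DirectedSparseGraph
import edu.uci.ics.jung.algorithms.scoring.PageRank

g = new DirectedSparseGraph<String,String>()
['a','b','c'].each{ g.addVertex(it) }                 // three vertices
g.addEdge('a->b','a','b'); g.addEdge('b->c','b','c'); g.addEdge('c->a','c','a')
pr = new PageRank(g, 0.15)                            // 0.15 = random-jump probability
pr.evaluate()                                         // iterate to convergence
g.vertices.each{ println "$it: ${pr.getVertexScore(it)}" }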

Real-Time Graph Databases

Graph databases are perhaps the most popular incarnation of a graph computing technology. They provide transactional semantics such as ACID (typical of local databases) and eventual consistency (typical of distributed databases). Unlike in-memory graph toolkits, graph databases make use of the disk to persist the graph. On reasonable machines, local graph databases can support a couple billion edges while distributed systems can handle hundreds of billions of edges. At this scale and with multi-user concurrency, where random access to disk and memory are at play, global graph algorithms are not feasible. What is feasible is local graph algorithms/traversals. Instead of traversing the entire graph, some set of vertices serve as the source (or root) of the traversal.

Examples: Neo4j, OrientDB, InfiniteGraph, DEX, Titan

  • [+] Optimized for local neighborhood analyses (“ego-centric” traversals)
  • [+] Optimized for handling numerous concurrent users
  • [+] Interactions are via graph-oriented query/traversal languages
  • [-] Global graph analytics are inefficient due to random disk interactions
  • [-] Large computational overhead due to database functionality (e.g. transactional semantics)
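
A local, ego-centric traversal of the kind these systems are optimized for might look as follows in Gremlin; the friend edge label and rootId are assumptions for illustration.

// friends-of-friends of a single root vertex; only the root's
// neighborhood is touched, never the full graph
g.v(rootId).out('friend').out('friend').dedup.name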

Batch Processing Graph Frameworks

Batch processing graph frameworks make use of a compute cluster. Most of the popular frameworks in this space leverage Hadoop for storage (HDFS) and processing (MapReduce). These systems are oriented towards global analytics — that is, computations that touch the entire graph dataset and, in many instances, touch the entire graph many times over (iterative algorithms). Such analyses do not run in real-time. However, because they perform global scans of the data, they can leverage sequential reads from disk (see The Pathology of Big Data). Finally, like the in-memory systems, they are oriented towards the data scientist or, in a production setting, towards feeding results back into a real-time graph database.

Examples: Hama, Giraph, GraphLab, Faunus

  • [+] Optimized for global graph analytics
  • [+] Process graphs represented across a machine cluster
  • [+] Leverage sequential access to disk for fast read times
  • [-] Do not support multiple concurrent users
  • [-] Are not real-time graph computing systems

This section presented different graph computing solutions. It is important to note that there also exist hardware solutions such as Convey’s MX Series and Cray’s YARC graph engines. The technologies discussed all share one important theme — they are focused on processing graph data. The tradeoffs of each category are determined by the limits set forth by modern hardware/software and, ultimately, theoretical computer science.

Conclusion

To the adept, graph computing is not only a set of technologies, but a way of thinking about the world in terms of graphs and the processes therein in terms of traversals. As data is becoming more accessible, it is easier to build richer models of the environment. What is becoming more difficult is storing that data in a form that can be conveniently and efficiently processed by different computing systems. There are many situations in which graphs are a natural foundation for modeling. When a model is a graph, then the numerous graph computing technologies can be applied to it.

Acknowledgement

Mike Loukides of O’Reilly was kind enough to review multiple versions of this article and in doing so, made the article all the better.

Exploring Wikipedia with Gremlin Graph Traversals

A graph is a data structure composed of vertices and edges. These terms are synonymous with the following:

  • vertices, dots, nodes, things, points
  • edges, lines, links, relations, arcs

From vertices and edges, numerous artificial and real-world systems can be modeled and processed. The purpose of this post is to explore a graph representation of Wikipedia persisted in the graph database Neo4j and processed using the graph traversal language Gremlin. Please note that a parse-friendly representation of Wikipedia was provided by DBPedia and the present analysis and presentation are provided by a collaborative effort between Aurelius and the Intelligent Systems Laboratory at PARC (a Xerox company).

On the Graph Nature of Wikipedia

There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known as a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.

Using Gremlin, some descriptive graph statistics of this representation are calculated. Along with vertex and edge counts, the in- and out-degree distributions of the graph are provided below. These distribution calculations are agnostic to the edge labels.

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new Neo4jGraph('/data/dbpedia')
==>neo4jgraph[/data/dbpedia]
gremlin> g.V.count()
==>30962172
gremlin> g.E.count()
==>191767951 
gremlin> outDegrees = [:]; inDegrees = [:]
gremlin> g.V.transform{it.out.count()}.groupCount(outDegrees).iterate()
gremlin> g.V.transform{it.in.count()}.groupCount(inDegrees).iterate()
gremlin> f = new File('out-degrees.txt')
gremlin> outDegrees.each{k,v -> f.append(k + '\t' + v + '\n')}
gremlin> f = new File('in-degrees.txt')
gremlin> inDegrees.each{k,v -> f.append(k + '\t' + v + '\n')}

The files out-degrees.txt and in-degrees.txt generated by Gremlin are loaded into R Statistics and plotted.

R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing

> out.degrees <- read.table('out-degrees.txt', sep='\t')
> in.degrees <- read.table('in-degrees.txt', sep='\t')
> plot(out.degrees, log='xy', main='Wikipedia Out-Degree Distribution', xlab='Out Degree', ylab='Frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)
> plot(in.degrees, log='xy', main='Wikipedia In-Degree Distribution', xlab='In Degree', ylab='Frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)

On the Referential Nature of Ideas

One lazy Sunday afternoon, Gremlin walks into his library. The smell of leather and mahogany overcomes his senses. His eyes scan over the shelves filled with books about every imaginable topic. On one inspired day many moons ago, Gremlin printed out each Wikipedia article, bound it, and stored it. On this particular occasion, Gremlin decides to review one of his favorite subjects: graph theory. Off the shelf, Gremlin grabs the book respectfully entitled “Graph Theory.”

gremlin> v = g.idx(T.v)[[uri:'http://dbpedia.org/resource/Graph_theory']].next()
==>v[26634]
gremlin> v.map()
==>name=graph theory
==>uri=http://dbpedia.org/resource/Graph_theory

Gremlin cracks open the cover of Graph Theory. To his amazement, he notices that the words in the book reference other books on his shelf! [line 1] He wonders, how many such references exist? [line 7]

gremlin> v.out('href').name[0..4]
==>mathematics
==>computer science
==>graph (mathematics)
==>vertex (graph theory)
==>graph of a function
gremlin> v.out('href').count()
==>240

One reference at a time, Gremlin pulls the respective book from its shelf and looks at its pages. Unbelievably, these books also make reference to other books! [line 1] How many references do the references of Graph Theory have? [line 8] Dumbfounded by the circuitous nature of ideas being defined in terms of other ideas, Gremlin comes to realize the inherent structure of the graph of knowledge — its self-referential nature [line 9].

gremlin> v.out('href').out('href').name[0..4]
==>mathematics (disambiguation)
==>math (disambiguation)
==>euclid
==>greek language
==>quantity
gremlin> v.out('href').out('href').count()   
==>21182
gremlin> v.out('href').dedup.out('href').out('href').retain([v]).paths{it.name}[0..4]
==>[graph theory, mathematics, logic, graph theory]
==>[graph theory, mathematics, history of mathematics, graph theory]
==>[graph theory, mathematics, operations research, graph theory]
==>[graph theory, mathematics, paul erdős, graph theory]
==>[graph theory, mathematics, theoretical computer science, graph theory]

There is only so much time a little green Gremlin gets to live. He wonders what he would be expert in if he continues to read book after book as dictated by the references that they maintain to one another.

gremlin> v.in('href').name[0..4]
==>semiotics of the structure
==>sones graphdb
==>trapezoid graph
==>alpha centrality
==>block graph

On the Recurrent Nature of Thought

For hours upon hours and days upon days, Gremlin goes from book to book, sometimes coming back to read a previous book over again. From the vantage point of a long-exposure film, his form looks like a blur amongst his collection. However, this blur is not evenly distributed across his library. In fact, he is more likely to be reading some books than others. Those books that are most central in the hyperlink structure are those most likely to be encountered on Gremlin’s random walk. In his ledger, Gremlin decides to record the potential that he will be reading any one book at any one time. Initially all books are equally likely. [line 1] He begins his journey at Graph Theory. [line 2] He looks up how much “potential” exists for Graph Theory. [line 3] He then retrieves all the outgoing neighbors of Graph Theory. [line 4] He determines the total number of neighbors. [line 5] He distributes Graph Theory’s potential equally amongst those neighbors. [lines 6-8] Gremlin then jumps to the books that Graph Theory refers to and continues the process over and over and over again for 100,000 steps. Starting from the Graph Theory vertex, Gremlin diffuses himself like a wave over the structure that surrounds.

gremlin> m = [:].withDefault{1}
gremlin> v.transform{
  rank = m[it.name]; 
  neighbors = it.out('href').toList();
  degree = neighbors.size();
  neighbors.each {
    m[it.name] = m[it.name] + (rank/degree);
  }
  neighbors;
}.scatter.range(0,100000).loop(3){true}.iterate()
==>null

Gremlin looks over the top 10 ranked entries in his ledger. This is where his diffusion shows the most recurrence — it articulates which book he is most likely to be reading at any one moment in the next 100,000 book reads.

gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
==>mathematics=8.2325984930913507177010807338237871640327644591
==>graph theory=6.3579261869065686025097269665928272601625649213
==>graph (mathematics)=3.8056391897097907459393869524402476473212892533
==>computer science=3.2189950001323052861187247709061777261390151631
==>real number=2.942860316400352223267769366106131293745859000
==>mathematician=2.751409475433875926984889918863144276736916
==>number theory=2.7296994022747017139496107107489077766999403538
==>combinatorics=2.6983273156966484352429094611889032326648909856
==>topology=2.6822253369205954921236082679389161746453858919
==>set theory=2.5682887400468326671894663603540368070624776875

On the Taxonomical Nature of Categories

Gremlin’s library is not just shelves with books. His library also has a card catalog in it. The catalog indexes books by category. Moreover, each book has a reference to its subject category. For instance, the subject category of Graph Theory is, well…’graph theory.’ The catalog is also structured such that the categories make reference to one another. For instance, ‘combinatorics’ is a broader category than ‘graph theory.’ For the fun of it, Gremlin decides to move between the books and the categories.

gremlin> v.out('subject').name
==>graph theory
gremlin> v.out('subject').out('broader').name
==>combinatorics
==>mathematics

In order to see the layers in a breadth-first manner, Gremlin exposes the category “rings” emanating from ‘graph theory.’

gremlin> x = [];
gremlin> v.out('subject').out('broader').gather{t = it.unique(); x.add(t*.name); t}.scatter.loop(3){it.loops < 7}.iterate()
==>null
gremlin> x
==>[combinatorics, mathematical relations]
==>[subdivisions of mathematics, discrete mathematics, mathematics, predicate logic]
==>[mathematics, subfields by academic discipline, subdivisions of mathematics, main topic classifications, abstraction, formal sciences, structure, scientific disciplines, systems of formal logic]
==>[main topic classifications, abstraction, formal sciences, structure, scientific disciplines, categories by topic, academic disciplines, mathematics, subfields by academic discipline, articles, thought, innovation, problem solving, creativity, analogy, concepts, logic, interdisciplinary fields, formalism (deductive), matter, dimension, form, science, mathematical logic, formal systems]
==>[articles, thought, structure, innovation, problem solving, creativity, analogy, concepts, academic disciplines, logic, interdisciplinary fields, scientific disciplines, formalism (deductive), matter, dimension, form, science, subfields by academic discipline, categories by parameter, academia, categories by topic, main topic classifications, abstraction, formal sciences, contents, mind, cognition, mental processes, mental content, technological change, intelligence, critical thinking, human skills, human behavior, action, quality, fundamental categories, branches of philosophy, axiology, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, subdivisions of mathematics, discrete mathematics, mathematical logic, metalogic, formal theories, logical syntax]
==>[contents, mind, cognition, mental processes, mental content, matter, concepts, dimension, form, technological change, intelligence, creativity, innovation, critical thinking, human skills, human behavior, action, quality, thought, fundamental categories, academia, categories by topic, abstraction, branches of philosophy, interdisciplinary fields, axiology, science, academic disciplines, subfields by academic discipline, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, main topic classifications, categories, education, categories by parameter, articles, structure, problem solving, analogy, logic, scientific disciplines, formalism (deductive), cognitive science, concepts in metaphysics, philosophy of mind, consciousness, epistemology, psychology, technology, economic development, technological change, and growth, neurology, neuroscience, memory, learning, educational psychology, philosophical logic, evaluation, life skills, skills, humans, behavior, phenomena, determinism, free will, motion, management, philosophy by field, value, deduction, scientific theories, philosophical theories, philosophy of mathematics, physics, concepts by field, physical cosmology, astrophysics, astronomical dynamical systems, reality, everything, life, metaphysics, philosophy of science, topological spaces, differential geometry, geometric topology, measurement, information, perception, concepts in epistemology, mathematics, subdivisions of mathematics, discrete mathematics, metaphilosophy, metatheory, theories, formal languages, formal sciences, syntax]

These rings are useful for seeing what is touched at each step of the walk. However, in order to see the inverted tree structure, Gremlin nests his traversal so that the categories are grouped under their children. Gremlin records the subsumption hierarchy as categories broaden one another.

gremlin> v.out('subject').out('broader').groupBy{it.name}{
  it.out('broader').groupBy{it.name}{
    it.out('broader').name
  }.cap.next()
}.cap.next()
==>mathematical relations=[{mathematics=[main topic classifications, abstraction, formal sciences, structure, scientific disciplines], predicate logic=[systems of formal logic]}]
==>combinatorics=[{discrete mathematics=[subdivisions of mathematics], subdivisions of mathematics=[mathematics, subfields by academic discipline]}]

Gremlin’s library is an adventure through the fascinating world of interconnected ideas.

References

Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, S., Cyganiak, R., Hellmann, S., “DBpedia – A Crystallization Point for the Web of Data,” Journal of Web Semantics, 7(3), 2009.

Rodriguez, M.A., Pepe, A., Shinavier, J., “The Dilated Triple,” Emergent Web Intelligence: Advanced Semantic Technologies, Advanced Information and Knowledge Processing series, 3–16, 2010.

Newman, M.E.J., “The Structure and Function of Complex Networks“, SIAM Review, 45, 167–256, 2003.

Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L.M.A, Chute, R., Rodriguez, M.A., Balakireva, L.L., “Clickstream Data Yields High-Resolution Maps of Science,” PLoS One, Public Library of Science, 4(3), e4803, 2009.

A TinkerPop Story

In a time long, long right now and a place far, far within, there exists a little green gremlin named…well, Gremlin. Gremlin lives in a place known as TinkerPop. For those who think of a “place” as some terrestrial surface coating a sphere that is circling one of the many massive fiery nuclear reactors in the known universe, TinkerPop is that, yet at the same time, a wholly different type of place indeed.

The unusual aspect of TinkerPop is that every possible “thing” is related to every possible “thing” in every conceivable way possible. For instance, the delightfully unpleasant Tom loves his hate for the disgustingly adorable Jerry. Even purely conceptual things enjoy an arbitrary existence: left is to the right of east and up is both beside and within down. Conscious things can experience a tropical island bench while drinking hot cocoa in front of a log burning fire.

Gremlin realizes that a logic-less land is a canvas asking to be constrained by any being wishing to create meaning. With this notion in his mind, Gremlin decides to come to terms with the TinkerPop. The little green guy, being a machine elf, is a master engineer. Literally, Gremlin is great at making friends. He goes to work constructing Blueprints. Sparing no detail too small, Gremlin tells Blueprints which aspects of the TinkerPop he decides are “true.” Blueprints diligently saves this subgraph of the TinkerPop into his underlying database.

Graph graph = new TinkerGraph();
Vertex tom = graph.addVertex(1);
Vertex jerry = graph.addVertex(2);
graph.addEdge(3, tom, jerry, "hates");

Day in and day out, Gremlin spends his time reciting to Blueprints a story like no other, where the characters and their relationships are selectively picked from the infinity that is the TinkerPop.

Bugs Bunny is one “wascally wabbit.” Basically, Bugs is about as cool as it gets. His colleague Daffy thinks he is smarter and cooler, but really he isn’t.

Over time, it becomes difficult for Gremlin to keep track of his ontology: “Smurfs can’t breathe underwater, but Snorks can…and aren’t Smurfs Snorks?!” Gremlin decides to make a friend in Frames so that Frames can help him with his story’s existential crises. Frames is programmed with a set of axioms and a set of inferences. These suffuse a consistency over his world and crystallize a metaphysics so fundamental that his characters realize no other way but the truth that Gremlin has chosen.

public interface Cat {
  @Relation(label="enemy")
  public void addEnemy(Mouse mouse);
  @Relation(label="enemy")
  public Collection getEnemies();
}

FramesManager manager = new FramesManager(graph);
Cat tom = manager.frame(graph.getVertex(1), Cat.class);
Mouse jerry = manager.frame(graph.getVertex(2), Mouse.class);
tom.addEnemy(jerry);

One day Gremlin asks Blueprints to recite back to him the story thus far.

for(Edge edge: graph.getEdges()) {
  System.out.println(edge);
}

Blueprints, in a single moment, tells of Chilly Willy, Chilly’s North Pole home, his many “ah ee, a choo!“s, that damn Woody Woodpecker, and all the other creations in Gremlin’s story. To Gremlin’s distress, the world all at once is too much for him to bear. There is no flow, no movement, simply an iterated list of unordered facts. In a stroke of genius, he heads to his shop and fabricates Pipes. Pipes’ job is to ensure that Blueprints streams Gremlin’s story in a lazy, on-demand fashion. Through Pipes, Gremlin can experience the lives of his characters, their friendships, their passions, their ups and downs — a small drama within the TinkerPop washes through Gremlin’s being and through the minds of his characters.

Pipes, would you please fetch the names of Tom’s enemies that start with the letter ‘J’?

new GremlinPipeline(graph.getVertex(1)).out("hates").property("name").filter(new PipeFunction<String, Boolean>() {
  public Boolean compute(String name) {
    return name.startsWith("J");
  }
});

Pipes and Gremlin form a close bond. Together they craft a shorthand slang for communicating. In fact, they create various mini-languages each with their benefits and drawbacks.

g.v(1).out('hates').name.filter{it.startsWith('J')}

As Gremlin experiences his story unfold, he has a sense that perhaps there are patterns in his madness. All the facts that he has told Blueprints, via Frames, and have come to re-experience through Pipes are not random in the slightest. There exists structure and general laws. To prove his hypothesis, Gremlin goes about designing and building Furnace. Once up and running, Furnace is put to work distilling the statistical properties of Gremlin’s subset of the TinkerPop.

Who are the most central Smurfs in the Smurf’s friendship graph?
Do enemies in the graph tend to be of the same gender?
Do all economic pathways lead to Scrooge McDuck?
Why do people who like Rainbow Bright also like My Little Pony?

Saddened by the seeming barrier between himself and his creation, Gremlin wonders if the characters within his story can come to realize his role in their world. Can the objects in the virtual machine reflect upon the machine that yields their existence? As his final creation, Gremlin constructs his trusty, loyal friend Rexster. Rexster spends his nights howling into the TinkerPop, hoping for someone to hear him — for someone to realize that the world they live in is but a function of Gremlin’s selective path through the infinity that is the TinkerPop. Hopefully someone will get a reference — maybe someone will make contact…and maybe someone else will ultimately realize the TinkerPop.


http://localhost:8182/graphs/tinkergraph/vertex/1

A Graph-Based Movie Recommender Engine

A recommender engine helps a user find novel and interesting items within a pool of resources. There are numerous types of recommendation algorithms and a graph can serve as a general-purpose substrate for evaluating such algorithms. This post will demonstrate how to build a graph-based movie recommender engine using the publicly available MovieLens dataset, the graph database Neo4j, and the graph traversal language Gremlin. Feel free to follow along in the Gremlin console as the post will go step-by-step from data acquisition, to parsing, and ultimately, to traversing.

The MovieLens Dataset

The GroupLens research group has made available a corpus of movie ratings. There are 3 versions of this dataset: 100 thousand, 1 million, and 10 million ratings. This post makes use of the 1 million ratings version of the dataset. The dataset can be downloaded from the GroupLens website (~6 megs in size). The raw dataset is composed of three files: users.dat, movies.dat, and ratings.dat. More information about the particulars of each file is described in the respective README.txt.

Getting Started with Gremlin

All of the code examples can be cut and pasted into the Gremlin console or into a Groovy/Java class within a larger application. For the sake of simplicity, follow along with the Gremlin console. Gremlin 1.3 is available for download at this location.

marko$ ./gremlin.sh

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> 

Generating the MovieLens Graph

Before getting recommendations of which movies to watch, it is important to first parse the raw MovieLens data according to a graph schema (sketched below). While this is not the only way in which the data could be represented as a property graph, this representation is natural and useful for the traversals presented later.
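
Since the original schema diagram is not reproduced here, the schema implied by the parsing code in the sections below is, in sketch form:

(User {userId, gender, age}) -[rated {stars}]-> (Movie {movieId, title})
(Movie) -[hasGenera]-> (Genera {genera})
(User) -[hasOccupation]-> (Occupation {occupation})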

The data will be inserted into the graph database Neo4j. The Gremlin/Groovy code below creates a new Neo4j graph, removes an unneeded default edge index, and sets the transaction buffer to 1000 mutations per commit. Finally, to make the data easier to understand, a map is hardcoded which will allow for the representation of a user’s occupation as a string instead of its integer id.

g = new Neo4jGraph('/tmp/movieRatings')
g.dropIndex("edges")
g.setMaxBufferSize(1000)

occupations = [0:'other', 1:'academic/educator', 2:'artist',
  3:'clerical/admin', 4:'college/grad student', 5:'customer service',
  6:'doctor/health care', 7:'executive/managerial', 8:'farmer',
  9:'homemaker', 10:'K-12 student', 11:'lawyer', 12:'programmer',
  13:'retired', 14:'sales/marketing', 15:'scientist', 16:'self-employed',
  17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer'] 

Parsing Movie Data

The file movies.dat contains a list of movies. Each row of the raw file has 3 columns: movieId, title, and generas.

marko$ more -n6 movies.dat 
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy

The code to parse this data into Neo4j and according to the diagrammed schema is presented below. The details of the code are left to the focused eye of the interested reader. For others, simply cutting and pasting this code into the Gremlin console will suffice to move forward.

new File('movies.dat').eachLine {def line ->
  def components = line.split('::');
  def movieVertex = g.addVertex(['type':'Movie', 'movieId':components[0].toInteger(), 'title':components[1]]);
  components[2].split('\\|').each { def genera ->
    def hits = g.idx(T.v)[[genera:genera]].iterator();
    def generaVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Genera', 'genera':genera]);
    g.addEdge(movieVertex, generaVertex, 'hasGenera');
  }
}

Parsing User Data

The file users.dat contains a list of users. Each row of the raw file has 5 columns: userId, gender, age, occupation, and zipcode. For schema simplicity, the zipcode field will be ignored.

ml-1m$ more -n6 users.dat 
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455

new File('users.dat').eachLine {def line ->
  def components = line.split('::');
  def userVertex = g.addVertex(['type':'User', 'userId':components[0].toInteger(), 'gender':components[1], 'age':components[2].toInteger()]);
  def occupation = occupations[components[3].toInteger()];
  def hits = g.idx(T.v)[[occupation:occupation]].iterator();
  def occupationVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Occupation', 'occupation':occupation]);
  g.addEdge(userVertex, occupationVertex, 'hasOccupation');
}

Parsing Ratings Data

The file ratings.dat contains the ratings that a user provided for the movies they watched. Each row of the raw file has 4 columns: userId, movieId, stars (1-5 rating scale), and timestamp. The timestamp field will be ignored.

ml-1m$ more -n6 ratings.dat 
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

Given that there are 1 million ratings, this code fragment will take a minute or so to evaluate. For those dealing with large datasets and Neo4j, it's possible to use Neo4jBatchGraph which is new with Blueprints 1.0.

new File('ratings.dat').eachLine {def line ->
  def components = line.split('::');
  def ratedEdge = g.addEdge(g.idx(T.v)[[userId:components[0].toInteger()]].next(), g.idx(T.v)[[movieId:components[1].toInteger()]].next(), 'rated');
  ratedEdge.setProperty('stars', components[2].toInteger());
}

To commit any data left over in the transaction buffer, successfully stop the current transaction. Now the data is persisted to disk. If you plan on leaving the Gremlin console, be sure to g.shutdown() the graph first.

g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS)

Before moving on to recommender algorithms, let's make sure the graph looks correct.

gremlin> g.V.count()
==>9962
gremlin> g.E.count()
==>1012657
gremlin> g.V[[type:'Movie']].count()
==>3883
gremlin> g.V[[type:'Genera']].count()
==>18
gremlin> g.V[[type:'User']].count()  
==>6040
gremlin> g.V[[type:'Occupation']].count()
==>21
gremlin> occupations.size()
==>21

As an aside: What is the distribution of occupations amongst the user population?

gremlin> m = [:]           
gremlin> g.V[[type:'User']].out('hasOccupation').occupation.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}
==>college/grad student=759
==>other=711
==>executive/managerial=679
==>academic/educator=528
==>technician/engineer=502
==>programmer=388
==>sales/marketing=302
==>writer=281
==>artist=267
==>self-employed=241
==>doctor/health care=236
==>K-12 student=195
==>clerical/admin=173
==>scientist=144
==>retired=142
==>lawyer=129
==>customer service=112
==>homemaker=92
==>unemployed=72
==>tradesman/craftsman=70
==>farmer=17

What about the average age?

gremlin> g.V[[type:'User']].age.mean()                                         
==>30.639238410596025

Traversing the MovieLens Graph

Now that the MovieLens data has been parsed and represented as a graph, the data is ready to be traversed (i.e. queried). There are two general types of recommendation: collaborative filtering and content-based recommendation.

In collaborative filtering, the rating/liking/preference behavior of users is correlated in order to recommend the favorites of one user to another, similar user.

I like the movies you like, what other movies do you like that I haven’t seen?

With content-based recommendation, if a particular item is liked, then its features are analyzed in order to find other items with analogous features.

I like romantic zombie comedies, what other romantic zombie comedies are there?

Basic Collaborative Filtering

This section will build up an ever more complex traversal in order to explain the foundation of graph-based collaborative filtering and hint at the variations that can be incorporated to meet domain-specific requirements. Let's start from Toy Story the movie.

gremlin> v = g.idx(T.v)[[title:'Toy Story (1995)']] >> 1                                             
==>v[1]
gremlin> v.map()                                        
==>movieId=1
==>title=Toy Story (1995)
==>type=Movie

  • Which users gave Toy Story more than 3 stars? (only return 5 results)
gremlin> v.inE('rated').filter{it.getProperty('stars') > 3}.outV.userId[0..4] 
==>v[3902]
==>v[3912]
==>v[3916]
==>v[3918]
==>v[3920]

This traversal doesn’t yield useful information. However, when it is used within a larger path expression, collaborative filtering is effected.

  • Which users gave Toy Story more than 3 stars and what other movies did they give more than 3 stars to? (only return 5 results)
gremlin> v.inE('rated').filter{it.getProperty('stars') > 3}.outV.outE('rated').filter{it.getProperty('stars') > 3}.inV.title[0..4]
==>One Flew Over the Cuckoo's Nest (1975)
==>Erin Brockovich (2000)
==>Bug's Life, A (1998)
==>Ben-Hur (1959)
==>Christmas Story, A (1983)

  1. Start from Toy Story — v
  2. Get the incoming rated edges — inE('rated')
  3. Filter out those edges whose star property is less than 4 — filter{it.getProperty('stars') > 3}
  4. Get the tail user vertices of the remaining edges — outV
  5. Get the rating edges of those user vertices — outE('rated')
  6. Filter out those edges whose star property is less than 4 — filter{it.getProperty('stars') > 3}
  7. Get the head movie vertices of the remaining edges — inV
  8. Get the string title property of those movie vertices — title
  9. Only emit the first 5 titles — [0..4]

What is the meaning of the aggregate of these atomic-steps? — highly co-rated. With Gremlin, it is possible to work at a higher level of abstraction by making use of user-defined steps. The step defined below is called corated and it bundles all these atomic-steps together.

gremlin> Gremlin.defineStep('corated',[Vertex,Pipe], { def stars ->
  _().inE('rated').filter{it.getProperty('stars') > stars}.outV.outE('rated').filter{it.getProperty('stars') > stars}.inV})
==>null
gremlin> v.corated(3).title[0..4]
==>One Flew Over the Cuckoo's Nest (1975)
==>Erin Brockovich (2000)
==>Bug's Life, A (1998)
==>Ben-Hur (1959)
==>Christmas Story, A (1983)

This traversal will return a list of movies. Notice that this list has many duplicates. This is due to the fact that users who like Toy Story also like many of the same other movies. This similarity of human behavior is what is capitalized on in recommendation algorithms.

  • How many of Toy Story’s highly co-rated movies are unique?
gremlin> v.corated(3).count()      
==>268493
gremlin> v.corated(3).uniqueObject.count()
==>3353

Given that there are 268,493 highly rated paths from Toy Story to other movies and only 3,353 of those movies are unique, it is possible to use these duplicates as a ranking mechanism–ultimately, a recommendation.

  • Which movies are most highly co-rated with Toy Story? (return the top 10)
gremlin> m = [:]                                                                                                  
gremlin> v.corated(3).title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] 
==>Toy Story (1995)=1655
==>Star Wars: Episode V - The Empire Strikes Back (1980)=1000
==>Star Wars: Episode IV - A New Hope (1977)=998
==>American Beauty (1999)=949
==>Matrix, The (1999)=925
==>Raiders of the Lost Ark (1981)=922
==>Silence of the Lambs, The (1991)=887
==>Saving Private Ryan (1998)=878
==>Back to the Future (1985)=876
==>Shawshank Redemption, The (1994)=875

The highest ranked result makes sense. It says that people who like Toy Story also like Toy Story. In order to remove these reflexive paths, simply filter out Toy Story.

gremlin> m = [:]                                                                                                  
gremlin> v.corated(3).filter{it != v}.title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] 
==>Star Wars: Episode V - The Empire Strikes Back (1980)=1000
==>Star Wars: Episode IV - A New Hope (1977)=998
==>American Beauty (1999)=949
==>Matrix, The (1999)=925
==>Raiders of the Lost Ark (1981)=922
==>Silence of the Lambs, The (1991)=887
==>Saving Private Ryan (1998)=878
==>Back to the Future (1985)=876
==>Shawshank Redemption, The (1994)=875
==>Toy Story 2 (1999)=871

These are all fine movies and ones that are generally considered “good.” However, note that the recommendation traversal starts not from a particular user, but from a particular movie (i.e. Toy Story). If instead the traversal starts from a particular user, then basic collaborative filtering is implemented.

Given a user, what movies do they like, who else likes those movies, and what other movies do those users like that are not already liked by that initial user.

Expressing this in Gremlin is left as an exercise to the more than casually interested reader. Feel free to email me your solution for comments (or post it to the Gremlin-Users mailing list).
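
For those who want a head start, one possible shape of such a traversal is sketched below. It is a sketch, not the canonical answer: the userId is arbitrary and the star threshold mirrors the corated step. Movies the user already rated highly are aggregated into x and excluded at the end.

u = g.idx(T.v)[[userId:1]] >> 1
m = [:]; x = [] as Set
u.outE('rated').filter{it.getProperty('stars') > 3}.inV.aggregate(x).
  inE('rated').filter{it.getProperty('stars') > 3}.outV.
  outE('rated').filter{it.getProperty('stars') > 3}.inV.
  except(x).title.groupCount(m) >> -1
m.sort{a,b -> b.value <=> a.value}[0..9]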

Mixing in Content-Based Recommendation

The previous section returned generally good movies. However, if I’m a parent and I want to find a good movie like Toy Story for my children to watch, I probably do not want them watching Silence of the Lambs. To remedy this situation, it is possible to mix collaborative filtering and content-based recommendation into a traversal that yields “Toy Story good” movies of the same genera.

gremlin> v.out('hasGenera').genera
==>Animation
==>Children's
==>Comedy

  • Which movies are most highly co-rated with Toy Story that share a genera with Toy Story? (return the top 10)
gremlin> m = [:]
gremlin> x = [] as Set
gremlin> v.out('hasGenera').aggregate(x).back(2).corated(3).filter{it != v}.out('hasGenera').retain(x).back(2).title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
==>American Beauty (1999)=949
==>Back to the Future (1985)=876
==>Toy Story 2 (1999)=871
==>Princess Bride, The (1987)=851
==>Groundhog Day (1993)=843
==>Shakespeare in Love (1998)=807
==>Forrest Gump (1994)=775
==>Men in Black (1997)=747
==>E.T. the Extra-Terrestrial (1982)=737
==>Bug's Life, A (1998)=700

This ranking makes sense, but still, for the stiflingly worrisome parent, movies like American Beauty (classified as a comedy) may not be considered appropriate for children. How about only considering those movies that share all the same generas with Toy Story?

  • Which movies are most highly co-rated with Toy Story that share all generas with Toy Story? (return the top 10)
gremlin> m = [:] 
gremlin> x = [] as Set                                                                                                                                    
gremlin> v.out('hasGenera').aggregate(x).back(2).corated(3).filter{it != v}.filter{it.out('hasGenera')>>[] as Set == x}.title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]                                                                                                    
==>Toy Story 2 (1999)=871
==>Bug's Life, A (1998)=700
==>Chicken Run (2000)=465
==>American Tail, An (1986)=140
==>Aladdin and the King of Thieves (1996)=39
==>American Tail: Fievel Goes West, An (1991)=37
==>Rugrats Movie, The (1998)=34
==>Adventures of Rocky and Bullwinkle, The (2000)=21
==>Saludos Amigos (1943)=4

Conclusion

Recommendation is a popular technique used for separating the wheat from the chaff. For systems with numerous items, the algorithms presented can help a user find items that satiate not only their tastes, but also their momentary needs. It is important to note that at some level of abstraction, collaborative filtering and content-based recommendation are the same. This idea is made salient when considering that ratings are yet another feature of an item. More specifically, highly co-rated movies of a movie are features of that movie. This idea gets to the flexibility of the property graph data structure and the notion of abstract/inferred/derived relationships.

The itemization below presents recommendation themes that can be further explored by the interested reader. The graph allows an analyst to slice and dice the data in various ways. With enough data (many types of things and many types of relationships), there are numerous mappings that can be made. That is the power of the graph traversal pattern.

  • Make use of genera taxonomies to explore genera generalization and specification.
  • Determine if gender is a significant factor in the preference of movies.
  • Recommend movies with filtering based on the occupation (or age) of the user.
  • Mix in other movie features such as director, actors, and motion picture ratings.
  • Make use of a collection of start movies (e.g. “Given Toy Story and Big Trouble in Little China, …”). Realize that a single user can be seen as a reference to such a collection (the movies they highly rated).
  • “What do the users prefer who don’t like the same movies as me?”
  • “What do the users dislike who like the movies I dislike?” (non-collaborative filtering)
  • Use gender, zipcode, and ratings to recommend users a movie watching date (“watch movie X with user Y“).
  • Use range filters (e.g. [0..100]) or random walks for fine-grained control of the execution speed (i.e. path sampling).

Related Articles

Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Lecture Notes in Artificial Intelligence, volume 5712, pp. 813–820, April 2009.

Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques and Applications, eds. S. Sakr, E. Pardede, IGI Global, ISBN:9781613500538, August 2011.

Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., “A Recommender System to Support the Scholarly Communication Process,” KRS-2009-02, April 2009.

On the Nature of Pipes

Pipes is a data flow framework developed by TinkerPop. The graph traversal language Gremlin is a Groovy-based domain-specific language for processing Blueprints-enabled graph databases with Pipes. Since the release of Pipes 0.7 on August 1, 2011, much of the functionality in Gremlin has been generalized and made available through Pipes. This has opened up the door for other JVM languages (e.g. JRuby, Jython, Clojure, etc.) to serve as host languages for graph traversal DSLs. In order to promote this direction, this post will explain Pipes from the vantage point of Gremlin.

An Example Graph Dataset

A graph is a data structure made up of vertices and edges. A property graph is a popular term for a graph data model that has labeled edges and arbitrary key/value pairs associated with both vertices and edges (read more here). In graph theoretic language, a property graph is known as a directed, attributed, multi-relational graph.

All of the graph traversal examples presented in this post make use of the above diagrammed graph. A minimally extended version of this dataset is distributed with Gremlin and is available at this location (represented in GraphML). For readers that have Gremlin 1.2 downloaded, it’s possible to load up this graph and go through each of the following examples using the Gremlin console. The in-memory graph database TinkerGraph is used throughout. However, note that Gremlin, Pipes, and Blueprints currently work for the following graph database vendors: Neo4j, OrientDB, DEX, and Sail-based RDF stores (e.g. Stardog).

gremlin$ ./gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> g.loadGraphML('data/graph-example-1.xml')
==>null
gremlin> 

Three Types of Pipes

The Pipes API is a collection of atomic “pipes.” A pipe is a simple function that maps an input to an output. The input to a pipe is called a “start” and the output is called an “end.” The general signature of a pipe is Pipe<S,E>. By chaining pipes together, complex computations can be effected. There are 3 general types of pipes.

  1. transform: Given an object of type S, emit an object of type E (see JavaDoc).
  2. filter: Given an object of type S, emit it or not (see JavaDoc).
  3. sideEffect: Given an object of type S, emit it, but yield some computational side-effect in the process (see JavaDoc).

The sections to follow will explain these general forms with specific examples over the example dataset diagrammed previously.

Transform Pipes

During a transformation, an input of type S is mapped/transformed to an output of type E. Below is a basic example where all the steps/pipes used are transformational. The term “step” is used in Gremlin (see documentation). The term “pipe” is used in Pipes. The two terms are nearly analogous.

gremlin> g.v(1).out.name
==>vadas
==>lop
==>josh

In Gremlin, g.v(1) is a method call that returns vertex 1 from the graph. That vertex is passed to the out step (OutPipe) which yields vertices 2, 3, and 4. Finally, the name step (PropertyPipe(“name”)) yields the string name properties of those 2, 3, and 4 vertices.

gremlin> g.v(1).out
==>v[2]
==>v[3]
==>v[4]
gremlin> g.v(1).out.name
==>vadas
==>lop
==>josh

It is important to note that Pipes is not “set-based” in the sense that vertices 2, 3, and 4 are generated before passing those vertices off to the name step (the next stage in the pipeline). On the contrary, vertex 2 is generated, then vadas is generated. Next vertex 3 is generated and then lop. Finally, vertex 4 is generated and then josh. Pipes is a lazy evaluation framework in that as the pipeline is iterated for results, the mappings are computed as needed.
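
This laziness can be observed directly by holding the pipeline in a variable and pulling results one at a time (a sketch in plain Groovy rather than the auto-iterating console; a pipe implements Iterator, so next() computes only as much of the mapping as is needed).

pipe = g.v(1).out.name   // nothing has been computed yet
println pipe.next()      // computes v[2], then 'vadas'
println pipe.next()      // computes v[3], then 'lop', on demand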

One of the core methods of a pipe is List Pipe.getPath(). This method returns the transformations that an object underwent through a pipeline. This method is demonstrated using the Gremlin step paths (PathPipe).

gremlin> g.v(1).out.name.paths
==>[v[1], v[2], vadas]
==>[v[1], v[3], lop]
==>[v[1], v[4], josh]

Note that any Gremlin expression has a string representation that exposes the underlying pipes being used to represent the expression.

gremlin> g.v(1).out.name.paths.toString()
==>[OutPipe, PropertyPipe(name), PathPipe] 

Filter Pipes

Filter steps/pipes are used to remove particular objects from the flow of a computation. The example below demonstrates one of the significant new features of Pipes 0.7. In Pipes 0.7, there is a PipeClosure (see JavaDoc) interface which can be used with closure-based pipes to provide dynamic functionality. In the examples below, filter and transform are the most generic types of filter- and transformation-based pipes, where the filtering and transformation algorithms are specified with a Groovy/Java code snippet. In essence, the provided closure “closes” the meaning/function of the pipe.

Closures typically appear in languages in which functions are first-class values—in other words, such languages allow functions to be passed as arguments, returned from function calls, bound to variable names, etc., just like simpler types such as strings and integers. — Wikipedia

gremlin> g.v(1).out('knows')
==>v[2]
==>v[4]
gremlin> g.v(1).out('knows').filter{it.age < 30} 
==>v[2]
gremlin> g.v(1).out('knows').filter{it.age < 30}.name
==>vadas
gremlin> g.v(1).out('knows').filter{it.age < 30}.name.transform{it.length()}
==>5
gremlin> g.v(1).out('knows').filter{it.age < 30}.name.transform{it.length()}.toString()
==>[OutPipe([knows]), FilterClosurePipe, PropertyPipe(name), TransformClosurePipe]

This example demonstrates how other JVM languages like JRuby (which support closures) and Clojure (which supports anonymous functions) can take advantage of the Pipes framework. In Gremlin, a Groovy closure is wrapped in a GroovyPipeClosure which makes the Groovy closure behave according to the PipeClosure interface and thus, pluggable into Pipes. Any other JVM language with similar features can capitalize on this same pattern.

Side-Effect Pipes

The third type of pipe is a side-effect pipe. A side-effect pipe implements SideEffectPipe<S,T>: it takes an object of type S and emits that object. However, in the process, it yields some computational side-effect (of type T) that can, for instance, be used later in the traversal computation. The example below makes use of the most generic side-effect pipe, SideEffectClosurePipe, to calculate vertex 1's co-developers — that is, those vertices that have worked with vertex 1 on a project. Note that a vertex cannot be its own co-developer and thus, sideEffect and filter are used together to express this abstract co-developer relationship.

gremlin> g.v(1).sideEffect{x=it}
==>v[1]
gremlin> g.v(1).sideEffect{x=it}.out('created')
==>v[3]
gremlin> g.v(1).sideEffect{x=it}.out('created').in('created')
==>v[1]
==>v[4]
==>v[6]
gremlin> g.v(1).sideEffect{x=it}.out('created').in('created').filter{it != x}
==>v[4]
==>v[6]
gremlin> g.v(1).sideEffect{x=it}.out('created').in('created').filter{it != x}.toString()
==>[SideEffectClosurePipe, OutPipe([created]), InPipe([created]), FilterClosurePipe]

In the example above, the variable x stores the current vertex being pushed through the SideEffectClosurePipe. This variable is later accessed by the FilterClosurePipe to filter out vertices that match the source of the traversal. In other words, path traversals that make use of history can be effected using such a pattern (i.e. non-regular graph traversals).

Branch and Meta Pipes

There are two pipe sub-types that should be discussed: branch and meta pipes. A branch pipe is any pipe that alters the flow of a computation based on some decision criteria (see JavaDoc). The ifThenElse step of Gremlin demonstrates the behavior of the closure-based IfThenElsePipe.

gremlin> g.v(1).out('knows').ifThenElse{it.age < 30}{it.name}{it.out('created').name}           
==>vadas
==>ripple
==>lop

The ifThenElse step takes 3 closures: a condition-closure, a then-closure, and finally, an else-closure. Other branch pipes include copySplit and the respective merging pipes which take multiple branches and yield a single flow: fairMerge and exhaustMerge.
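
As a rough sketch of these branching steps (the exact signatures vary across Gremlin versions), the following would split each friend-vertex down two sub-pipelines — its name and the names of its created projects — and then merge the two result streams in round-robin fashion:

g.v(1).out('knows').copySplit(_().name, _().out('created').name).fairMerge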

The meta-pipes are perhaps the most bewildering yet most powerful of the pipes in Pipes (see JavaDoc). A meta-pipe is a pipe that maintains pipes internal to it. As such, the meta-pipe can reason on the results of its internal pipes and then effect an appropriate behavior at the meta-pipe level. To better explain the concept of a MetaPipe, backtracking and looping are demonstrated.

gremlin> g.v(1).out('knows').name
==>vadas
==>josh
gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}
==>vadas
gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}.back(2)
==>v[2]

The above Gremlin code snippet and diagram demonstrate the behavior of back (BackFilterPipe). If an object reaches back, emit the object that it was n steps ago. In the example above, n is 2 and thus, two steps ago from the string vadas was vertex 2. While this linear sequence articulates backtracking, the diagram below better explains how this computation is implemented in Pipes.

gremlin> g.v(1).out('knows').name.filter{it[0]=='v'}.back(2).toString()
==>[OutPipe([knows]), BackFilterPipe[[PropertyPipe(name), FilterClosurePipe]]]

While not diagrammed, it is possible to use named-steps instead of integers (e.g. 2) to denote how many steps to back up to. As an expression becomes more complex, using named-steps makes the expression more readable and manageable.

gremlin> g.v(1).out('knows').as('here').name.filter{it[0]=='v'}.back('here')
==>v[2]

Finally, to conclude this section, the meta-pipe LoopPipe is explained. LoopPipe is both a meta-pipe and a branch pipe in that it can control the flow of objects and, like BackFilterPipe, makes use of internal pipes during its operation.

gremlin> g.v(1).out.loop(1){it.loops < 3}
==>v[5]
==>v[3]

In the example above, the loop step passes the current object to the provided “while”-based condition-closure. If the closure yields true, the object is inserted into the pipe from n steps ago. Moreover, metadata is appended to the object. In the example above, the loops metadata denotes how many times the object path has gone through the loop pipe. Thus, when the object path has gone through 3 times, the condition-closure fails and the object is passed on to the next stage of the computation/pipeline.

gremlin> g.v(1).out.loop(1){it.loops < 3}.name
==>ripple
==>lop

While a linear interpretation is possible, underneath the hood, loop is a meta-pipe. The diagram below better represents this.

gremlin> g.v(1).out.loop(1){it.loops < 3}.name.toString()
==>[LoopPipe[[OutPipe]], PropertyPipe(name)]

Conclusion

There are numerous pipes that come with the Pipes distribution. These pipes can be mixed and matched to yield lazy graph traversals. A description of all the pipes in Pipes 0.7 is provided in the associated JavaDoc. Note that it is easy to create a pipe by simply extending AbstractPipe and implementing the protected method AbstractPipe.processNextStart(). This is handy for developers wanting to create domain-specific pipes and use them in Pipes-based DSLs like Gremlin. Two trivial examples are provided below.

import com.tinkerpop.pipes.AbstractPipe;
import com.tinkerpop.pipes.filter.FilterPipe;

// a transform pipe: maps each incoming String to its character count
public class NumCharsPipe extends AbstractPipe<String,Integer> {
  protected Integer processNextStart() {
    return this.starts.next().length();
  }
}

// a filter pipe: only emits those Strings with more than 4 characters
public class MoreThanFourCharsPipe extends AbstractPipe<String,String> implements FilterPipe<String> {
  protected String processNextStart() {
    while(true) {
      String s = this.starts.next();
      if(s.length() > 4)
        return s;
    }
  }
}

gremlin> Gremlin.defineStep('numChars', [Pipe], {new NumCharsPipe()})
==>null
gremlin> Gremlin.defineStep('moreThan4Chars', [Pipe], {new MoreThanFourCharsPipe()})
==>null
gremlin> g.v(1).out.name.numChars                                                
==>5
==>3
==>4
gremlin> g.v(1).out.name.moreThan4Chars
==>vadas

NOTE: Gremlin provides an easier way to implement user-defined steps. The example above is provided to demonstrate the boundary between the Java-based Pipes and the Groovy-based Gremlin.
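
For instance, the same two steps can be sketched without any Java at all by “closing” the generic transform and filter steps — a sketch assuming the _() sub-pipeline syntax of more recent Gremlin versions:

Gremlin.defineStep('numChars', [Pipe], {_().transform{it.length()}})
Gremlin.defineStep('moreThan4Chars', [Pipe], {_().filter{it.length() > 4}})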

The TinkerPop crew plans to further open up the Pipes data flow model to other JVM languages (for example, see Pacer). Pipes 0.7 introduced the flexible PipeClosure concept which will make it easy for DSL designers to make use of Pipes as the backend engine of their graph traversal language. Moreover, with this model, there is a central point for optimization and feature development which will hopefully ensure a bright future for lazy graph traversal languages implemented on the JVM.

TinkerPop InfoGraphic

Graphs, Brains, and Gremlin

SMI32-stained pyramidal neurons in the cerebral cortex. Provided by UC Regents Davis under Creative Commons Attribution 2.5 License

What do graphs and brains have in common? First, they both share a relatively similar structure: Vertices/neurons are connected to each other by edges/axons. Second, they both share a similar process: traversers/action potentials propagate to effect some computation that is a function of the topology of the structure. If there exists a mapping between two domains, then it is possible to apply the processes of one domain (the brain) to the structure of the other (the graph). The purpose of this post is to explore the application of neural algorithms to graph systems.

The Rosetta Stone of Graphs

Before diving specifically into graphs and brains, it's important to see that at a certain level of generality, many real-world structures are graphical in form. “A Rosetta Stone for Connectionism” (full download) is a Santa Fe Institute article that presents mappings between a generic graph and various dynamical systems from the vantage point of connectionism (i.e. artificial neural networks).

“The term connectionism is usually applied to neural networks. There are, however, many other models that are mathematically similar, including classifier systems, immune networks, autocatalytic chemical reaction networks, and others. In view of this similarity, it is appropriate to broaden the term connectionism. I define a connectionist model as a dynamical system with two properties: (1) The interactions between the variables at any given time are explicitly constrained to a finite list of connections. (2) The connections are fluid, in that their strength and/or pattern of connectivity can change with time.”

In the article, Dr. J. Doyne Farmer goes on to provide the “Rosetta Stone” for connectionism. He draws parallels between abstract graphical structures (generic) and real-world systems that respect the generic graphical form (e.g. neural networks). The thesis table from the article is reproduced below with some slight modifications (italicized) to identify the terminology of the currently popularized property graph data model which is discussed in the second half of this post. Of particular interest to this post are the first two columns (generic/graph and neural network/brain).

| Generic                     | Neural Network            | Classifier System        | Immunity Network   | Autocatalytic Network       |
|-----------------------------|---------------------------|--------------------------|--------------------|-----------------------------|
| node, vertex                | neuron                    | message                  | antibody type      | polymer species             |
| state, vertex properties    | activation level          | intensity                | free antibody      | polymer concentration       |
| connection, edge            | axon/synapse              | classifier               | chemical reaction  | catalyzed chemical reaction |
| parameters, edge properties | connection weight         | strength and specificity | reaction affinity  | catalytic velocity          |
| interaction rule, traversal | sum/sigmoid               | linear threshold         | bell-shaped        | mass action                 |
| learning algorithm          | Hebbian, back propagation | bucket brigade           | clonal selection   | approach to attractor       |
| graph dynamics              | synaptic plasticity       | genetic algorithms       | genetic algorithms | artificial chemistry rules  |

Why are brains interesting for those doing software engineering, data management, and information retrieval? Simply put, the brain provides a unique information processing/management style that is very different from the style of today’s popular relational and document databases. For the brain/graph, the world is a fuzzy realm of similarities and associations amongst highly interconnected data. For the relational/document database, the world is discrete and exact with data being atomic and reductionistic. Today’s graph databases provide a data organization model that mimics many of the structural qualities of the brain and as such, can efficiently evaluate neurally-inspired algorithms.

Spreading Activation and the Brain

Spreading activation is an algorithm that simulates the real-world process of a cascade of action potentials (neuron depolarizations). When a neuron reaches a certain level of depolarization, it releases neurotransmitters which, when in sufficient quantity and of the excitatory type, can trigger the depolarization of any number of adjacent neurons. The wave of action potentials across a neural network is known as a spreading activation.

Sensations to Concepts

This “spreading of activation” throughout the brain yields human cognition. Here is a particular example based on vision: light enters the eye, hits the retina, and triggers the depolarization of certain ganglia cells in the retina, which propagate a spreading activation through to the thalamus, and then off to the visual cortex in the back of the brain. It is in the visual cortex and up through to the rest of the cerebral cortex that the spreading activation is “interpreted” (i.e. propagated) via a hierarchy of neurons that denote ever more invariant structures. An invariant can be understood as another word for a concept. For instance, irrespective of seeing the front of a lion, the side of a lion, or even a drawing or sound of a lion, the conception of a lion is realized. Irrespective of the constant barrage of seemingly random signals entering through the human sensory system, a stable, consistent world of objects is ultimately the conscious experience of the human. For the human, the world is not a 2D space of rapidly changing bits of color. Instead, it's a 3D space of stable objects and their physical and conceptual relationships to one another. Mental concepts suppress/compress/generalize sensory novelty.

Where do concepts begin? Do they begin in the upper levels of the cortical hierarchy? The answer is “no.” Concepts start immediately. Every step of a spreading activation triggers more complex concepts—i.e. higher-level invariants. The further the propagation, the more invariant (abstract/general/encompassing) the concept. A sensation activates neurons which denote simple concepts such as lines at different angles. These action potentials then trigger more complex concepts such as a cylinder, which merge with color concepts and fur concepts to yield a lion's tail and, ultimately, the complex concept of a lion. A “lion” is the excitation of various pathways that uniquely converge to elicit the conception/qualia of a lion (see “the binding problem“). Spreading activation can be expressed as the process of the branching and merging of a sensory signal through various layers of neural tissue, where each layer contains neurons which denote more abstract invariants.

Learning and Behavior

Concepts are triggered from the realization of other concepts. Humans learn the associations between concepts through experience. Quite simply, at the neural level, in Hebbian-based learning, “neurons that fire together, wire together.” The co-occurrence of activation potentials in a cluster of neurons leads them to “bundle,” such that in the future, if one neuron fires, there is a high probability that the others will fire as well. Given this rule, humans associate lions with danger and danger with running. The associations between concepts may elicit behavior (“that's a lion, run!”) by directing the propagation of the spreading activation to the motor cortex. The motor cortex translates conceptual actions into the manipulation of muscles that ultimately yield the human's behavior in the world. On and on, in a constant interactive process (“the lion is gaining, run faster!”) we are tossed about reality. We spend so much of our conscious lives embalmed in a web of high-level abstractions that we never realize the insanity of the relentless novelty right in front of our eyes. For we “are” the cortex.

Gremlins in the Brain

When data is in a graph database and optimized for graph traversals, it's possible to take advantage of the algorithms distilled from cognitive neuroscience research. In fact, with graph traversal languages like Gremlin, neurally-inspired data management and computing is only a few keystrokes away. This section draws parallels between the neural algorithms discussed previously and those aspects of Gremlin that make such algorithms feasible to implement.


NOTE: All the examples presented are as of Gremlin 1.2-SNAPSHOT+. The code fragments rely on a few tweaks introduced since Gremlin 1.1.

Action Potentials

A traversal description will propagate traversers (i.e. gremlins) over vertices and edges in the graph. These traversers can update the state of a vertex (or edge) by simply manipulating its properties (refer to the first column of the table above). For example, given a set of root/start vertices, the following code will increment the activation/depolarization of each vertex adjacent to the root vertices.

root.out.sideEffect{it.activation++}

As a vertex becomes more activated, it may “fire” if a certain threshold of activation is reached. When the vertex fires, it provides activation to its adjacent vertices. Building on the previous code fragment, spreading activation continues if and only if the activation has reached a particular threshold. The following pattern can be used to simulate this behavior, where aggregate will ensure a breadth-first evaluation.

root.out.sideEffect{it.activation++}.aggregate.filter{it.activation > threshold}

In a weighted graph, not all edges are created equal. In such situations, the firing of vertex A may have less of an effect on vertex B than vertex C has on B. This is the case where the strength of tie is a function of the edge weight. To ensure that the activation level of a vertex is respective of the edge weight, the following code fragment can be used. The code simply saves the edge weight to x and then applies it to the vertex’s activation.

root.outE.sideEffect{x=it.weight}.inV.sideEffect{it.activation+=x}.aggregate.filter{it.activation > threshold}

To proceduralize this computation, represent it as a single step called spread which takes a threshold parameter.

Gremlin.defineStep('spread', [Vertex,Pipe], { def threshold ->
 _().outE.sideEffect{x=it.weight}.inV.sideEffect{it.activation+=x}.aggregate.filter{it.activation > threshold}
});

In the newly abbreviated form, the following code iterates a spreading activation for some indeterminate number of steps (until no more vertices are activated) with the parameterized threshold arbitrarily set to 50.

root.spread(50).loop(1){true}

In summary, given a set of root vertices, a spreading activation is propagated through the graph, where only those pathways that receive enough stimulation/depolarization are followed.


NOTE: The activations at each step are not reset back to 0. As such, activation builds up over the course of the propagation. A simple modification to the filter step rectifies this situation. However, for the sake of simplicity/readability, it was left out.

filter{go=(it.activation > threshold); it.activation=0; go}

Hebbian Learning

When a cluster of neurons synchronously fire, they are reacting to the same stimuli and thus, can be seen as different interpretations of the same phenomena. In a similar fashion, concepts are associated with each other when their co-realization occurs (e.g. water is wet).

root.spread(50).gather{ def vertices -> 
  vertices.each { a ->
    vertices.each{ b ->
      if(a != b) {
        if(!a.out.asList().contains(b))
          g.addEdge(a,b,"connection",[weight:1]);
        else 
          a.outE.inV{it==b}.back(2).sideEffect{it.weight++};
      }
    }
  }
}.scatter.loop(3){true}

In the above code, at each step of the propagation, “vertices that fire together, wire together.” For those vertices that already share an edge, the strength/weight of their edge is incremented. One of the fascinating aspects of the Hebbian algorithm is that as the graph is being processed, its structure is changing to reflect in situ learning. In this way, query and manipulation are coupled.

Finally, it is possible to make the Hebbian code fragment into a single step as was previously done for spreading activation. Together, spreading activation with Hebbian-based learning can be expressed in Gremlin as follows:

root.spread(50).learn.loop(2){true}
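
Here, learn is assumed to be a user-defined step that wraps the gather/scatter fragment from the previous example; a sketch of its definition:

Gremlin.defineStep('learn', [Vertex,Pipe], {
  _().gather{ def vertices ->
    vertices.each{ a ->
      vertices.each{ b ->
        if(a != b) {
          if(!a.out.asList().contains(b))
            g.addEdge(a,b,"connection",[weight:1]);
          else
            a.outE.inV{it==b}.back(2).sideEffect{it.weight++};
        }
      }
    }
    vertices // return the gathered vertices so scatter can unroll them
  }.scatter
});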

Conclusion

A graph is a generic structure. The traversal is a generic process. Traversals over graphs have many similarities to spreading action potentials over brains. This post presented the mapping between graphs and brains and provided example Gremlin code snippets that mimic standard neural algorithms. There are many variations to the themes presented that can be explored and tailored to the developer’s particular problem requirements. Moreover, given that most graph databases make use of the property graph data model, hybrid algorithms that mix semantics with spreading activation can be explored to various fruitful ends. To conclude, a collection of interesting extensions are itemized.

  1. Inhibitory traversals: there are excitatory and inhibitory neurotransmitters. The function of inhibition is to reduce noise in a signal, accentuate conceptual boundaries, and ultimately ensure that a spreading activation only goes down particular pathways.
  2. Forgetting: Over time, low weighted edges can be pruned or specific regions of the graph can be “lesioned” to promote future growth to future stimuli.
  3. Hierarchical architectures: invariants can be realized by making use of structural hierarchies within the graph. This supports classification as well as concept learning and association.
  4. Neural growth: new vertices can be introduced into the graph in order to support the generation of yet more concepts and mappings to stimuli.
  5. Processing regions: it is possible to compartmentalize particular areas of the graph and wire them together as a way of organizing the structure of the graph into functional subgraphs.

References

Hawkins, J., Blakeslee, S., “On Intelligence,” Times Books, September 2004.

Rodriguez, M.A., Ham, M.I., Gintautas, V., Kunsberg, B.S., “A Prospectus on the Obstacles Inhibiting the Implementation of Advanced Artificial Neural Systems – Part 1,” Decade of Mind IV Conference, Albuquerque, New Mexico, January 2009.

Riesenhuber, M., “The HMAX Model,” The Laboratory for Computational Cognitive Neuroscience at Georgetown University.

Rodriguez, M.A., “From the Signal to the Symbol: Structure and Process in Artificial Intelligence,” Center for Nonlinear Studies Post Doctorate Seminar, Los Alamos National Laboratory, Los Alamos, New Mexico, November 2008.

Langer, S.K., “Philosophy in a New Key: A Study in the Symbolism of Reason, Rite, and Art,” Harvard University Press, January 1957.

Healy M.J., Caudell, T.P., “Ontologies and Worlds in Category Theory: Implications for Neural Systems,” Axiomathes, 16(1-2), pp. 165-214, Springer, 2006.

Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks,” Knowledge-Based Systems, 21(7), pp. 727-739, Elsevier, 2008.

Note on images: All non-captioned images from Wikipedia are public domain.

Graph Pattern Matching with Gremlin 1.1

Gremlin 1.1 was released on June 15, 2011. A major aspect of this release includes traversal-based graph pattern matching. This post provides a review of this functionality.

NOTE: For those that are more than casually interested, the GraphML representation of the graph diagrammed above and used throughout the examples is provided here. To evaluate the examples, download Gremlin 1.1 and load the graph with the following commands.

g = new TinkerGraph() // in-memory graph database
g.loadGraphML('traversal-pattern-example.xml')

Traversing vs. Pattern Matching

There are two schools of thought regarding graph querying: graph traversing and graph pattern matching. A graph traversal is usually accomplished by

  1. determining a set of root/start vertices,
  2. emanating from those vertices according to some abstract path description,
  3. and finally, yielding some side-effect in the process (e.g. the end of the path, intermediate computations, etc.).

The diagram above provides a simple example of a graph traversal in Gremlin. The root is vertex 1 and the abstract path description describes what the root’s friends have purchased. The result of this traversal yields the vertices 3, 4, and 9 (i.e. the side-effect is the end of the path).

A graph pattern match, on the other hand, is accomplished by providing an abstract (sub)graph pattern with any number of grounded instances (vertices, edges in the graph) and variables (called bindings). The query engine will then determine all subgraphs of the larger graph that fit that pattern. In the diagram above, the popular pattern match language SPARQL is used, where 1, knows, and bought serve as grounds and ?x and ?y serve as variables. Given the query in the diagram, what is returned is a table of results.

 -------------
| ?x   | ?y   |
|-------------|
| v[8] | v[9] |
| v[2] | v[4] |
| v[2] | v[3] |
 -------------

Beyond the declarative/imperative aspects of the two models, what differentiates graph traversing and graph pattern matching is how state is saved (variables are bound) through the course of an evaluation. In order to bridge these two approaches, Gremlin 1.1 now provides a table data structure and the ability to name steps in a path. These two features conveniently provide traversal-based graph pattern matching. Now it is possible to emulate the behavior of SPARQL in property graphs while simultaneously leveraging the traversal aspects of Gremlin.

SELECT ?x ?y WHERE {
  1 knows ?x .
  ?x bought ?y 
}

 -------------
| ?x   | ?y   |
|-------------|
| v[8] | v[9] |
| v[2] | v[4] |
| v[2] | v[3] |
 -------------

gremlin> t = new Table()                                              
gremlin> g.v(1).out('knows').as('x').out('bought').as('y').table(t)
==>v[9]
==>v[4]
==>v[3]
gremlin> t
==>[x:v[8], y:v[9]]
==>[x:v[2], y:v[4]]
==>[x:v[2], y:v[3]]

In the Gremlin example above, the path expression yields the end of the path (vertices 3, 4, and 9). However, the table t is lazily filled with the intermediate step data as demarcated by the as steps. In this way, t is the table of variable bindings.

NOTE: For those familiar with Gremlin 0.7–1.0, the paths step allows for similar functionality (see Paths Pattern). The difference being that with table, steps can be named and a convenient table data structure can be constructed that provides methods for slicing and dicing a table and for post-processing table entries. Moreover, as will be demonstrated, there are other ramifications afforded that are difficult to accomplish simply with paths.

Graph Pattern Matching Examples

What has been presented thus far are only the basics. By making use of other steps in Gremlin and the methods afforded by Table, sophisticated pattern matching behavior can be effected. There are many possibilities now that traversing and pattern matching are combined. A “grab bag” of useful and far-reaching examples is provided below.


Range Filter as Limit and Offset

In SPARQL there are the LIMIT and OFFSET clauses, which can be used to reduce the number of results returned.

SELECT ?x ?y WHERE {
  1 knows ?x .
  ?x bought ?y 
}
LIMIT   5
OFFSET  10

In Gremlin, use a range filter to limit the number of rows added to the table.

g.v(1).out('knows').as('x').out('bought').as('y')[10..15].table(t)

Of course, for the simple “knows/bought”-graph diagrammed, there are not 15 results.


Filter Steps as Pattern Match Filters

Graph pattern matching languages usually allow for the application of filters. A filter will match all patterns where some literal of the pattern satisfies a given predicate (e.g. age is greater than 21, name starts with “M”, etc.). Below, a SPARQL query is provided where the name of vertex 1's friends who bought products must start with “pe” in order to match the pattern.

SELECT ?x ?y WHERE {
  1 knows ?x .
  ?x name ?name .
  ?x bought ?y .
  FILTER regex(?name, "pe.*")
}

By using standard Gremlin filters within a path description, the same result is returned.

g.v(1).out('knows').filter{it.name.matches('pe.*')}.as('x').out('bought').as('y').table(t)

   [x:v[8], y:v[9]]

Unique to Return Distinct Results

In SPARQL, it's possible to remove duplicates using DISTINCT.

SELECT DISTINCT ?x ?y WHERE {
  1 knows ?x .
  ?x bought ?y .
}

With Gremlin, there is the method Table.unique(). Once a table of results has been generated, evaluate Table.unique(). The example below is not based on the diagrammed graph as there are no duplicates in that example.

gremlin> t
==>[b, 2]
==>[a, 1]
==>[a, 1]
gremlin> t.unique()
==>[b, 2]
==>[a, 1]

Sort to Order Results

In SPARQL, it's possible to order results using ORDER BY.

SELECT DISTINCT ?x ?y WHERE {
  1 knows ?x .
  ?x bought ?y .
}
ORDER BY ?x

With Gremlin, there is the method Table.sort(). Once a table of results has been generated, evaluate Table.sort(). Again, the example below is not based on the diagrammed graph.

gremlin> t
==>[b, 2]
==>[a, 1]
==>[a, 1]
gremlin> t.sort()
==>[a, 1]
==>[a, 1]
==>[b, 2]
gremlin> t.unique().sort()
==>[a, 1]
==>[b, 2]

NOTE: For both unique() and sort() there are overloaded methods that allow for closures and comparators to be added to accomplish behaviors like arbitrary uniqueness criteria, reverse sort, etc. For those familiar with Gremlin, a cap can be used to do inline sorting and uniquing.
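
For instance, a reverse sort or a column-based uniqueness criterion might look as follows; the closure signatures here are assumptions about those overloaded methods, not verified API:

t.sort{a, b -> b.get(0) <=> a.get(0)} // hypothetical: reverse sort on the first column
t.unique{row -> row.get(0)}           // hypothetical: uniqueness judged by the first column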


Multiple Patterns within a Single Traversal

In all the examples thus far, the table step has been the last step. This need not be the case. For example, it is possible to do a pattern match in the beginning of an expression and still accomplish some other traversal computation in the latter part. Moreover, it is possible to have multiple tables within the same expression and in doing so, match multiple patterns within the same traversal. In the example below, table t1 refers to vertex 1 and his friends. Table t2 refers to vertex 1's friends and their purchased products. Note that the second table step is provided a list of as names to index (i.e. y and z).

g.v(1).as('x').out('knows').as('y').table(t1).out('bought').as('z').table(t2, ['y','z'])

t1
   [x:v[1], y:v[8]]
   [x:v[1], y:v[2]]
t2
   [y:v[8], z:v[9]]
   [y:v[2], z:v[4]]
   [y:v[2], z:v[3]]

Using Table Column Closures

A column closure can be evaluated over an entry in order to process the result prior to insertion into the table. In the contrived example below, instead of saving vertices, the names of the friends and the lengths of the names of the purchased products are inserted into the table.

g.v(1).out('knows').as('x').out('bought').as('y').table(t){it.name}{it.name.length()}

   [x:peter, y:7]
   [x:pierre, y:6]
   [x:pierre, y:4]

This can also be effected in a post-processing manner by using the Table.apply(Closure[]) method: t.apply{it.name}{it.name.length()}.


Backtracking and Pattern Matching

A neat feature of Gremlin (inspired by XPath) is the ability to backtrack (see Backtrack Pattern). With the back step, it's possible to explore a particular path and then back up to a previous part of the path before moving forward again. Visually, this is like going down one fork of a path, returning to the fork, and going down another fork in the path–all within a single linear expression. In the example below, the name of the friend, the number of friends each friend has, and the products they purchased are saved to the resultant table.

g.v(1).out('knows').name.as('x').back(1).transform{it.out('knows').count()}.as('y').back(1).out('bought').as('z').table(t)

   [x:peter, y:1, z:v[9]]
   [x:pierre, y:0, z:v[4]]
   [x:pierre, y:0, z:v[3]]

This example is a bit contrived as a similar behavior can be effected using the table column closures model discussed previously. However, when back is utilized for its filtering capabilities, backtracking and pattern matching play off each other in an interesting way.


Recursive Pattern Matching

The simple “knows/bought”-graph used thus far is not complex enough to describe recursive pattern matching. Therefore, a new graph is needed. Imagine a genealogy tree like the one diagrammed below.

g.v(1).as('x').out('parent').as('y').table(t).loop('x'){it.out.count() != 0}

In the Gremlin example above, the loop step will loop over out('parent').as('y') until it reaches the top of the tree (see Loop Pattern). The table t will contain all the ancestor relationships from vertex 1 all the way back to the first parent (vertex infinity). For example, a subset of the table is:

[x:1, y:2]
[x:1, y:3]
[x:1, y:4]
[x:1, y:5]
...

The ability to loop through sections of an expression while, at the same time, matching patterns seen within the recursion is a major boon achieved when mixing graph traversing and graph pattern matching.


Bonus Examples

The examples presented have demonstrated functionality in isolation in order to articulate common use cases with the clearest possible syntax. However, as many know, Gremlin can get a bit “crazy” once you get the hang of it — it's been called “Perl for graphs.” For fun, let's get crazy.

PROBLEM: Here is a pattern match example. Try and figure out what it computes.

gremlin> g.v(1).out('knows').as('x').out('bought').as('y').table(new Table()){it.name}{it.name}.cap.transform{it.unique().sort{a,b -> a.get(1) <=> b.get(1)}.apply{it.substring(0,2)}{it}} >> 1
==>[pe, arduino]
==>[pi, fpga]
==>[pi, tether]

SOLUTION: Get vertex 1's friends and what they have purchased. Insert those into a new table with the entries being the names of the friends and purchased products. Cap the table step/pipe in order to emit the table, not the end of the path. Transform the table into a de-duplicated version sorted by the names of the products, with the first column being the first two characters of the person's name. Finally, return the single table.

Conclusion

Gremlin 1.1 boasts named steps. With the as step, it's possible to name a particular step in the path and use it with back, optional, loop, and table. For those using Gremlin previous to 1.1, named steps are a welcome addition that makes Gremlin code easier to read. However, the main achievement of Gremlin 1.1 is the ability to do graph pattern matching using graph traversal techniques. This development unites two seemingly disparate graph query worlds. It is now possible to evaluate a complicated traversal over a graph and, at the same time, save particular aspects of that traversal to a table result set — two birds with one stone.

References

This blog post used examples from SPARQL to frame Gremlin’s 1.1 graph pattern matching constructs. Note that there are many other graph languages out there in the wonderful world of graphs — more than those mentioned below. Finally, please see the Pattern Match Pattern documentation for more on Gremlin graph pattern matching.

  • Ripple: An RDF Path Language
  • Cypher: A Property Graph Pattern Match Language
  • Versa: An RDF Path Language
  • RQL: An RDF Pattern Match Language
Putting and Getting Data from a Database

Databases support the putting and getting of data. What distinguishes one database type from another is the structure of the data they store and the means by which that data is retrieved. There are numerous types of databases. However, this short post will only discuss a few.

Primitive Store: One can imagine the most basic database which contains only simple data values. Let’s call such a database a “primitive store.” With a primitive store, there is no structure to the data. Each datum is a primitive value in a single unordered set (i.e. a bag of primitives). Primitives can be inserted into the set and later retrieved.

// put data
db.put('marko');
db.put(31);
db.put(true);
// get data
Iterator results = db.get();
Iterator filteredResults = db.get{it.startsWith('ma')};

Key/Value Store: For key/value-based storage systems the data structure is an ordered 2-tuple that contains a key and a value. Other names for this data structure include an associative array, a map, or a dictionary. When interacting with a key/value store, data pairs are inserted into the database and values are retrieved based on their keys.

// put data
db.put('name', 'marko');
db.put('age', 31);
// get data
Object value = db.get('name');

Document Store: The next level of structural complexity is the document store. Documents are tree/nested structures such as those provided by XML and JSON. This added level of complexity provides more modeling flexibility to the developer. For example, one can model a person as a document as follows:

Document document = {
  type : "person",
  name : "marko",
  age : 31,
  skills : {
    languages : ["java", "groovy", "gremlin", "R"]
  }
}

To find a person that is skilled in R, a “query object” is constructed. In a manner similar to classic tuple systems, the query object provides required bindings. For all fields not specified, a wildcard is assumed.

// put data
db.put(document);
// get data
Document result = db.findOne({
  type : "person",
  skills.languages : "R"
});

Finally, most key/value and document stores support the MapReduce pattern. With MapReduce, each document (or key/value) is processed in parallel and the result of each process is aggregated and returned to the user. For example, a map() and reduce() function can be written to determine the distribution of the languages of the people in the database.
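
A Groovy sketch of such a pair of functions is below. The emit() callback and the manner in which the functions are registered are hypothetical, as each store exposes its own MapReduce API.

// map: for each person document, emit a (language, 1) pair per listed language
def map = { document ->
  document.skills.languages.each{ language -> emit(language, 1) } // emit() is hypothetical
}
// reduce: sum the counts emitted for a given language
def reduce = { language, counts -> counts.sum() }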

Graph Database: The data structure of a graph database is an arbitrarily connected component known as a graph (e.g. sequences, trees, lattices, cycles, etc.). In some ways, a graph database is like a document database save that particular fields in a document can make direct reference to other documents. In graph jargon, the objects are known as vertices and the relationships between vertices are known as edges.

Vertex v = db.putVertex({'name' : 'marko', 'age' : 31});
Vertex u = db.putVertex({'language' : 'gremlin'});
db.putEdge(v, 'hasSkill', u);

There are two ways to retrieve data from a graph database: pattern matching and traversing. A pattern match query is similar, in many ways, to document querying in a document store. A graph pattern is defined where particular components are variables (i.e. wildcards). The classic language for graph-based pattern matching is SPARQL.

SELECT ?x ?y WHERE {
  ?x knows marko .
  marko hasSkill ?y .
  ?x hasSkill ?y
}

In the query above, the variables ?x and ?y bind to people and skills, respectively. However, the graph pattern rule says that binding only occurs for those people who know Marko and share a skill with him.

The second query model is graph traversing. With a graph traversal, an arbitrary number of vertices/edges can be touched (and looped over) in order to yield end points (traversal destinations), paths (traversal history), or a resultant computation (traversal side-effect). In the graph traversal language Gremlin, the ages of the people that share a skill with Marko can be determined using the following query (traversal destination):

marko.out('hasSkill').in('hasSkill').except([marko]).age

To determine all non-looping friendship paths between Marko and Josh the following can be evaluated (traversal history):

marko.out('friend').uniquePath.loop(2){it.object != josh}.paths

Finally, to determine the centrality of every vertex in the graph, a traversal of the following form can be used, where m is a map that stores the resulting centrality score for each vertex (traversal side-effect).

m = [:]
g.V.out.groupCount(m).loop(2){it.loops < 100}

There are numerous types of databases. Each provides a means of storing and retrieving structured data. Depending on the complexity of the domain being modeled and the types of computations required, there exists a database out there to meet the needs of every project.

References

Rodriguez, M.A., “An Overview of Data Management Paradigms: Relational, Document, and Graph,” Data Management Workshop, University of New Mexico, February 2010.

Webber, J., “Square Pegs and Round Holes in the NOSQL World,” World Wide Webber, April 2011.

Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” Bulletin of the American Society for Information Science and Technology, 36(6), pp. 35–41, doi:10.1002/bult.2010.1720360610, August 2010.

Rahien, A., “That No SQL Thing – Key/Value Stores,” Unnatural Acts on Source Code, March 2010.

Local and Distributed Traversal Engines

In the graph database space, there are two types of traversal engines: local and distributed. Local traversal engines are typically for single-machine graph databases and are used for real-time production applications. Distributed traversal engines are typically for multi-machine graph databases and are used for batch processing applications. This divide is quite sharp in the community, but there is nothing that prevents the unification of both models. A discussion of this divide and its unification is presented in this post.

Local Traversal Engines

In a local traversal engine, there is typically a single processing agent that obeys a program. The agent is called a traverser and the program it follows is called a path description. In Gremlin, a friend-of-a-friend path description is defined as such:

g.v(1).outE('friend').inV.outE('friend').inV

When this path description is interpreted by a traverser over a graph, an instance of the description is realized as actual paths in the graph that match that description. For example, the Gremlin traverser starts at vertex 1 and then steps to its outgoing friend-edges. Next, it will move to the head/target vertices of those edges (i.e. vertices 3 and 4). After that, it will go to the friend-edges of those vertices and finally, to the head of the previous edges (i.e. vertices 6 and 7). In this way, a single traverser is following all the paths that are exposed with each new atomic graph operation (i.e. each new step after a .). The abstract syntax being: step.step.step. What is returned by this friend-of-a-friend path description, on the example graph diagrammed, is vertices 6 and 7.

In many situations, it's not the end of the path that is desired, but some side-effect of the traversal. For example, as the traverser walks it can update some global data structure such as a ranking of the vertices. This idea is presented in the path description below, where the outE.inV path is looped over 1000 times. Each time a vertex is traversed over, the map m is updated. This global map m maintains keys that are vertices and values that denote the number of times that each vertex has been touched (groupCount's behavior).

m = [:]
g.v(1).outE.inV.groupCount(m).loop(3){it.loops < 1000}

The local traversal engine pattern is abstractly diagrammed on the right, where a single traverser is obeying some path description (a.b.c) and in doing so, moving around on a graph and updating a global data structure (the red boxed map).

Given the need for traversers to move from element to element, graph databases of this form tend to support strong data locality by means of a direct-reference graph data structure (i.e. vertices have pointers to edges and edges to vertices). A few examples of such graph databases include Neo4j, OrientDB, and DEX.
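
The direct-reference idea can be pictured with a minimal Groovy sketch; the class shapes are illustrative, not any particular database's internals:

// vertices point directly at their incident edges, and edges at their vertices,
// so a traverser hops between elements without index lookups
class Vertex {
  List<Edge> outEdges = []
  List<Edge> inEdges = []
}
class Edge {
  String label
  Vertex outVertex // tail of the edge
  Vertex inVertex  // head of the edge
}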

Distributed Traversal Engines

In a distributed traversal engine, a traversal is represented as a flow of messages between the elements of the graph. Generally, each element (e.g. vertex) is operating independently of the other elements. Each element is seen as its own processor with its own (usually homogeneous) program to execute. Elements communicate with each other via message passing. When no more messages have been passed, the traversal is complete and the results of the traversal are typically represented as a distributed data structure over the elements. Graph databases of this nature tend to use the Bulk Synchronous Parallel model of distributed computing. Each step is synchronized in a manner analogous to a clock cycle in hardware. Instances of this model include Agrapa, Pregel, Trinity, and GoldenOrb.

An example of distributed graph traversing is now presented using a ranking algorithm in Java. [NOTE: In this example, edges are not first class citizens. This is typical of the state of the art in distributed traversal engines. They tend to be for single-relational, unlabeled-edge graphs.]

public void evaluateStep(int step) {
  if(!this.inbox.isEmpty() && step < 1000) {
    // each received message contributes one unit of rank to this vertex
    this.rank = this.rank + this.inbox.size();
    // propagate one message per received message to every adjacent vertex
    for(Vertex vertex : this.adjacentVertices()) {
      for(int i=0; i<this.inbox.size(); i++) {
        this.sendMessage(vertex);
      }
    }
    this.inbox.clear();
  }
}

Each vertex is provided the above piece of code. This code is executed by every vertex at each step (in the Bulk Synchronous Parallel sense — “clock cycle”). For a concrete example, a “start message” to, let's say, vertex 1 initiates the process. Vertex 1 will increment its rank by 1. It will then send messages to its adjacent vertices. On the next step, the vertices adjacent to 1 update their rank by the number of messages they received in the previous step. This process continues until no more messages are found in the system. The aggregation of all the rank values is the result of the algorithm. The general pattern of a Bulk Synchronous Parallel step is: 1.) get messages 2.) process messages 3.) send messages.

When edges are labeled (e.g. friend, created, purchased, etc. — multi-relational graphs), it's necessary to have greater control over how messages are passed. In other words, it's necessary to filter out particular paths emanating from the start vertex. For instance, in a friend-of-a-friend traversal, created-edges should be ignored. To do so, one can imagine using the Gremlin language within a distributed graph traversal engine. In this model, a step in Gremlin is a step in the Bulk Synchronous Parallel model.

Given the running example of a.b.c in the local engine model, a Gremlin-esque distributed engine example is as follows. A “start message” of a.b.c is sent to vertex 1. Vertex 1 will “pop off” the first step a of the path description and then will pass a message of b.c to those vertices that satisfy a. In this way, at step 2, vertices 2 and 3 have b.c waiting for them in their inbox. Finally, at step 3, vertex 6 receives the message c. If a side-effect is desired (e.g. vertex rankings), a global data structure should be avoided. Given the parallel nature of distributed traversal engines, it is best to avoid thread blocking when writing to a global data structure. Instead, each vertex can maintain a portion of the data structure (the red boxes). Moreover, for non-regular path descriptions (which require memory), local data structures can serve as a “scratch pad.” Finally, when the computation is complete a “reduce”-phase can yield the rank results as an aggregation of the data structure components (the union of the red boxes).
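
A conceptual Groovy sketch of this “pop and forward” evaluation is below; the messaging and step-matching methods are hypothetical placeholders.

// executed by a vertex when a path-description message (e.g. a.b.c) arrives
void evaluate(List steps) {
  def head = steps.remove(0)                    // pop the head step (e.g. a)
  this.verticesSatisfying(head).each{ v ->      // hypothetical adjacency/match method
    if(steps.isEmpty())
      v.reportResult()                          // nothing left to match: v ends the path
    else
      this.sendMessage(v, new ArrayList(steps)) // forward the tail (e.g. b.c)
  }
}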

There is much work to be done in this area and members of the TinkerPop team are currently researching a unification of the local and distributed models of graph traversal. In this way, certain steps of a path description may be local and then others may be “fanned out” to run in parallel. We are on the edge of our seats to see what will come out of this thought traversal.

Acknowledgements

The contents of this post have been directly inspired by conversations with Alex Averbuch, Peter Neubauer, Andreas Kollegger, and Ricky Ho. Indirect inspiration has come from my various collaborators and all the contributors to the TinkerPop space.
