Skip to content

Graph Systems Architect




Graph technologies
      ease the modeling of your domain and
      improve the simplicity and speed of your queries.

Global vs. Local Graph Ranking

Graph ranking algorithms are all about mapping a complex graphical structure to a numeric vector. For a given algorithm, a single numeric value in the resultant vector corresponds to the score of a particular vertex in the graph. In code, the previous structures are defined below using Groovy and Blueprints.

def g = new Neo4jGraph('/tmp/neo4j'); // the graph
def m = new HashMap<Vertex,Double>(); // the rank vector

Graph ranking can be decomposed into two sub-concepts: global and local ranks. This distinction is brought to light in this post.

Global Rank Metrics

Many of the popular graph ranking algorithms are global. A nice laundry list is provided on the centrality Wikipedia page. The textbook standards include the eigenvector-based algorithms (e.g. PageRank, eigenvector centrality) and the geodesic-based algorithms (e.g. betweenness, closeness centrality). What these algorithms have in common is that they rank all the vertices in the graph relative to all the vertices in the graph. As such, the entire graph is analyzed to determine the score of any one vertex, where the returned rank vector is the size of the total number of vertices in the graph. This idea is expressed in Groovy, Blueprints, and Gremlin below.

public Map<Vertex,Double> global(Graph g) {
  g.V.each{ ... }
}

def g = new Neo4jGraph('/tmp/neo4j');
assert global(g).size() == g.V.count();

Note that the global rank method only takes a single graph argument. This is because the graph has all the information needed to calculate the ranking–all the vertices can be derived from the graph. However, nothing prevents a subset of the vertices being ranked relative to some subset of vertices. Such algorithms are called local rank metrics and are discussed next.

Local Rank Metrics

A key article articulating the concept of local rank is Algorithms for Estimating Relative Importance in Networks by Scott White and Padhraic Smyth. In this article, the authors chose to explore the question:

“Which entities are most important in the [graph] relative to a particular individual or set of individuals?

The codified structure of this statement is provided below.

public Map<Vertex,Double> local(Set<Vertex> roots) {
  roots.each{ ... }
}

def g = new Neo4jGraph('/tmp/neo4j');
def roots = g.V[0..10]; // some subset of V
assert local(roots).size() <= g.V.count();

With local rank metrics, a set of root vertices serves as the source of the ranking. All vertices in the graph (or some subset thereof as not all vertices may be reachable or close enough to be touched) are then ranked relative to this root set. When this root set contains only one vertex, this is known as an ego-rank. When this root set contains all vertices, this is know as a global rank. Thus, the concept of local rank is fuzzy.

An interesting class of local rank metrics are the recommendation metics. For example, a website might have many users and many items. Users may like items. When a particular user is logged in, a recommendation can occur which ranks the items in the graph relative to the user. Given that the recommendation is relative to a single user vertex, the algorithm is an ego-rank. Basic collaborative filtering is an example of a local, ego-rank metric. The Gremlin snippet below ranks/recommends items relative to user vertex 1 using basic collaborative filtering.

def m = new HashMap<Vertex,Double>()
g.v(1).outE('likes').inV.inE('likes').outV.outE('likes').inV.groupCount(m);

In English, the code says get vertex 1 from the graph, determine which items user 1 likes, determine which users also like those liked items, determine which items those users like, then finally, rank those items by how many times they are traversed. Thus, a ranking is returned that is rooted at vertex 1.

There is also the concept of local centrality. That is, determine which vertices are most central to some root set of vertices. The classic algorithm of this nature is spreading activation. With spreading activation, a set of root vertices is “stimulated with energy” and that energy emanates from those roots as it diffuses over the edges of the graph. The vertices with the most energy flow over some number of steps are considered the most central relative to the root set. A theoretically interesting instance of the spreading activation algorithm is PageRank Priors which is discussed in the White and Smyth article previous mentioned.

Conclusion

Most of the discussion on graph ranking is focused on global rank metrics. However, in many situations, local metrics are both more useful and more efficient. They are more useful in situations such as personalization. They are more efficient in that they do not require a full traversal of the graph and are usually completed in only a few hops out from the root set.

If you use Gremlin and your code has the form

g.V.something.groupCount(m).loop{ }

then you are doing a global rank algorithm. If your Gremlin code has the form

g.v(1).something.groupCount(m).loop{ }

then you are doing a local rank algorithm. Both types of ranking algorithms have their place and are generally useful for distilling a complex structure into a ranked list of numbers. In other words, both are helpful at making sense of the graph.

References

White, S., Smyth, P., Algorithms for Estimating Relative Importance in Networks, Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), August 24–27, 2003.

Graphs Ensure Something from Nothing

There is a reason that there is something as opposed to nothing. However, it is simple for something to be nothing if that something is but one thing. One thing could be nothing even if it is not no thing. For that one thing must be both that which is called something and nothing.

To not be one thing, something must be some things. These things sum to something and make something not nothing as now there are some things which, as it stands, are not at all no thing.


To be things, things must not simply be, but instead be things in reference to some things. For a thing that simply is is simply one thing and that one thing is some thing called nothing.



Related things is nothing but some things related. However, things related is something that can be related by some thing. As such, some related things of the related something stave the fact that it is all simply nothing.

The more there are some things, the further something is from nothing…

TinkerPop as of Spring 2011

TinkerPop is an open source software development group focused on technologies in the emerging graph database ecosystem. The group started in the Fall of 2009 and has been actively pushing out technologies since. This post highlights the current state of some of the more popular products made freely available as of Spring 2011.

Blueprints: A Property Graph Model Interface

Blueprints was one of the first TinkerPop products. There are numerous graph database providers. The usual suspects include Neo4j, OrientDB, DEX, InfiniteGraph, Sones, HyperGraphDB, and others. Each database has its own particular graph data model and tools for their respective manipulation and analysis. Blueprints is an attempt to provide a common API. In this way, any tool written to the Blueprints API can work over various graph database providers. The hope is to prevent vendor lock-in as well as make software that is generally useful to developers in the graph scene. Blueprints is relatively simple to implement and has few strict requirements of the underlying graph database. There are some conveniences provided that include: representing an RDF store as a graph database, representing a graph database as an RDF store, support for exposing a graph database as a JUNG graph, etc. The future of the project includes more implementations by more graph database vendors and the development of more utilities and tools that make use of the Blueprints API.

Pipes: A Dataflow Framework using Process Graphs

Pipes was a later development in the TinkerPop suite. The purpose of Pipes is to provide atomic operations that are commonly used when doing a graph traversal. It was realized that every property graph analysis algorithm could be represented by a chain of operations that are an instance of one of the following categories.

  1. Transform: Take an object and turn it into another object. For example, given a vertex, return its outgoing edges.
  2. Filter: Take an object and decide whether to return it or not. For example, given an edge, should it be traversed?
  3. Side Effect: Take an object and return it. However, yield some side effect in the process. For example, count the number of times a particular vertex has been seen.

Moreover, such operations can be lazily evaluated in many cases and have the convenient property of being able to be chained together using function composition.

Pipes has come to serve as a foundation for the development of property graph traversal algorithms and languages.

Gremlin: A Graph Traversal Language

Gremlin was one of the primary reasons why TinkerPop was formed. This is apparent when realizing that TinkerPop’s mailing list is called Gremlin Users. The beauty of a property graph (and multi-relational graphs in general) is that it can be traversed in numerous ways. There are many paths through a graph when one can filter on edges, check properties, update a counter, check a counter, jump back 2 steps, etc. As such, there is a need for a language that can concisely express such traversals. Gremlin is one such language. Gremlin is a DSL written in Groovy that compiles down to Pipes. In short, Gremlin is a language for constructing data flow pipelines to traverse a graph.

Rexster: A RESTful Graph Shell

Rexster was thought of when TinkerPop was first formed, but only came to be a mature product much later. Rexster exposes any Blueprints-enabled graph database as a REST-based server using Jersey (deployable through Tomcat). With Rexster, developers resolve semantically rich URLs and, in return, get JSON formatted responses. There is also a JavaScript-based web administration tool called “The Dog House.” It allows users to visually browse a graph and to interact with an embedded Gremlin console.

The Future of TinkerPop

There are many projects that get added and then removed from the TinkerPop suite. What sounded like a good idea at one point, loses its steam as more is learned. However, one project in the works is Frames and it just might stand the test of time. Frames provides a schema-layer to any Blueprints-enabled graph database. In contrast, RDF stores have rich schema layers via RDFS and OWL. As of today, the graph database scene relies primarily on the semantics of the high-level languages that the graph database was developed in. Frames hopes to free this constraint by allowing developers to easily create schemas with varied semantics and rules of inference.

TinkerPop is a relatively new software development group. The team has grown over the last 2 years and with each new member comes new expertise. Hopefully more developers join and help to contribute to the wonderful world of graphs.

Knowledge Representation and Reasoning with Graph Databases

A graph database and its ecosystem of technologies can yield elegant, efficient solutions to problems in knowledge representation and reasoning. To get a taste of this argument, we must first understand what a graph is. A graph is a data structure. There are numerous types of graph data structures, but for the purpose of this post, we will focus on a type that has come to be known as a property graph. A property graph denotes vertices (nodes, dots) and edges (arcs, lines). Edges in a property graph are directed and labeled/typed (e.g. “marko knows peter”). Both vertices and edges (known generally as elements) can have any number of key/value pairs associated with them. These key/value pairs are called properties. From this foundational structure, a suite of questions can be answered and problems solved.

Object Modeling

The property graph data structure is nearly identical in form to the object graphs of object oriented programming. Take a collection of objects, remove their methods, and you are left with a property graph. An object’s fields are either primitive and in which cases serve as properties or they are complex and in which case serve as references to other objects. For example, in Java:

class Person {
  String name;
  Integer age;
  Collection<Person> knows;
}

The name and age properties are vertex properties of the particular person instance and the knows property refer to knows-labeled edges to other people. Emil Eifrem of Neo Technology espouses the view that property graphs are “whiteboard friendly” as they are aligned with the semantics of modern object oriented languages and the diagramming techniques used by developers. A testament to this idea is the jo4neo project by Taylor Cowan. With jo4neo, Java annotations are elegantly used to allow for the backing of a Java object graph by the Neo4j graph database. Beyond the technological benefits, the human mind tends to think in terms of objects and their relations. Thus, graphs may be considered “human brain friendly” as well.

Given an object graph, questions can be answered about the domain. In the graph traversal DSL known as Gremlin, we can ask questions of the object graph:

// Who does Marko know?
marko.outE('knows').inV

// What are the names of the people that Marko knows?
marko.outE('knows').inV.name

// What are the names and ages of the people that Marko knows?
marko.outE('knows').inV.emit{[it.name, it.age]}

// Who does Marko know that are 30+ years old?
marko.outE('knows').inV{it.age > 30}

Concept Modeling

From the instances that compose a model, there may exist abstract concepts. For example, while there may be book instances, there may also be categories for which those books fall–e.g. science fiction, technical, romance, etc. The graph is a flexible structure in that it allows one to express that something is related to something else in some way. These somethings may be real or ethereal. As such, ontological concepts can be represented along with their instances and queried appropriately to answer questions.

// What are the parent categories of history?
x = []; history.inE('subCategory').outV.aggregate(x).loop(3){!it.equals(literature)}; x

// How many descendant categories does fiction have?
c = 0; fiction.outE('subCategory').inV.foreach{c++}.loop(3){true}; c

// Is romance at the same depth as history?
c = 0; romance.inE('subCategory').outV.loop(2){c++; !it.equals(literature)}.outE('subCategory').inV.loop(2){c--; !it.equals(history)}; c == 0

Automated Reasoning


From the explicit objects, their relationships, and their abstract categories, reasoning processes can be enacted. A tension that exists in graph modeling is what to make explicit (structure) and what to infer through traversal (process). The trade-off is between, like much of computing, space and time. If there exists an edge from a person to their coauthors, then its a single hop to get from that person to his or her coauthors. If, on the other hand, coauthors must be inferred through shared writings, then a multi-hop step is computed to determine coauthors. Reasoning is the process of making what is implicit explicit. A couple simple reasoning examples are presented below using Gremlin.

// Two people who wrote the same book/article/etc. are coauthors
g.V{x = it}.outE('wrote').inV.inE('wrote').outV.except([x])[0].foreach{g.addEdge(null, x, it, 'hasCoauthor')}

// People who write literature are authors
author = g.addVertex(); author.type='role'; author.name='author'
g.V.foreach{it.outE('wrote').inV[0].foreach{g.addEdge(null, it, author, 'hasRole')} >> -1}

In the examples above, a full graph analysis is computed to determine all coauthors and author roles. However, nothing prevents the evaluation of local inference algorithms.

// Marko's coauthors are those people who wrote the same books/articles/etc. as him
marko.outE('wrote').inV.inE('wrote').outV.except([marko])[0].foreach{g.addEdge(null, x, it, 'hasCoauthor')}

Conclusion

Graphs are useful for modeling objects, their relationships to each other, and the conceptual structures wherein which they lie. From this explicit information, graph query and inference algorithms can be evaluated to answer questions on the graph and to increase the density of the explicit knowledge contained within the graph (i.e. increase the number of vertices and edges). This particular graph usage pattern has been exploited to a great extent in the world of RDF (knowledge representation) and RDFS/OWL (reasoning). The world of RDF/RDFS/OWL is primarily constrained to description logics (see an argument to the contrary here). Description logics are but one piece of the larger field of knowledge representation and reasoning. There are numerous logics that can be taken advantage of. In the emerging space of graph databases, the necessary building blocks exist to support the exploitation of other logics. Moreover, these logics, in some instances, may be used concurrently within the same graphical structure. To this point, the reading list below provides a collection of books that explicate different logics and ideas regarding heterogeneous reasoning. Graph databases provide a green field by which these ideas can be realized.

Further Reading

Brachman, R., Levesque, H., “Knowledge Representation and Reasoning,” Morgan Kaufmann, 2004.

Wang, P., “Rigid Flexibility: The Logic of Intelligence,” Springer, 2006.

Mueller, E.T., “Commonsense Reasoning,” Morgan Kaufmann, 2006.

Minsky, M., “The Society of Mind,” Simon & Schuster, 1988.

Property Graph Algorithms

The term property graph has come to denote an attributed, multi-relational graph. That is, a graph where the edges are labeled and both vertices and edges can have any number of key/value properties associated with them. An example of a property graph with two vertices and one edge is diagrammed below.

Property graphs are more complex than the standard single-relational graphs of common knowledge. The reason for this is that there are different types of vertices (e.g. people, companies, software) and different types of edges (e.g. knows, works_for, imports). The complexities added by this data structure (and multi-relational graphs in general, e.g. RDF graphs) effect how graph algorithms are defined and evaluated.

Standard graph theory textbooks typically present common algorithms such as various centralities, geodesics, assortative mixings, etc. These algorithms usually come pre-packaged with single-relational graph toolkits and frameworks (e.g. NetworkX, iGraph).

It is common for people to desire such graph algorithms when they begin to work with property graph software. I have been asked many times:

“Does the property graph software you work on support any of the common centrality algorithms? For example, PageRank, closeness, betweenness, etc.?”

My answer to this question is always:

“What do you mean by centrality in a property graph?”

When a heterogeneous set of vertices can be related by a heterogeneous set of edges, there are numerous ways in which to calculate centrality (or any other standard graph algorithm for that matter).

  1. Ignore edge labels and use standard single-relational graph centrality algorithms.
  2. Isolate a particular “slice” of the graph (e.g. the knows subgraph) and use standard single-relational graph centrality algorithms.
  3. Make use of abstract adjacencies to compute centrality with higher-order semantics.

The purpose of this blog post is to stress point #3 and the power of property graph algorithms. In Gremlin, you can calculate numerous eigenvector centralities for the same property graph instance. At this point, you might ask: “How can a graph have more than one primary eigenvector?” The answer lies in seeing all the graphs that exist within the graph—i.e. seeing all the higher-order, derived, implicit, virtual, abstract adjacencies. Each line below exemplifies point #1, #2, and #3 in the list above, respectively. The code examples use the power method to calculate the vertex centrality rankings which are stored in the map m.

g.V.outE.inV.groupCount(m).loop(3){c++ < 10000} // point #1
g.V.outE[[label:'knows']].inV.groupCount(m).loop(4){c++ < 10000} // point #2
g.V.???.groupCount(m).loop(?){c++ < 10000} // point #3

The ??? on line 3 refers to the fact that ??? can be any arbitrary computation. For example, ??? can be:

outE[[label:'works_for']].inV.inE[[label:'works_for']].outV
outE[[label:'works_for']].inV[[name:'ACME']].inE[[label:'works_for']].outV
outE[[label:'develops']].inV.outE[[label:'imports']].inV[[name:'Blueprints']].back(7).outE[[label:'works_for']].inV.inE[[label:'works_for']].outV.outE[[label:'develops']].inV.outE[[label:'imports']].inV[[name:'Blueprints']].back(7)

The above expressions have the following meaning:

  1. Coworker centrality
  2. ACME Corporation coworker centrality
  3. Coworkers who import Blueprints into their software centrality

There are numerous graphs within the graph. As such, “what do you mean by centrality?”

These ideas are explored in more detail in the following article and slideshow.

Rodriguez M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29-41, Elsevier, doi:10.1016/j.joi.2009.06.004, 2009.

Gremlin 0.7 Released

Gremlin is a domain specific language for traversing graphs. Gremlin 0.7 (Gremopoly) is a major improvement over Gremlin 0.6 and below. Gremlin 0.7 makes use of Groovy as its host language. Given that Groovy is a superset of Java, it is possible to seamlessly integrate with Blueprints, Pipes, and the standard Java JDK. There is no more awkward disjoint between Gremlin and the engine it is built upon — namely Java and TinkerPop.

Here is a collection of Gremlin 0.7 code snippets to provide a flavor of the language.

    • What are vertex 1′s friends’ names?
  • g.v(1).outE[[label:'friend']].inV.name
    
    • What are vertex 1′s friends’ friends’ names who are not vertex 1′s friends?
  • x = []
    g.v(1).outE[[label:'friend']].inV.aggregate(x)
       .outE[[label:'friend']].inV.except(x).name
    
    • What are the names of vertex 1′s friends who are friends with vertex 2?
  • g.v(1).outE[[label:'friend']].inV
       .outE[[label:'friend']].inV[[id:2]].back(4).name
    
    • How many steps away from vertex 1 is vertex 2 in the friendship subgraph?
  • s = 0; r = true; // s will be solution
    g.v(1).outE[[label:'friend']].inV.loop(3)
       {if(r & it.object==g.v(2)) {s=it.loops; r=false}; r}
    

    These basic examples touch at what is possible when maneuvering a graph with Gremlin. As the graphical domain model becomes more complex and the way in which vertices relate to each other yield increasingly sophisticated abstract relationships, Gremlin becomes an indispensable tool for arbitrary property graph traversal, query, and ultimately, problem-solving.

    For those that process RDF, Gremlin provides connectivity to the OpenRDF Sail framework (see Sail and Blueprints). Moreover, with Gremlin, you can traverse the Web of Data in real-time.

    The move to Groovy as Gremlin’s host language will speed up the development and ease the maintenance of Gremlin. This is a big step for all of us that love to traverse graphs.

    Follow

    Get every new post delivered to your Inbox.

    Join 127 other followers