Skip to content

Exploring Wikipedia with Gremlin Graph Traversals

March 7, 2012

A graph is a data structure composed of vertices and edges. These terms are synonymous with the following:

  • vertices, dots, nodes, things, points
  • edges, lines, links, relations, arcs

From vertices and edges, numerous artificial and real-world systems can be modeled and processed. The purpose of this post is to explore a graph representation of Wikipedia persisted in the graph database Neo4j and processed using the graph traversal language Gremlin. Please note that a parse-friendly representation of Wikipedia was provided by DBPedia and the present analysis and presentation is provided by a collaborative effort between Aurelius and the Intelligent Systems Laboratory at PARC (a Xerox company).

On the Graph Nature of Wikipedia

There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.

Using Gremlin, some descriptive graph statistics of this representation are calculated. Along with vertex and edge counts, the in- and out-degree distributions of the graph are provided below. These distribution calculations are agnostic to the edge labels.

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new Neo4jGraph('/data/dbpedia')
==>neo4jgraph[/data/dbpedia]
gremlin> g.V.count()
==>30962172
gremlin> g.E.count()
==>191767951 
gremlin> outDegrees = [:]; inDegrees = [:]
gremlin> g.V.transform{it.out.count()}.groupCount(outDegrees).iterate()
gremlin> g.V.transform{it.in.count()}.groupCount(inDegrees).iterate()
gremlin> f = new File('out-degrees.txt')
gremlin> outDegrees.each{k,v -> f.append(k + '\t' + v + '\n')}
gremlin> f = new File('in-degrees.txt')
gremlin> inDegrees.each{k,v -> f.append(k + '\t' + v + '\n')}

The files out-degrees.txt and in-degrees.txt generated by Gremlin are loaded into R Statistics and plotted.

R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing

> out.degrees <- read.table('out-degrees.txt', sep='\t')
> in.degrees <- read.table('in-degrees.txt', sep='\t')
> plot(out.degrees, log='xy', main='Wikipedia Out-Degree Distribution', xlab='Out Degree', ylab='Frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)
> plot(in.degrees, log='xy', main='Wikipedia In-Degree Distribution', xlab='In Degree', ylab='Frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)

On the Referential Nature of Ideas

One lazy Sunday afternoon, Gremlin walks into his library. The smell of leather and mahogany overcomes his senses. His eyes scan over the shelves filled with books about every imaginable topic. On one inspired day many moons ago, Gremlin printed out each Wikipedia article, bound it, and stored it. On this particular occasion, Gremlin decides to review one of his favorite subjects: graph theory. Off the shelf, Gremlin grabs the book respectfully entitled “Graph Theory.”

gremlin> v = g.idx(T.v)[[uri:'http://dbpedia.org/resource/Graph_theory']].next()
==>v[26634]
gremlin> v.map()
==>name=graph theory
==>uri=http://dbpedia.org/resource/Graph_theory

Gremlin cracks open the cover of Graph Theory. To his amazement, he notices that the words in the book reference other books on his shelf! [line 1] He wonders, how many such references exist? [line 7]

gremlin> v.out('href').name[0..4]
==>mathematics
==>computer science
==>graph (mathematics)
==>vertex (graph theory)
==>graph of a function
gremlin> v.out('href').count()
==>240

One reference at a time, Gremlin pulls the respective book from its shelf and looks at its pages. Unbelievably, these books also make reference to other books! [line 1] How many references do the references of Graph Theory have? [line 8] Dumbfounded by the circuitous nature of ideas being defined in terms of other ideas, Gremlin comes to realize the inherent structure of the graph of knowledge — its self-referential nature [line 9].

gremlin> v.out('href').out('href').name[0..4]
==>mathematics (disambiguation)
==>math (disambiguation)
==>euclid
==>greek language
==>quantity
gremlin> v.out('href').out('href').count()   
==>21182
gremlin> v.out('href').dedup.out('href').out('href').retain([v]).paths{it.name}[0..4]
==>[graph theory, mathematics, logic, graph theory]
==>[graph theory, mathematics, history of mathematics, graph theory]
==>[graph theory, mathematics, operations research, graph theory]
==>[graph theory, mathematics, paul erdős, graph theory]
==>[graph theory, mathematics, theoretical computer science, graph theory]

There is only so much time a little green Gremlin gets to live. He wonders what he would be expert in if he continues to read book after book as dictated by the references that they maintain to one another.

gremlin> v.in('href').name[0..4]
==>semiotics of the structure
==>sones graphdb
==>trapezoid graph
==>alpha centrality
==>block graph

On the Recurrent Nature of Thought

For hours upon hours and days upon days, Gremlin goes from book to book. Sometimes coming back to reading a previous book over again. From the vantage point of a long exposure film, his form looks like a blur amongst his collection. However, this blur is not evenly distributed across his library. In fact, he is more likely reading some books over others. Those books that are most central in the hyperlink structure are those most likely to be encountered on Gremlin’s random walk. In his ledger, Gremlin decides to record the potential that he will be reading any one book at any one time. Initially all pages are equally likely. [line 1] He begins his journey at Graph Theory. [line 2] He looks up how much “potential” exists for Graph Theory. [line 3] He then retrieves all the outgoing neighbors of Graph Theory. [line 4] He determines the total number of neighbors. [line 5] To each neighbor, he distributes Graph Theory’s potential equally amongst them. [lines 6--8] Gremlin then jumps to Graph Theory’s referred to books and continues the process over and over and over again for 100,000 steps. Starting from the Graph Theory vertex, Gremlin diffuses himself like a wave over the structure that surrounds.

gremlin> m = [:].withDefault{1}
gremlin> v.transform{
  rank = m[it.name]; 
  neighbors = it.out('href').toList();
  degree = neighbors.size();
  neighbors.each {
    m[it.name] = m[it.name] + (rank/degree);
  }
  neighbors;
}.scatter.range(0,100000).loop(3){true}.iterate()
==>null

Gremlin looks over the top 10 ranked entries in his ledger. This is where his diffusion shows the most recurrence — it articulates which book he is most likely to be reading at any one moment in the next 100,000 book reads.

gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
==>mathematics=8.2325984930913507177010807338237871640327644591
==>graph theory=6.3579261869065686025097269665928272601625649213
==>graph (mathematics)=3.8056391897097907459393869524402476473212892533
==>computer science=3.2189950001323052861187247709061777261390151631
==>real number=2.942860316400352223267769366106131293745859000
==>mathematician=2.751409475433875926984889918863144276736916
==>number theory=2.7296994022747017139496107107489077766999403538
==>combinatorics=2.6983273156966484352429094611889032326648909856
==>topology=2.6822253369205954921236082679389161746453858919
==>set theory=2.5682887400468326671894663603540368070624776875

On the Taxonomical Nature of Categories

Gremlin’s library is not just shelves with books. His library also has a card catalog in it. The catalog indexes books by category. Moreover, each book has a reference to its subject category. For instance, the subject category of Graph Theory is, well…’graph theory.’ The catalog is also structured such that the categories make reference to one another. For instance, ‘combinatorics’ is a broader category than ‘graph theory.’ For the fun of it, Gremlin decides to move between the books and the categories.

gremlin> v.out('subject').name
==>graph theory
gremlin> v.out('subject').out('broader').name
==>combinatorics
==>mathematics

In order to see the layers in a breadth-first manner, Gremlin exposes the category “rings” emanating from ‘graph theory.’

gremlin> x = [];
gremlin> v.out('subject').out('broader').gather{t = it.unique(); x.add(t*.name); t}.scatter.loop(3){it.loops < 7}.iterate()
==>null
gremlin> x
==>[combinatorics, mathematical relations]
==>[subdivisions of mathematics, discrete mathematics, mathematics, predicate logic]
==>[mathematics, subfields by academic discipline, subdivisions of mathematics, main topic classifications, abstraction, formal sciences, structure, scientific disciplines, systems of formal logic]
==>[main topic classifications, abstraction, formal sciences, structure, scientific disciplines, categories by topic, academic disciplines, mathematics, subfields by academic discipline, articles, thought, innovation, problem solving, creativity, analogy, concepts, logic, interdisciplinary fields, formalism (deductive), matter, dimension, form, science, mathematical logic, formal systems]
==>[articles, thought, structure, innovation, problem solving, creativity, analogy, concepts, academic disciplines, logic, interdisciplinary fields, scientific disciplines, formalism (deductive), matter, dimension, form, science, subfields by academic discipline, categories by parameter, academia, categories by topic, main topic classifications, abstraction, formal sciences, contents, mind, cognition, mental processes, mental content, technological change, intelligence, critical thinking, human skills, human behavior, action, quality, fundamental categories, branches of philosophy, axiology, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, subdivisions of mathematics, discrete mathematics, mathematical logic, metalogic, formal theories, logical syntax]
==>[contents, mind, cognition, mental processes, mental content, matter, concepts, dimension, form, technological change, intelligence, creativity, innovation, critical thinking, human skills, human behavior, action, quality, thought, fundamental categories, academia, categories by topic, abstraction, branches of philosophy, interdisciplinary fields, axiology, science, academic disciplines, subfields by academic discipline, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, main topic classifications, categories, education, categories by parameter, articles, structure, problem solving, analogy, logic, scientific disciplines, formalism (deductive), cognitive science, concepts in metaphysics, philosophy of mind, consciousness, epistemology, psychology, technology, economic development, technological change, and growth, neurology, neuroscience, memory, learning, educational psychology, philosophical logic, evaluation, life skills, skills, humans, behavior, phenomena, determinism, free will, motion, management, philosophy by field, value, deduction, scientific theories, philosophical theories, philosophy of mathematics, physics, concepts by field, physical cosmology, astrophysics, astronomical dynamical systems, reality, everything, life, metaphysics, philosophy of science, topological spaces, differential geometry, geometric topology, measurement, information, perception, concepts in epistemology, mathematics, subdivisions of mathematics, discrete mathematics, metaphilosophy, metatheory, theories, formal languages, formal sciences, syntax]

These rings are useful for seeing what is touched at each step of the walk. However, in order to see the inverted tree structure, Gremlin nests his traversal so that the categories are grouped under their children. Gremlin records the subsumption hierarchy as categories broaden one another.

gremlin> v.out('subject').out('broader').groupBy{it.name}{
  it.out('broader').groupBy{it.name}{
    it.out('broader').name
  }.cap.next()
}.cap.next()
==>mathematical relations=[{mathematics=[main topic classifications, abstraction, formal sciences, structure, scientific disciplines], predicate logic=[systems of formal logic]}]
==>combinatorics=[{discrete mathematics=[subdivisions of mathematics], subdivisions of mathematics=[mathematics, subfields by academic discipline]}]

Gremlin’s library is an adventure through the fascinating world of interconnected ideas.

References

Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, S., Cyganiak, R., Hellmann, S., “DBpedia – A Crystallization Point for the Web of Data,” Journal of Web Semantics, 7(3), 2009.

Rodriguez, M.A., Pepe, A., Shinavier, J., “The Dilated Triple,” Emergent Web Intelligence: Advanced Semantic Technologies, Advanced Information and Knowledge Processing series, 3–16, 2010.

Newman, M.E.J., “The Structure and Function of Complex Networks“, SIAM Review, 45, 167–256, 2003.

Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L.M.A, Chute, R., Rodriguez, M.A., Balakireva, L.L., “Clickstream Data Yields High-Resolution Maps of Science,” PLoS One, Public Library of Science, 4(3), e4803, 2009.

From → Graph Theory, Gremlin

One Comment

Trackbacks & Pingbacks

  1. Exploring Wikipedia with Gremlin Graph Traversals « Another Word For It

Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 140 other followers

%d bloggers like this: