A Graph-Based Movie Recommender Engine

A recommender engine helps a user find novel and interesting items within a pool of resources. There are numerous types of recommendation algorithms and a graph can serve as a general-purpose substrate for evaluating such algorithms. This post will demonstrate how to build a graph-based movie recommender engine using the publicly available MovieLens dataset, the graph database Neo4j, and the graph traversal language Gremlin. Feel free to follow along in the Gremlin console as the post will go step-by-step from data acquisition, to parsing, and ultimately, to traversing.
The MovieRatings Dataset
The GroupLens research group has made available a corpus of movie ratings. There are 3 versions of this dataset: 100 thousand, 1 million, and 10 million ratings. This post makes use of the 1 million ratings version of the dataset. The dataset can be downloaded from the MovieRatings website (~6 megs in size). The raw dataset is composed of three files: users.dat, movies.dat, and ratings.dat. More information about the particulars of each file is described in the respective README.txt.
Getting Started with Gremlin
All of the code examples can be cut and pasted into the Gremlin console or into a Groovy/Java class within a larger application. For the sake of simplicity, simply follow along with the Gremlin console. Gremlin 1.3 is available for download at this location.
marko$ ./gremlin.sh
\,,,/
(o o)
-----oOOo-(_)-oOOo-----
gremlin>
Generating a MovieRatings Graph

Before getting recommendations of which movies to watch, it is important to first parse the raw MovieLens data according to the graph schema defined above. While this is not the only way in which the data could be represented as a property graph, the representation above is natural and useful for the traversals presented later.
The data will be inserted into the graph database Neo4j. The Gremlin/Groovy code below creates a new Neo4j graph, removes an unneeded default edge index, and sets the transaction buffer to 1000 mutations per commit. Finally, to make the data easier to understand, a map is hardcoded which will allow for the representation of a user’s occupation as a string instead of its integer id.
g = new Neo4jGraph('/tmp/movieRatings')
g.dropIndex("edges")
g.setMaxBufferSize(1000)
occupations = [0:'other', 1:'academic/educator', 2:'artist',
3:'clerical/admin', 4:'college/grad student', 5:'customer service',
6:'doctor/health care', 7:'executive/managerial', 8:'farmer',
9:'homemaker', 10:'K-12 student', 11:'lawyer', 12:'programmer',
13:'retired', 14:'sales/marketing', 15:'scientist', 16:'self-employed',
17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer']
Parsing Movie Data
The file movie.dat contains a list of movies. Each row of the raw file has 3 columns: movieId, title, and generas.
marko$ more -n6 movies.dat 1::Toy Story (1995)::Animation|Children's|Comedy 2::Jumanji (1995)::Adventure|Children's|Fantasy 3::Grumpier Old Men (1995)::Comedy|Romance 4::Waiting to Exhale (1995)::Comedy|Drama 5::Father of the Bride Part II (1995)::Comedy
The code to parse this data into Neo4j and according to the diagrammed schema is presented below. The details of the code are left to focused eye of the interested reader. For others, simply cutting and pasting this code into the Gremlin console will suffice to move forward.
new File('movies.dat').eachLine {def line ->
def components = line.split('::');
def movieVertex = g.addVertex(['type':'Movie', 'movieId':components[0].toInteger(), 'title':components[1]]);
components[2].split('\\|').each { def genera ->
def hits = g.idx(T.v)[[genera:genera]].iterator();
def generaVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Genera', 'genera':genera]);
g.addEdge(movieVertex, generaVertex, 'hasGenera');
}
}
Parsing User Data
The file users.dat contains a list of users. Each row of the raw file has 5 columns: userId, gender, age, occupation, and zipcode. For schema simplicity, the zipcode field will be ignored.
ml-1m$ more -n6 users.dat 1::F::1::10::48067 2::M::56::16::70072 3::M::25::15::55117 4::M::45::7::02460 5::M::25::20::55455
new File('users.dat').eachLine {def line ->
def components = line.split('::');
def userVertex = g.addVertex(['type':'User', 'userId':components[0].toInteger(), 'gender':components[1], 'age':components[2].toInteger()]);
def occupation = occupations[components[3].toInteger()];
def hits = g.idx(T.v)[[occupation:occupation]].iterator();
def occupationVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Occupation', 'occupation':occupation]);
g.addEdge(userVertex, occupationVertex, 'hasOccupation');
}
Parsing Ratings Data
The file ratings.dat contains the ratings that a user provided for the movies they watched. Each row of the raw file has 4 columns: userId, movieId, stars (1-5 rating scale), and timestamp. The timestamp field will be ignored.
ml-1m$ more -n6 ratings.dat 1::1193::5::978300760 1::661::3::978302109 1::914::3::978301968 1::3408::4::978300275 1::2355::5::978824291
Given that there are 1 million ratings, this code fragment will take a minute or so to evaluate. For those dealing with large datasets and Neo4j, its possible to use Neo4jBatchGraph which is new with Blueprints 1.0.
new File('ratings.dat').eachLine {def line ->
def components = line.split('::');
def ratedEdge = g.addEdge(g.idx(T.v)[[userId:components[0].toInteger()]].next(), g.idx(T.v)[[movieId:components[1].toInteger()]].next(), 'rated');
ratedEdge.setProperty('stars', components[2].toInteger());
}
To commit any data left over in the transaction buffer, successfully stop the current transaction. Now the data is persisted to disk. If you plan on leaving the Gremlin console, be sure to g.shutdown() the graph first.
g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS)
Before moving onto recommender algorithms, lets make sure the graph looks correct.
gremlin> g.V.count() ==>9962 gremlin> g.E.count() ==>1012657 gremlin> g.V[[type:'Movie']].count() ==>3883 gremlin> g.V[[type:'Genera']].count() ==>18 gremlin> g.V[[type:'User']].count() ==>6040 gremlin> g.V[[type:'Occupation']].count() ==>21 gremlin> occupations.size() ==>21
As a side: What is the distribution of occupations amongst the user population?
gremlin> m = [:]
gremlin> g.V[[type:'User']].out('hasOccupation').occupation.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}
==>college/grad student=759
==>other=711
==>executive/managerial=679
==>academic/educator=528
==>technician/engineer=502
==>programmer=388
==>sales/marketing=302
==>writer=281
==>artist=267
==>self-employed=241
==>doctor/health care=236
==>K-12 student=195
==>clerical/admin=173
==>scientist=144
==>retired=142
==>lawyer=129
==>customer service=112
==>homemaker=92
==>unemployed=72
==>tradesman/craftsman=70
==>farmer=17
What about the average age?
gremlin> g.V[[type:'User']].age.mean() ==>30.639238410596025
Traversing the MovieLens Graph
Now that the MovieLens data has been parsed and represented as a graph, the data is ready to be traversed (i.e. queried). There are two general types of recommendation: collaborative filtering and content-based recommendation.
In collaborative filtering, the rating/liking/preference behavior of users is correlated in order to recommend the favorites of one user to another, similar user.
I like the movies you like, what other movies do you like that I haven’t seen?
With content-based recommendation, if a particular item is liked, then its features are analyzed in order to find other items with analogous features.
I like romantic zombie comedies, what other romantic zombie comedies are there?
Basic Collaborative Filtering
This section will build up an ever more complex traversal in order to explain the foundation of graph-based collaborative filtering and hint at the variations that can be incorporated to meet domain specific requirements. Lets start from Toy Story the movie.
gremlin> v = g.idx(T.v)[[title:'Toy Story (1995)']] >> 1 ==>v[1] gremlin> v.map() ==>movieId=1 ==>title=Toy Story (1995) ==>type=Movie
- Which users gave Toy Story more than 3 stars? (only return 5 results)
gremlin> v.inE('rated').filter{it.getProperty('stars') > 3}.outV.userId[0..4]
==>v[3902]
==>v[3912]
==>v[3916]
==>v[3918]
==>v[3920]
This traversal doesn’t yield useful information. However, when it is used within a larger path expression, collaborative filtering is effected.

- Which users gave Toy Story more than 3 stars and what other movies did they give more than 3 stars to? (only return 5 results)
gremlin> v.inE('rated').filter{it.getProperty('stars') > 3}.outV.outE('rated').filter{it.getProperty('stars') > 3}.inV.title[0..4]
==>One Flew Over the Cuckoo's Nest (1975)
==>Erin Brockovich (2000)
==>Bug's Life, A (1998)
==>Ben-Hur (1959)
==>Christmas Story, A (1983)
- Start from Toy Story — v
- Get the incoming rated edges — inE(‘rated’)
- Filter out those edges whose star property is less than 4 — filter{it.getProperty(‘stars’) > 3}
- Get the tail user vertices of the remaining edges — outV
- Get the rating edges of those user vertices — outE(‘rated’)
- Filter out those edges whose star property is less than 4 — filter{it.getProperty(‘stars’) > 3}
- Get the head movie vertices of the remaining edges — inV
- Get the string title property of those movie vertices — title
- Only emit the first 5 titles — [0..4]
What is the meaning of the aggregate of these atomic-steps? — highly co-rated. With Gremlin, it is possible to work at a higher level of abstraction by making use of a user defined steps. The step defined below is called corated and it bundles all these atomic-steps together.
gremlin> Gremlin.defineStep('corated',[Vertex,Pipe], { def stars ->
_().inE('rated').filter{it.getProperty('stars') > stars}.outV.outE('rated').filter{it.getProperty('stars') > stars}.inV})
==>null
gremlin> v.corated(3).title[0..4] ==>One Flew Over the Cuckoo's Nest (1975) ==>Erin Brockovich (2000) ==>Bug's Life, A (1998) ==>Ben-Hur (1959) ==>Christmas Story, A (1983)
This traversal will return a list of movies. Notice that this list has many duplicates. This is due to the fact that users who like Toy Story also like many of the same other movies. This similarity of human behavior is what is capitalized on in recommendation algorithms.
- How many of Toy Story’s highly co-rated movies are unique?
gremlin> v.corated(3).count() ==>268493 gremlin> v.corated(3).uniqueObject.count() ==>3353
Given that there are 268,493 highly rated paths from Toy Story to other movies and only 3,353 of those movies are unique, it is possible to use these duplicates as a ranking mechanism–ultimately, a recommendation.
- Which movies are most highly co-rated with Toy Story? (return the top 10)
gremlin> m = [:]
gremlin> v.corated(3).title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
==>Toy Story (1995)=1655
==>Star Wars: Episode V - The Empire Strikes Back (1980)=1000
==>Star Wars: Episode IV - A New Hope (1977)=998
==>American Beauty (1999)=949
==>Matrix, The (1999)=925
==>Raiders of the Lost Ark (1981)=922
==>Silence of the Lambs, The (1991)=887
==>Saving Private Ryan (1998)=878
==>Back to the Future (1985)=876
==>Shawshank Redemption, The (1994)=875
The highest ranked result makes sense. It says, people who like Toy Story, also like Toy Story. In order to remove these reflexive paths, simply filter out Toy Story.
gremlin> m = [:]
gremlin> v.corated(3).filter{it != v}.title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
==>Star Wars: Episode V - The Empire Strikes Back (1980)=1000
==>Star Wars: Episode IV - A New Hope (1977)=998
==>American Beauty (1999)=949
==>Matrix, The (1999)=925
==>Raiders of the Lost Ark (1981)=922
==>Silence of the Lambs, The (1991)=887
==>Saving Private Ryan (1998)=878
==>Back to the Future (1985)=876
==>Shawshank Redemption, The (1994)=875
==>Toy Story 2 (1999)=871
These are all fine movies and ones that are generally considered “good.” However, note that the recommendation traversal starts not from a particular user, but from a particular movie (i.e. Toy Story). If instead the traversal starts from a particular user, then basic collaborative filtering is implemented.
Given a user, what movies do they like, who else likes those movies, and what other movies do those users like that are not already liked by that initial user.
Expressing this in Gremlin is left as an exercise to the more than casually interested reader. Feel free to email me your solution for comments (or post it to the Gremlin-Users mailing list).
Mixing in Content-Based Recommendation
The previous section returned generally good movies. However, if I’m a parent and I want to find a good movie like Toy Story for my children to watch, I probably do not want them watching Silence of the Lambs. To remedy this situation, it is possible to mix collaborative filtering and content-based recommendation into a traversal that yields “Toy Story good” movies of the same genera.
gremlin> v.out('hasGenera').genera
==>Animation
==>Children's
==>Comedy
- Which movies are most highly co-rated with Toy Story that share a genera with Toy Story? (return the top 10)
gremlin> m = [:]
gremlin> x = [] as Set
gremlin> v.out('hasGenera').aggregate(x).back(2).corated(3).filter{it != v}.out('hasGenera').retain(x).back(2).title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
==>American Beauty (1999)=949
==>Back to the Future (1985)=876
==>Toy Story 2 (1999)=871
==>Princess Bride, The (1987)=851
==>Groundhog Day (1993)=843
==>Shakespeare in Love (1998)=807
==>Forrest Gump (1994)=775
==>Men in Black (1997)=747
==>E.T. the Extra-Terrestrial (1982)=737
==>Bug's Life, A (1998)=700
This ranking makes sense, but still, for the stiflingly worrisome parent, movies like American Beauty (classified as a comedy) may not be considered appropriate for children. How about only considering those movies that share all the same generas with Toy Story?
- Which movies are most highly co-rated with Toy Story that share all generas with Toy Story? (return the top 10)
gremlin> m = [:]
gremlin> x = [] as Set
gremlin> v.out('hasGenera').aggregate(x).back(2).corated(3).filter{it != v}.filter{it.out('hasGenera')>>[] as Set == x}.title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
==>Toy Story 2 (1999)=871
==>Bug's Life, A (1998)=700
==>Chicken Run (2000)=465
==>American Tail, An (1986)=140
==>Aladdin and the King of Thieves (1996)=39
==>American Tail: Fievel Goes West, An (1991)=37
==>Rugrats Movie, The (1998)=34
==>Adventures of Rocky and Bullwinkle, The (2000)=21
==>Saludos Amigos (1943)=4
Conclusion
Recommendation is a popular technique used for separating the wheat from the chaff. For systems with numerous items, helping a user find items that not only satiate their tastes, but also their momentary needs, the algorithms presented can be used. It is important to note that at some level of abstraction, collaborative filtering and content-based recommendation are the same. This idea is made salient when considering that ratings are yet another feature of an item. More specifically, highly co-rated movies of a movie are features of that movie. This idea gets to the flexibility of the property graph data structure and the notion of abstract/inferred/derived relationships.
The itemization below presents recommendation themes that can be further explored by the interested reader. The graph allows an analyst to slice and dice the data in various ways. With enough data (many types of things and many types of relationships), there are numerous mappings that can be made. That is the power of the graph traversal pattern.
- Make use of genera taxonomies to explore genera generalization and specification.
- Determine if gender is a significant factor in the preference of movies.
- Recommend movies with filtering based on the occupation (or age) of the user.
- Mix in other movie features such as director, actors, and motion picture ratings.
- Makes use of a collection of start movies (e.g. “Given Toy Story and Big Trouble in Little China, …”). Realize that a single user can be seen as a reference to such a collection (the movies they highly rated).
- “What do the users prefer who don’t like the same movies as me?”
- “What do the users dislike who like the movies I dislike?” (non-collaborative filtering)
- Use gender, zipcode, and ratings to recommend users a movie watching date (“watch movie X with user Y“).
- Use range filters (e.g. [0..100]) or random walks for fined grained control of the execution speed (i.e. path sampling).
Related Articles
Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Lecture Notes in Artificial Intelligence, volume 5712, pp. 813–820, April 2009.
Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques and Applications, eds. S. Sakr, E. Pardede, IGI Global, ISBN:9781613500538, August 2011.
Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., “A Recommender System to Support the Scholarly Communication Process,” KRS-2009-02, April 2009.
Trackbacks & Pingbacks
- A Graph-Based Movie Recommender Engine (via Marko A. Rodriguez) « minimal
- A Graph-Based Movie Recommender Engine « Another Word For It
- A Graph-Based Movie Recommender Engine « Marko A. Rodriguez | www.mawir.net
- Federal Election Commission Campaign Data Analysis
- Federal Election Commission Campaign Data Analysis « Another Word For It
- Graph Theory and Network Science « Aurelius
- Neo4j on Heroku – Part One « Max De Marzi
- Neo4j on Heroku (pts. 1, 2 & 3) « Another Word For It

Impressed by your didactorial(*) facility.
Thanks, Marko.
Pierre
(*) didactic tutorial, not to be confused with dictatorial
Just out of curiosity: did you do a quick benchmark of the result? (how it would stack up against the winner algorithm in the contest?)
I had timings in the post for each of the queries initially though I took them out to reduce clutter and to not ground results to what my laptop chip can do. In general, you are looking anywhere from 0.5 to 3 seconds from the Gremlin console (depending on the query complexity). Finally, you can control the number of clock-cycles dedicated to the traversal by making use of range filters (e.g. [0..100]) or random sampling (e.g. out(‘likes’).random(0.5)).
Very nice Marko, great blog!
Thanks so much for this. It’s a great primer that I’ve been looking for to learn about these technologies.
Can this be called a logical algorithm for recommandation ?
In many ways, yes. I’m describing a term logic approach to recommendation.
From foundational rules (paths) such as this, it is possible to overlay more and more paths and account for more and more information thus, yield logically complex representations of what people like. Moreover, everyone has their own path and thus, their own interpretation of what is “good” (i.e. no single traversal to rule them all.).
An area of focus for me these days is automagically deriving (and weighting) these paths from user behavior in the system (e.g. clickstream behavior). In other words, generating a path expression that summarizes the user’s flow through the system. Given this path expression, the user’s future behavior can be simulated and thus, provided as a prediction/recommendation to the present representation of the user.
Thank you for your insightful questions.
Marko.
Thanks for the answer.
I don’t think Term Logic can be competitive in terms of scale against machine learning. I would rather use term logic to explain the result of another algorithm.
Did you try Genetic algorithm to generate path ?
Is there any plan to support higher logic order, like does Prolog, in the Gremlin stack ?
It is hard to say, in a blanket manner, that term logic is not competitive with “machine learning.” It really depends on your underlying representation, the algorithms you are trying to evaluate, and the processing infrastructure you are using.
First, graph databases make use of an underlying linked list data structure and thus, are optimized for processes that walk/traverse such structures. Logical reasoning and traversals can be see as similar processes (more about this in the next paragraph). Second, what do you mean by “machine learning?” What algorithm are you talking about. If you are talking about matrix analysis and recommendation, then these have complex running times (e.g. eigenvector analysis, n * n-1 spearman/pearson correlations, etc.).
With respect to Prolog, you can do Prolog-style querying using Gremlin. See the pattern match features of Gremlin. If your knowledge-base (your facts) are statements and inference is the following of the linkage of facts, then deduction is forward traversal.
Induction, abduction, and exemplification can also be simulated using a similar technique — see this article about a multi-relational, evidential logic that is conducive to graph traversal. If you are interested, evidential logic is a term logic developed by Pei Wang.
At the end of the day, the competitiveness of one approach vs. another is all about the data representation, algorithm complexity, and processing infrastructure.
Thanks again for your stimulating comment thread,
Marko.
I think when you said “genera” what you meant was “genre.” It’s a French-derived word so it’s not spelled as it sounds. I was confused at first but then I put the pieces together.
A very thorough walkthrough though, thanks!
genus => genera (lat. “birth”, “race”, “kind”)
opus => opera (lat. “work”)
corpus => corpora (lat. “body”)
genre noun
\ˈzhän-rə, ˈzhäⁿ-; ˈzhäⁿr; ˈjän-rə\
1: a category of artistic, musical, or literary composition characterized by a particular style, form, or content
Nice work.
Do you have any quantitative evaluation in terms of prediction quality and runtime?
Would this scale to datasets of the size of Netflix?
There are various graph databases on the market these days. While Gremlin connects to various vendors (via Blueprints), I’m most familiar with Neo4j. Neo4j can theoretically hold ~64 billion “things” (vertices+edges) (see this page). However, in practice, with reasonable machinery, ~5-10 billion “things” is what you are looking at. Next, depending on the complexity of your graph structure (e.g. super nodes), runtimes vary. In short, at the size of NetFlix, the run-of-the-mill graph schema will need to be optimized to account for the statistics of the graph.
This is good posting that give us very useful info about movie engine
Excellent post, thank you Marko !
I was wondering if you could expand more about making use of collections. If my task is to match users with similar movie tastes, and all users have a number of movies they have highly rated, how can I recommend users by collections who are similar using context & collaborative filtering rather by comparing individual movies?
Thanks,
Naomi
will this tutorial work with OrientDB
Yes. It will work for any Blueprints-enabled graph database. That is what makes TinkerPop so great.