Skip to content

A Graph-Based Movie Recommender Engine

September 22, 2011

A recommender engine helps a user find novel and interesting items within a pool of resources. There are numerous types of recommendation algorithms and a graph can serve as a general-purpose substrate for evaluating such algorithms. This post will demonstrate how to build a graph-based movie recommender engine using the publicly available MovieLens dataset, the graph database Neo4j, and the graph traversal language Gremlin. Feel free to follow along in the Gremlin console as the post will go step-by-step from data acquisition, to parsing, and ultimately, to traversing.

The MovieRatings Dataset

The GroupLens research group has made available a corpus of movie ratings. There are 3 versions of this dataset: 100 thousand, 1 million, and 10 million ratings. This post makes use of the 1 million ratings version of the dataset. The dataset can be downloaded from the MovieRatings website (~6 megs in size). The raw dataset is composed of three files: users.dat, movies.dat, and ratings.dat. More information about the particulars of each file is described in the respective README.txt.

Getting Started with Gremlin

All of the code examples can be cut and pasted into the Gremlin console or into a Groovy/Java class within a larger application. For the sake of simplicity, simply follow along with the Gremlin console. Gremlin 1.3 is available for download at this location.

marko$ ./gremlin.sh

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> 

Generating a MovieRatings Graph

Before getting recommendations of which movies to watch, it is important to first parse the raw MovieLens data according to the graph schema defined above. While this is not the only way in which the data could be represented as a property graph, the representation above is natural and useful for the traversals presented later.

The data will be inserted into the graph database Neo4j. The Gremlin/Groovy code below creates a new Neo4j graph, removes an unneeded default edge index, and sets the transaction buffer to 1000 mutations per commit. Finally, to make the data easier to understand, a map is hardcoded which will allow for the representation of a user’s occupation as a string instead of its integer id.

g = new Neo4jGraph('/tmp/movieRatings')
g.dropIndex("edges")
g.setMaxBufferSize(1000)

occupations = [0:'other', 1:'academic/educator', 2:'artist',
  3:'clerical/admin', 4:'college/grad student', 5:'customer service',
  6:'doctor/health care', 7:'executive/managerial', 8:'farmer',
  9:'homemaker', 10:'K-12 student', 11:'lawyer', 12:'programmer',
  13:'retired', 14:'sales/marketing', 15:'scientist', 16:'self-employed',
  17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer'] 

Parsing Movie Data

The file movie.dat contains a list of movies. Each row of the raw file has 3 columns: movieId, title, and generas.

marko$ more -n6 movies.dat 
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy

The code to parse this data into Neo4j and according to the diagrammed schema is presented below. The details of the code are left to focused eye of the interested reader. For others, simply cutting and pasting this code into the Gremlin console will suffice to move forward.

new File('movies.dat').eachLine {def line ->
  def components = line.split('::');
  def movieVertex = g.addVertex(['type':'Movie', 'movieId':components[0].toInteger(), 'title':components[1]]);
  components[2].split('\\|').each { def genera ->
    def hits = g.idx(T.v)[[genera:genera]].iterator();
    def generaVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Genera', 'genera':genera]);
    g.addEdge(movieVertex, generaVertex, 'hasGenera');
  }
}

Parsing User Data

The file users.dat contains a list of users. Each row of the raw file has 5 columns: userId, gender, age, occupation, and zipcode. For schema simplicity, the zipcode field will be ignored.

ml-1m$ more -n6 users.dat 
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
new File('users.dat').eachLine {def line ->
  def components = line.split('::');
  def userVertex = g.addVertex(['type':'User', 'userId':components[0].toInteger(), 'gender':components[1], 'age':components[2].toInteger()]);
  def occupation = occupations[components[3].toInteger()];
  def hits = g.idx(T.v)[[occupation:occupation]].iterator();
  def occupationVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Occupation', 'occupation':occupation]);
  g.addEdge(userVertex, occupationVertex, 'hasOccupation');
}

Parsing Ratings Data

The file ratings.dat contains the ratings that a user provided for the movies they watched. Each row of the raw file has 4 columns: userId, movieId, stars (1-5 rating scale), and timestamp. The timestamp field will be ignored.

ml-1m$ more -n6 ratings.dat 
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

Given that there are 1 million ratings, this code fragment will take a minute or so to evaluate. For those dealing with large datasets and Neo4j, its possible to use Neo4jBatchGraph which is new with Blueprints 1.0.

new File('ratings.dat').eachLine {def line ->
  def components = line.split('::');
  def ratedEdge = g.addEdge(g.idx(T.v)[[userId:components[0].toInteger()]].next(), g.idx(T.v)[[movieId:components[1].toInteger()]].next(), 'rated');
  ratedEdge.setProperty('stars', components[2].toInteger());
}

To commit any data left over in the transaction buffer, successfully stop the current transaction. Now the data is persisted to disk. If you plan on leaving the Gremlin console, be sure to g.shutdown() the graph first.

g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS)

Before moving onto recommender algorithms, lets make sure the graph looks correct.

gremlin> g.V.count()
==>9962
gremlin> g.E.count()
==>1012657
gremlin> g.V[[type:'Movie']].count()
==>3883
gremlin> g.V[[type:'Genera']].count()
==>18
gremlin> g.V[[type:'User']].count()  
==>6040
gremlin> g.V[[type:'Occupation']].count()
==>21
gremlin> occupations.size()
==>21

As a side: What is the distribution of occupations amongst the user population?

gremlin> m = [:]           
gremlin> g.V[[type:'User']].out('hasOccupation').occupation.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}
==>college/grad student=759
==>other=711
==>executive/managerial=679
==>academic/educator=528
==>technician/engineer=502
==>programmer=388
==>sales/marketing=302
==>writer=281
==>artist=267
==>self-employed=241
==>doctor/health care=236
==>K-12 student=195
==>clerical/admin=173
==>scientist=144
==>retired=142
==>lawyer=129
==>customer service=112
==>homemaker=92
==>unemployed=72
==>tradesman/craftsman=70
==>farmer=17

What about the average age?

gremlin> g.V[[type:'User']].age.mean()                                         
==>30.639238410596025

Traversing the MovieLens Graph

Now that the MovieLens data has been parsed and represented as a graph, the data is ready to be traversed (i.e. queried). There are two general types of recommendation: collaborative filtering and content-based recommendation.

In collaborative filtering, the rating/liking/preference behavior of users is correlated in order to recommend the favorites of one user to another, similar user.

I like the movies you like, what other movies do you like that I haven’t seen?

With content-based recommendation, if a particular item is liked, then its features are analyzed in order to find other items with analogous features.

I like romantic zombie comedies, what other romantic zombie comedies are there?

Basic Collaborative Filtering

This section will build up an ever more complex traversal in order to explain the foundation of graph-based collaborative filtering and hint at the variations that can be incorporated to meet domain specific requirements. Lets start from Toy Story the movie.

gremlin> v = g.idx(T.v)[[title:'Toy Story (1995)']] >> 1                                             
==>v[1]
gremlin> v.map()                                        
==>movieId=1
==>title=Toy Story (1995)
==>type=Movie
  • Which users gave Toy Story more than 3 stars? (only return 5 results)
gremlin> v.inE('rated').filter{it.getProperty('stars') > 3}.outV.userId[0..4] 
==>v[3902]
==>v[3912]
==>v[3916]
==>v[3918]
==>v[3920]

This traversal doesn’t yield useful information. However, when it is used within a larger path expression, collaborative filtering is effected.

  • Which users gave Toy Story more than 3 stars and what other movies did they give more than 3 stars to? (only return 5 results)
gremlin> v.inE('rated').filter{it.getProperty('stars') > 3}.outV.outE('rated').filter{it.getProperty('stars') > 3}.inV.title[0..4]
==>One Flew Over the Cuckoo's Nest (1975)
==>Erin Brockovich (2000)
==>Bug's Life, A (1998)
==>Ben-Hur (1959)
==>Christmas Story, A (1983)

  1. Start from Toy Story — v
  2. Get the incoming rated edges — inE(‘rated’)
  3. Filter out those edges whose star property is less than 4 — filter{it.getProperty(‘stars’) > 3}
  4. Get the tail user vertices of the remaining edges — outV
  5. Get the rating edges of those user vertices — outE(‘rated’)
  6. Filter out those edges whose star property is less than 4 — filter{it.getProperty(‘stars’) > 3}
  7. Get the head movie vertices of the remaining edges — inV
  8. Get the string title property of those movie vertices — title
  9. Only emit the first 5 titles — [0..4]

What is the meaning of the aggregate of these atomic-steps? — highly co-rated. With Gremlin, it is possible to work at a higher level of abstraction by making use of a user defined steps. The step defined below is called corated and it bundles all these atomic-steps together.

gremlin> Gremlin.defineStep('corated',[Vertex,Pipe], { def stars ->
  _().inE('rated').filter{it.getProperty('stars') > stars}.outV.outE('rated').filter{it.getProperty('stars') > stars}.inV})
==>null
gremlin> v.corated(3).title[0..4]
==>One Flew Over the Cuckoo's Nest (1975)
==>Erin Brockovich (2000)
==>Bug's Life, A (1998)
==>Ben-Hur (1959)
==>Christmas Story, A (1983)

This traversal will return a list of movies. Notice that this list has many duplicates. This is due to the fact that users who like Toy Story also like many of the same other movies. This similarity of human behavior is what is capitalized on in recommendation algorithms.

  • How many of Toy Story’s highly co-rated movies are unique?
gremlin> v.corated(3).count()      
==>268493
gremlin> v.corated(3).uniqueObject.count()
==>3353

Given that there are 268,493 highly rated paths from Toy Story to other movies and only 3,353 of those movies are unique, it is possible to use these duplicates as a ranking mechanism–ultimately, a recommendation.

  • Which movies are most highly co-rated with Toy Story? (return the top 10)
gremlin> m = [:]                                                                                                  
gremlin> v.corated(3).title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] 
==>Toy Story (1995)=1655
==>Star Wars: Episode V - The Empire Strikes Back (1980)=1000
==>Star Wars: Episode IV - A New Hope (1977)=998
==>American Beauty (1999)=949
==>Matrix, The (1999)=925
==>Raiders of the Lost Ark (1981)=922
==>Silence of the Lambs, The (1991)=887
==>Saving Private Ryan (1998)=878
==>Back to the Future (1985)=876
==>Shawshank Redemption, The (1994)=875

The highest ranked result makes sense. It says, people who like Toy Story, also like Toy Story. In order to remove these reflexive paths, simply filter out Toy Story.

gremlin> m = [:]                                                                                                  
gremlin> v.corated(3).filter{it != v}.title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] 
==>Star Wars: Episode V - The Empire Strikes Back (1980)=1000
==>Star Wars: Episode IV - A New Hope (1977)=998
==>American Beauty (1999)=949
==>Matrix, The (1999)=925
==>Raiders of the Lost Ark (1981)=922
==>Silence of the Lambs, The (1991)=887
==>Saving Private Ryan (1998)=878
==>Back to the Future (1985)=876
==>Shawshank Redemption, The (1994)=875
==>Toy Story 2 (1999)=871

These are all fine movies and ones that are generally considered “good.” However, note that the recommendation traversal starts not from a particular user, but from a particular movie (i.e. Toy Story). If instead the traversal starts from a particular user, then basic collaborative filtering is implemented.

Given a user, what movies do they like, who else likes those movies, and what other movies do those users like that are not already liked by that initial user.

Expressing this in Gremlin is left as an exercise to the more than casually interested reader. Feel free to email me your solution for comments (or post it to the Gremlin-Users mailing list).

Mixing in Content-Based Recommendation

The previous section returned generally good movies. However, if I’m a parent and I want to find a good movie like Toy Story for my children to watch, I probably do not want them watching Silence of the Lambs. To remedy this situation, it is possible to mix collaborative filtering and content-based recommendation into a traversal that yields “Toy Story good” movies of the same genera.

gremlin> v.out('hasGenera').genera
==>Animation
==>Children's
==>Comedy
  • Which movies are most highly co-rated with Toy Story that share a genera with Toy Story? (return the top 10)
gremlin> m = [:]
gremlin> x = [] as Set
gremlin> v.out('hasGenera').aggregate(x).back(2).corated(3).filter{it != v}.out('hasGenera').retain(x).back(2).title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
==>American Beauty (1999)=949
==>Back to the Future (1985)=876
==>Toy Story 2 (1999)=871
==>Princess Bride, The (1987)=851
==>Groundhog Day (1993)=843
==>Shakespeare in Love (1998)=807
==>Forrest Gump (1994)=775
==>Men in Black (1997)=747
==>E.T. the Extra-Terrestrial (1982)=737
==>Bug's Life, A (1998)=700

This ranking makes sense, but still, for the stiflingly worrisome parent, movies like American Beauty (classified as a comedy) may not be considered appropriate for children. How about only considering those movies that share all the same generas with Toy Story?

  • Which movies are most highly co-rated with Toy Story that share all generas with Toy Story? (return the top 10)
gremlin> m = [:] 
gremlin> x = [] as Set                                                                                                                                    
gremlin> v.out('hasGenera').aggregate(x).back(2).corated(3).filter{it != v}.filter{it.out('hasGenera')>>[] as Set == x}.title.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]                                                                                                    
==>Toy Story 2 (1999)=871
==>Bug's Life, A (1998)=700
==>Chicken Run (2000)=465
==>American Tail, An (1986)=140
==>Aladdin and the King of Thieves (1996)=39
==>American Tail: Fievel Goes West, An (1991)=37
==>Rugrats Movie, The (1998)=34
==>Adventures of Rocky and Bullwinkle, The (2000)=21
==>Saludos Amigos (1943)=4

Conclusion

Recommendation is a popular technique used for separating the wheat from the chaff. For systems with numerous items, helping a user find items that not only satiate their tastes, but also their momentary needs, the algorithms presented can be used. It is important to note that at some level of abstraction, collaborative filtering and content-based recommendation are the same. This idea is made salient when considering that ratings are yet another feature of an item. More specifically, highly co-rated movies of a movie are features of that movie. This idea gets to the flexibility of the property graph data structure and the notion of abstract/inferred/derived relationships.

The itemization below presents recommendation themes that can be further explored by the interested reader. The graph allows an analyst to slice and dice the data in various ways. With enough data (many types of things and many types of relationships), there are numerous mappings that can be made. That is the power of the graph traversal pattern.

  • Make use of genera taxonomies to explore genera generalization and specification.
  • Determine if gender is a significant factor in the preference of movies.
  • Recommend movies with filtering based on the occupation (or age) of the user.
  • Mix in other movie features such as director, actors, and motion picture ratings.
  • Makes use of a collection of start movies (e.g. “Given Toy Story and Big Trouble in Little China, …”). Realize that a single user can be seen as a reference to such a collection (the movies they highly rated).
  • “What do the users prefer who don’t like the same movies as me?”
  • “What do the users dislike who like the movies I dislike?” (non-collaborative filtering)
  • Use gender, zipcode, and ratings to recommend users a movie watching date (“watch movie X with user Y“).
  • Use range filters (e.g. [0..100]) or random walks for fined grained control of the execution speed (i.e. path sampling).

Related Articles

Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Lecture Notes in Artificial Intelligence, volume 5712, pp. 813–820, April 2009.

Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques and Applications, eds. S. Sakr, E. Pardede, IGI Global, ISBN:9781613500538, August 2011.

Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., “A Recommender System to Support the Scholarly Communication Process,” KRS-2009-02, April 2009.

Follow

Get every new post delivered to your Inbox.

Join 137 other followers

%d bloggers like this: