Lab 7: Finding similar users

Due Wednesday, May 3 11:55pm. Submission through SVN.
Please submit your work in the SVN directory
https://www.cs.usfca.edu/svn/< your username >/cs112/lab7

e.g.
https://www.cs.usfca.edu/svn/ejung/cs112/lab7

Requirements

How to find similar users

  • Finding similar users is not an easy problem. To get started, let's find users with the most movies in common. Here is the pseudo-code.
  • for each user_i1
      for each user_i2, i1!=i2
        common_i2 = 0;
        for each movie_j
          if (user_i1 and user_i2 both have seen movie_j) 
            common_i2++;
      find the user_i2 with the highest common_i2. 
      If there are multiple users with the highest common_i2, then print them all.
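
  • The pseudo-code above could be sketched in Python roughly as follows. This is only an illustration, not the required solution: the function name most_common_users, the sample users, and the choice to store ratings as a dictionary of user -> {movie: rating} are all assumptions made for the example.

```python
def most_common_users(ratings):
    """For each user, return (highest count, users sharing that many movies).

    `ratings` is assumed to map user -> {movie: rating}; a movie is
    treated as "seen" when it appears in that user's dictionary.
    """
    result = {}
    for u1, seen1 in ratings.items():
        counts = {}
        for u2, seen2 in ratings.items():
            if u1 == u2:
                continue
            # The key intersection is the set of movies both users have seen.
            counts[u2] = len(seen1.keys() & seen2.keys())
        best = max(counts.values(), default=0)
        # Keep every user tied for the highest count, not just the first.
        result[u1] = (best, sorted(u for u, c in counts.items() if c == best))
    return result


# Hypothetical sample data, for illustration only.
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 3},
    "bob": {"m1": 1, "m2": 2, "m3": 3},
    "carol": {"m1": 5, "m4": 2},
}

for user, (count, matches) in most_common_users(ratings).items():
    print(user, count, matches)
```

    In this sample, carol shares exactly one movie with each of the other two users, so both are reported for her.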
    
  • This approach has many shortcomings. For instance, if two users have seen the same 5 movies, and one user rated them (5,4,3,2,1) and the other rated them (1,2,3,4,5), their preferences are opposite, but the approach above ignores that. A better way is to compare their ratings for the movies.
  • for each user_i1
      for each user_i2, i1!=i2
        common_i2 = 0;
        for each movie_j
          if (user_i1 and user_i2 both have seen movie_j) 
            common_i2 = common_i2 + the absolute difference of their ratings;
      find the user_i2 with the smallest common_i2. 
      If there are multiple users with the smallest common_i2, then print them all.
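
  • A minimal Python sketch of the rating-difference comparison, under stated assumptions: ratings are stored as user -> {movie: rating}, the difference is taken as an absolute value, and users with no movies in common are skipped (otherwise their total of 0 would look like a perfect match). The pseudo-code does not specify that last rule, and the function name and sample data are hypothetical.

```python
def closest_by_rating_sum(ratings):
    """For each user, return (smallest total difference, users with that total)."""
    result = {}
    for u1, r1 in ratings.items():
        totals = {}
        for u2, r2 in ratings.items():
            if u1 == u2:
                continue
            shared = r1.keys() & r2.keys()
            if not shared:
                # Assumption: no common movies means no basis for comparison.
                continue
            totals[u2] = sum(abs(r1[m] - r2[m]) for m in shared)
        if totals:
            best = min(totals.values())
            result[u1] = (best, sorted(u for u, t in totals.items() if t == best))
        else:
            result[u1] = (None, [])
    return result


# Hypothetical sample data: alice and bob rate the same 5 movies in opposite order.
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 3, "m4": 2, "m5": 1},
    "bob": {"m1": 1, "m2": 2, "m3": 3, "m4": 4, "m5": 5},
    "carol": {"m1": 5, "m2": 5, "m3": 3},
}

for user, (total, matches) in closest_by_rating_sum(ratings).items():
    print(user, total, matches)
```

    Here alice and bob share the most movies, yet their total difference is 12, while alice and carol differ by only 1, so carol is reported as the closest user to alice.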
    
  • The second approach still has shortcomings: it ignores how many movies the two users have watched in common. Suppose user_0 and user_1 have only movie_0 in common and gave it 3 stars and 5 stars respectively (a total difference of 2). They come out more similar than user_0 and user_2, who have both watched movie_1, movie_2, and movie_3 and whose ratings differ by 1 star on each of the 3 movies (a total difference of 3). To fix this, you can normalize the total rating difference by the number of commonly watched movies, which gives 2/1 = 2 for user_1 and 3/3 = 1 for user_2.
  • for each user_i1
      for each user_i2, i1!=i2
        common_i2 = 0;
        total_i2 = 0;
        for each movie_j
          if (user_i1 and user_i2 both have seen movie_j) 
            common_i2 = common_i2 + the absolute difference of their ratings;
            total_i2++;
      find the user_i2 with non-zero total_i2 and the smallest common_i2/total_i2. 
      If there are multiple users with the smallest common_i2/total_i2, then print them all.
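
  • The normalized version could be sketched in Python as below. As before, the user -> {movie: rating} layout, the function name, and the exact star values in the sample data are assumptions for illustration. Fraction is used here only so that ties are compared exactly instead of with floating-point division; that is a design choice of this sketch, not a requirement of the lab.

```python
from fractions import Fraction


def closest_by_average_diff(ratings):
    """For each user, return (smallest average difference, users with that average)."""
    result = {}
    for u1, r1 in ratings.items():
        averages = {}
        for u2, r2 in ratings.items():
            if u1 == u2:
                continue
            shared = r1.keys() & r2.keys()
            if not shared:
                # total_i2 would be zero, so this user is not a candidate.
                continue
            common = sum(abs(r1[m] - r2[m]) for m in shared)
            averages[u2] = Fraction(common, len(shared))
        if averages:
            best = min(averages.values())
            result[u1] = (best, sorted(u for u, a in averages.items() if a == best))
        else:
            result[u1] = (None, [])
    return result


# Sample data modeled on the example in the text: user_0 and user_1 differ by
# 2 stars on one movie; user_0 and user_2 differ by 1 star on three movies.
ratings = {
    "user_0": {"movie_0": 3, "movie_1": 4, "movie_2": 4, "movie_3": 4},
    "user_1": {"movie_0": 5},
    "user_2": {"movie_1": 3, "movie_2": 5, "movie_3": 3},
}

for user, (average, matches) in closest_by_average_diff(ratings).items():
    print(user, average, matches)
```

    With normalization, user_2 (average difference 1) is reported as closer to user_0 than user_1 (average difference 2), which reverses the result of the unnormalized comparison.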