Lab 7 Finding similar users
Due Wednesday, May 3 11:55pm. Submission through SVN.
Please submit your work in the SVN directory
https://www.cs.usfca.edu/svn/< your username >/cs112/lab7
e.g. https://www.cs.usfca.edu/svn/ejung/cs112/lab7
Requirements
- This lab will get you ready for the movie recommendation functionality in project 5. Imagine a two-dimensional array of doubles in which each row represents a user and each column represents a movie. Let's call the array ratingDB. ratingDB[i][j] is the rating that user_i assigned to movie_j: a value between 0.0 and 5.0 (inclusive) if user_i has watched movie_j, and -1 otherwise. To recommend a movie that user_i is likely to enjoy, your program needs to identify other users whose movie preferences are similar to user_i's.
- Your program Driver.java should prompt for an input file, create a two-dimensional double array ratingDB from the file, and find the most similar user(s) for each user. If the file cannot be opened, the program should prompt again. The first line of the input file contains the array size: the first number is the number of rows (users) and the second is the number of columns (movies). Starting from the second line, each line is one row of ratings. You may use the third example algorithm below or propose your own; if you use your own, explain in the README file how it compares to the provided ones. Note that the quality of the program is part of the grade. An example run looks like this:
Enter the file name = ratings.txt
user_0 most similar to: user_2
user_1 most similar to: user_3, user_4
user_2 most similar to: user_0
user_3 most similar to: user_1, user_4
user_4 most similar to: user_1, user_3
- In the README file, explain how your program works, and also answer the following questions.
- If you have implemented your own algorithm, explain how your program finds the most similar user(s) for each user.
- How will you integrate lab 7 code into your project 5? Explain what methods you will add to project 5, and what are the inputs and the outputs of these methods. Also, explain how these methods will be used to provide movie recommendations.
- Extra credit (up to 10%): Indicate in the README file if you want to be graded for the extra credit. If your algorithm is better at finding similar users than the third example algorithm below, you may get extra credit. For example, imagine 3 users, user_0, user_1, and user_2. user_0 and user_1 have watched 3 movies in common and rated them the same. user_0 and user_2 have watched 5 movies in common and rated them the same. In the third example algorithm, user_1 and user_2 are equally similar to user_0. However, you could argue that user_2 is more similar to user_0 because user_2 agrees with user_0 on more movies than user_1 does. Can you design an algorithm that favors user_2 over user_1?
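As a rough illustration of the input handling described in the requirements, here is a minimal Java sketch. The class name RatingFileSketch and the methods parseRatings and promptAndLoad are hypothetical names chosen for this example, and the sketch assumes the numbers in the file are separated by whitespace, which the lab description does not state explicitly.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class RatingFileSketch {
    // Reads "rows cols" followed by rows*cols ratings from an open Scanner.
    // Assumption: values are whitespace-separated (not specified by the lab).
    static double[][] parseRatings(Scanner in) {
        int rows = in.nextInt();
        int cols = in.nextInt();
        double[][] ratingDB = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // parseDouble avoids locale-dependent decimal separators
                ratingDB[i][j] = Double.parseDouble(in.next());
            }
        }
        return ratingDB;
    }

    // Keeps prompting until the named file can be opened, then parses it.
    static double[][] promptAndLoad(Scanner console) {
        while (true) {
            System.out.print("Enter the file name = ");
            String name = console.nextLine().trim();
            try (Scanner file = new Scanner(new File(name))) {
                return parseRatings(file);
            } catch (FileNotFoundException e) {
                System.out.println("Cannot open " + name + ", please try again.");
            }
        }
    }
}
```

A Driver could call promptAndLoad(new Scanner(System.in)) once at startup and pass the resulting array to whichever similarity algorithm it uses.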
How to find similar users
Finding similar users is not an easy problem. To get started, let's find users with the most movies in common. Here is the pseudo-code.
for each user_i1
    for each user_i2, i1 != i2
        for each movie_j
            if (user_i1 and user_i2 both have seen movie_j)
                common_i2++;
    find the user_i2 with the highest common_i2.
If there are multiple users with the highest common_i2, then print them all.
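The first pseudo-code could be sketched in Java roughly as follows. The class CommonCountSketch and the methods seen and mostInCommon are hypothetical names for this example, and the sketch assumes a rating of 0.0 or higher means the user has watched the movie, as described in the requirements.

```java
import java.util.ArrayList;
import java.util.List;

class CommonCountSketch {
    // A movie counts as seen when its rating is not the -1 placeholder.
    static boolean seen(double rating) {
        return rating >= 0.0;
    }

    // Returns every user that shares the most watched movies with user i1.
    static List<Integer> mostInCommon(double[][] ratingDB, int i1) {
        List<Integer> best = new ArrayList<>();
        int bestCount = -1;
        for (int i2 = 0; i2 < ratingDB.length; i2++) {
            if (i2 == i1) continue;              // skip comparing a user with itself
            int common = 0;                      // reset the counter for each pair
            for (int j = 0; j < ratingDB[i1].length; j++) {
                if (seen(ratingDB[i1][j]) && seen(ratingDB[i2][j])) {
                    common++;
                }
            }
            if (common > bestCount) {            // new best: drop earlier candidates
                bestCount = common;
                best.clear();
            }
            if (common == bestCount) {           // best or tied with best: keep it
                best.add(i2);
            }
        }
        return best;
    }
}
```

Returning a list rather than a single index keeps all tied users, so the caller can print them all.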
This example has many shortcomings. For instance, if two users have seen the same 5 movies, and one user rated them (5,4,3,2,1) and the other (1,2,3,4,5), their preferences are opposite, but the algorithm above treats them as highly similar because it only counts movies in common. A better way is to compare their ratings for those movies.
for each user_i1
    for each user_i2, i1 != i2
        for each movie_j
            if (user_i1 and user_i2 both have seen movie_j)
                common_i2 = common_i2 + the absolute difference of their ratings;
    find the user_i2 with the smallest common_i2.
If there are multiple users with the smallest common_i2, then print them all.
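The inner loop of the second pseudo-code might look like the sketch below. It assumes that "difference" means the absolute difference, since signed differences could cancel each other out. The class RatingDifferenceSketch and the method totalDifference are hypothetical names for this example.

```java
class RatingDifferenceSketch {
    // Sums |a[j] - b[j]| over the movies that both users have rated.
    // Assumption: a rating >= 0 means the movie was watched; -1 means it was not.
    static double totalDifference(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            if (a[j] >= 0 && b[j] >= 0) {
                sum += Math.abs(a[j] - b[j]);
            }
        }
        return sum;
    }
}
```

For the two users who rated the same movies (5,4,3,2,1) and (1,2,3,4,5), totalDifference returns 12.0. Without the absolute value, the signed differences (4, 2, 0, -2, -4) would sum to 0 and the two users would look identical.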
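The second example still has shortcomings: it ignores how many movies the two users have watched in common. Suppose user_0 and user_1 have only movie_0 in common and gave it 3 stars and 5 stars respectively, for a total difference of 2. Suppose user_0 and user_2 have both watched movie_1, movie_2, and movie_3, and their ratings differ by 1 star on each, for a total difference of 3. The second example ranks user_1 as more similar to user_0 (2 < 3), even though user_2 agrees more closely per movie. You can normalize the total rating difference by the number of commonly watched movies: the averages are 2/1 = 2 for user_1 and 3/3 = 1 for user_2, so user_2 now comes out as more similar.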
for each user_i1
    for each user_i2, i1 != i2
        for each movie_j
            if (user_i1 and user_i2 both have seen movie_j)
                common_i2 = common_i2 + the absolute difference of their ratings;
                total_i2++;
    find the user_i2 with non-zero total_i2 and the smallest common_i2/total_i2.
If there are multiple users with the smallest common_i2/total_i2, then print them all.
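One possible Java rendering of the third pseudo-code is sketched below. The class AverageDifferenceSketch, the methods mostSimilar and describe, and the tolerance EPS used to decide when two averages are tied are all assumptions made for this example, not part of the lab specification. The sketch also assumes absolute rating differences.

```java
import java.util.ArrayList;
import java.util.List;

class AverageDifferenceSketch {
    // Assumed tolerance for treating two floating-point averages as a tie.
    static final double EPS = 1e-9;

    // Returns the users with the smallest average rating difference from i1,
    // skipping users who share no watched movie with i1.
    static List<Integer> mostSimilar(double[][] ratingDB, int i1) {
        List<Integer> best = new ArrayList<>();
        double bestAvg = Double.MAX_VALUE;
        for (int i2 = 0; i2 < ratingDB.length; i2++) {
            if (i2 == i1) continue;
            double diff = 0.0;
            int total = 0;
            for (int j = 0; j < ratingDB[i1].length; j++) {
                if (ratingDB[i1][j] >= 0 && ratingDB[i2][j] >= 0) {
                    diff += Math.abs(ratingDB[i1][j] - ratingDB[i2][j]);
                    total++;
                }
            }
            if (total == 0) continue;            // nothing in common: cannot compare
            double avg = diff / total;
            if (avg < bestAvg - EPS) {           // strictly better than current best
                bestAvg = avg;
                best.clear();
                best.add(i2);
            } else if (Math.abs(avg - bestAvg) <= EPS) {
                best.add(i2);                    // tied with current best
            }
        }
        return best;
    }

    // Formats one output line in the style of the example run.
    static String describe(double[][] ratingDB, int i1) {
        StringBuilder sb = new StringBuilder("user_" + i1 + " most similar to: ");
        List<Integer> best = mostSimilar(ratingDB, i1);
        for (int k = 0; k < best.size(); k++) {
            if (k > 0) sb.append(", ");
            sb.append("user_").append(best.get(k));
        }
        return sb.toString();
    }
}
```

If a user shares no movie with anyone, mostSimilar returns an empty list, so a full program would need to decide what to print in that case.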