Project 2 - Document Similarity

Due - Monday September 24, 2007

The goal of this project is to give you more experience with file I/O, arrays, and string manipulation. For this project, you will write a program that determines how similar two text files are. You will count how often each word occurs in each document and calculate the cosine similarity of the two documents. See http://wordhoard.northwestern.edu/userman/analysis-comparingtexts.html for additional information.

First, you will need to process each document and create a sorted array of all words that appear in either document. Along with the word, you will need to keep the count of how many times it appears in document 1 and how many times it appears in document 2. I recommend using a single array that stores objects that contain a word and two integers representing the frequency count for each document. This will give you a score vector for document 1 and a score vector for document 2, where the score is the frequency with which the word appears in the document. Note that 0 is a valid score.
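One possible shape for that single sorted array is sketched below. This is only an illustration, not a required design: the names WordTable, WordCount, add, doc1Count, and doc2Count are made up for this example, and the sketch assumes the caller has already normalized each word (for instance, lower-cased it and stripped punctuation). It keeps the array sorted as words arrive by using a binary search to find each word or its insertion point.

```java
import java.util.Arrays;

/** Hypothetical sketch: a sorted array of words with per-document counts. */
class WordTable {
    /** One entry: a word and its frequency in each of the two documents. */
    static class WordCount {
        final String word;
        int doc1Count; // occurrences in document 1 (0 is valid)
        int doc2Count; // occurrences in document 2 (0 is valid)

        WordCount(String word) {
            this.word = word;
        }
    }

    private WordCount[] entries = new WordCount[16];
    private int size = 0;

    /** Records one occurrence of word in document doc (1 or 2). */
    void add(String word, int doc) {
        if (doc != 1 && doc != 2) {
            throw new IllegalArgumentException("doc must be 1 or 2");
        }
        int pos = find(word);
        if (pos < 0) {
            // Word not present yet: open a slot at its sorted position.
            pos = -(pos + 1);
            if (size == entries.length) {
                entries = Arrays.copyOf(entries, size * 2);
            }
            System.arraycopy(entries, pos, entries, pos + 1, size - pos);
            entries[pos] = new WordCount(word);
            size++;
        }
        if (doc == 1) {
            entries[pos].doc1Count++;
        } else {
            entries[pos].doc2Count++;
        }
    }

    /** Binary search: returns the index of word, or -(insertionPoint + 1). */
    private int find(String word) {
        int lo = 0;
        int hi = size - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = entries[mid].word.compareTo(word);
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -(lo + 1);
    }

    int size() {
        return size;
    }

    WordCount get(int index) {
        return entries[index];
    }
}
```

Reading the doc1Count fields in index order gives the score vector for document 1, and the doc2Count fields give the score vector for document 2. Sorting once after all words have been read would work equally well; inserting in order is simply one option.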

Once you have processed both documents, you will need to calculate the cosine similarity. According to the WordHoard web page referenced above, the cosine similarity is "the vector dot product of the score vectors for the two works divided by the square root of the product of the vector dot products of each score vector with itself". An example follows:

Document 1: The cat and the dog ran.
Document 2: The white cat and the brown cat played.


Word              and  brown  cat  dog  played  ran  the  white
Doc 1 Frequency     1      0    1    1       0    1    2      0
Doc 2 Frequency     1      1    2    0       1    0    2      1

The vector dot product of the vectors is: (1*1)+(0*1)+(1*2)+(1*0)+(0*1)+(1*0)+(2*2)+(0*1) = 7
The square root of the dot product of vector 1 and itself is: sqrt((1*1)+(0*0)+(1*1)+(1*1)+(0*0)+(1*1)+(2*2)+(0*0)) = 2.83
The square root of the dot product of vector 2 and itself is: sqrt((1*1)+(1*1)+(2*2)+(0*0)+(1*1)+(0*0)+(2*2)+(1*1)) = 3.46
Cosine Similarity = 7/(2.83*3.46) = 0.71 (rounded to two decimal places)
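The arithmetic in this example can be sketched in Java as follows. This is a minimal illustration rather than a required structure: the class and method names (CosineSimilarity, dot, similarity) are invented here, the two score vectors are assumed to be int arrays of equal length in the same word order, and returning 0 when either vector is all zeros is an assumption of this sketch, since the definition above does not cover that case.

```java
/** Hypothetical sketch of the cosine similarity calculation. */
class CosineSimilarity {
    /** Dot product of two frequency vectors of equal length. */
    static long dot(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (long) a[i] * b[i];
        }
        return sum;
    }

    /**
     * Dot product of the two vectors divided by the square root of the
     * product of each vector's dot product with itself.
     */
    static double similarity(int[] doc1, int[] doc2) {
        double denominator = Math.sqrt((double) dot(doc1, doc1) * dot(doc2, doc2));
        if (denominator == 0.0) {
            return 0.0; // assumption: a document with no words has similarity 0
        }
        return dot(doc1, doc2) / denominator;
    }
}
```

Calling similarity with the two frequency rows from the example, {1, 0, 1, 1, 0, 1, 2, 0} and {1, 1, 2, 0, 1, 0, 2, 1}, gives 7 divided by the square root of 8*12, which is approximately 0.71.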

In addition to the similarity score, your program must generate and display the following statistics about each document:

  1. The total number of words in the document.
  2. The total number of lines in the document.
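
Both statistics can be gathered in a single pass over a document. The sketch below is one way to do it, under stated assumptions: the names DocumentStats and count are invented for this example, a "word" is taken to be any run of non-whitespace characters, and blank lines count as lines. It reads from a Reader so that it works with a FileReader for real files or a StringReader for quick tests.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical sketch: line and word totals for one document. */
class DocumentStats {
    int lines; // total number of lines, including blank ones
    int words; // total number of whitespace-separated tokens

    /** Reads the whole source once and tallies lines and words. */
    static DocumentStats count(Reader source) throws IOException {
        DocumentStats stats = new DocumentStats();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            stats.lines++;
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                // Assumption: words are separated by runs of whitespace.
                stats.words += trimmed.split("\\s+").length;
            }
        }
        return stats;
    }
}
```

If your program defines words differently when building the frequency array (for example, by splitting on punctuation), the word total here should use the same rule so the numbers agree.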

Grading:

  Points  Criterion
  15      Overall design and documentation
  10      Compiles and runs
  10      Demonstration and oral responses
  15      Correctly processes input files
  20      Correctly generates sorted array(s)
  20      Correctly calculates similarity score
  10      Correctly calculates and displays document statistics

Note that a design meeting is not required for this project. However, if you schedule a design meeting you will be guaranteed 5/15 design points. If you do not schedule a design meeting, the 15 design points will be awarded based on the design of the program you submit.

I recommend that you test your program by using some of the large texts that you can download from Project Gutenberg. For example, you might compare Shakespeare's All's Well that Ends Well to The Comedy of Errors.

Due 9:40AM Monday September 24, 2007

  1. Complete and submit your working code. Turn in a hard copy in class and place a copy of your .java files in /home/submit/cs112/username.
  2. Make sure that each function is well documented. Your documentation should specify the type and function of the input parameters and output.
  3. Run your program on a variety of inputs ensuring that all error conditions are handled correctly.

Note: No portion of your code may be copied from any other source, including a textbook, a web page, or another student (current or former). You must provide citations for any sources you have used in designing and implementing your program.

Sami Rollins