The goal of this project is to give you more experience with file I/O,
arrays, and string manipulation. For this project, you will write a program
that determines whether two text files are similar. You will count the
frequency with which each word occurs in each document and calculate the
cosine similarity for the documents. See http://wordhoard.northwestern.edu/userman/analysis-comparingtexts.html
for additional information.
First, you will need to process each document and create a sorted array of
all words that appear in either document. Along with the word, you will need
to keep the count of how many times it appears in document 1 and how many
times it appears in document 2. I recommend using a single array that stores
objects that contain a word and two integers representing the frequency count
for each document. This will give you a score vector for document 1
and a score vector for document 2, where the score is the frequency
with which the word appears in the document. Note that 0 is a valid score.
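The recommended structure might look something like the following Python sketch. It is only one possible layout, not a required design. The names WordEntry, tokenize, and build_entries are hypothetical, and the tokenization rule (lowercase, runs of letters and apostrophes) is an assumption, since the assignment does not define what counts as a word. The sketch takes strings rather than reading files, and it uses a dictionary as a shortcut while counting; your program may need to maintain the sorted array directly.

```python
import re
from dataclasses import dataclass


@dataclass
class WordEntry:
    """One word with its frequency in each document (hypothetical name)."""
    word: str
    count1: int = 0  # occurrences in document 1
    count2: int = 0  # occurrences in document 2


def tokenize(text):
    # Assumption: words are case-insensitive runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def build_entries(text1, text2):
    """Return a list of WordEntry objects sorted alphabetically by word."""
    entries = {}
    for word in tokenize(text1):
        entries.setdefault(word, WordEntry(word)).count1 += 1
    for word in tokenize(text2):
        entries.setdefault(word, WordEntry(word)).count2 += 1
    return sorted(entries.values(), key=lambda e: e.word)


entries = build_entries("The cat and the dog ran.",
                        "The white cat and the brown cat played.")
vector1 = [e.count1 for e in entries]  # score vector for document 1
vector2 = [e.count2 for e in entries]  # score vector for document 2
```

A word that appears in only one document keeps the default count of 0 for the other, which is how the 0 scores arise.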
Once you have processed both documents, you will need to calculate the cosine
similarity. According to the WordHoard web page referenced above, the cosine
similarity is "the vector dot product of the score vectors for the two works
divided by the square root of the product of the vector dot products of each
score vector with itself". Following is an example:
Document 1: The cat and the dog ran.
Document 2: The white cat and the brown cat played.
Word            | and | brown | cat | dog | played | ran | the | white |
Doc 1 Frequency | 1   | 0     | 1   | 1   | 0      | 1   | 2   | 0     |
Doc 2 Frequency | 1   | 1     | 2   | 0   | 1      | 0   | 2   | 1     |
The vector dot product of the vectors is:
(1*1)+(0*1)+(1*2)+(1*0)+(0*1)+(1*0)+(2*2)+(0*1) = 7
The square root of the dot product of vector 1 and itself is:
sqrt((1*1)+(0*0)+(1*1)+(1*1)+(0*0)+(1*1)+(2*2)+(0*0)) = 2.83
The square root of the dot product of vector 2 and itself is:
sqrt((1*1)+(1*1)+(2*2)+(0*0)+(1*1)+(0*0)+(2*2)+(1*1)) = 3.46
Cosine Similarity = 7/(2.83*3.46) = 0.71
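The calculation above can be checked with a short Python sketch. The function names dot and cosine_similarity are hypothetical, and returning 0.0 when a document has no words is an assumption; the assignment does not say how to handle that case.

```python
import math


def dot(u, v):
    """Vector dot product of two equal-length score vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product of u and v divided by sqrt((u . u) * (v . v))."""
    denom = math.sqrt(dot(u, u) * dot(v, v))
    # Assumption: treat an empty (all-zero) document as having similarity 0.
    if denom == 0:
        return 0.0
    return dot(u, v) / denom


doc1 = [1, 0, 1, 1, 0, 1, 2, 0]  # Doc 1 frequencies from the example
doc2 = [1, 1, 2, 0, 1, 0, 2, 1]  # Doc 2 frequencies from the example
print(round(cosine_similarity(doc1, doc2), 2))  # prints 0.71
```

Taking one square root of the product, as in the quoted definition, gives the same result as multiplying the two separate square roots shown in the example.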
In addition to the similarity score, your program must generate and display the following statistics about each document:
Points | Criterion |
15 | Overall design and documentation |
10 | Compiles and runs |
10 | Demonstration and oral responses |
15 | Correctly processes input files |
20 | Correctly generates sorted array(s) |
20 | Correctly calculates similarity score |
10 | Correctly calculates and displays document statistics |