Project 5 - An Indexer
Due - Friday, November 14, 2008
The goal of this project is to give you more experience with linked
lists. For this project, you will create an indexer similar to what
might be used by a search engine. Your indexer will process a large
text file and create a sorted array of all words
occurring in the document. For each word, you will keep a linked
list of positions where the word occurs. You will also
provide two look-up operations. The first operation will take as input
a single word and will return all positions where the word occurs. The
second operation will take as input a two-word sequence and will return
all positions where the sequence occurs.
Your program will operate in two steps. In step 1 you will process the
input text file and build the index: the array of words and the positions
at which each word occurs. In step 2 you will process a second file
containing several one- and two-word queries. For each query, you will
perform a lookup, the result of which will be a linked list containing
all of the positions where the word or phrase occurs. You will write
each result to a text file in the format
word1 word2: position1 position2
For example, the line
and the: 34 78 356
would indicate that the phrase "and the" appears at positions 34, 78,
and 356 in the document.
Following is the design I expect you to implement. You may extend this
design, implementing additional classes and methods as necessary.
However, if you wish to change this design you must first seek approval
from me.
LinkedList
The LinkedList class will be a standard linked list. You may use the LinkedList class you wrote for Lab 7.
WordEntry
The WordEntry class will contain two data members: a String
representing a particular word and a LinkedList of Integers which
represent the positions where the word occurs. The class will also
support the following methods:
getWord()
- This method will return the word represented by this entry.
addPosition(position)
- This method will take as input an int representing a position and will
add it to the tail of the linked list containing the positions where the
word occurs.
getPositionList()
- This method will return the linked list of positions where the word
occurs.
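The WordEntry description above could be sketched roughly as follows. This is only an illustration: java.util.LinkedList stands in for your Lab 7 LinkedList class, and the constructor signature (a word plus its first position) is an assumption, not part of the required design.

```java
import java.util.LinkedList;

// Sketch only: java.util.LinkedList stands in for the Lab 7 LinkedList class.
class WordEntry {
    private final String word;
    private final LinkedList<Integer> positions = new LinkedList<>();

    // Assumed constructor shape: the word together with its first position.
    WordEntry(String word, int firstPosition) {
        this.word = word;
        positions.addLast(firstPosition);
    }

    // Returns the word represented by this entry.
    String getWord() {
        return word;
    }

    // Appends to the tail; if positions arrive in document order,
    // the list stays sorted in increasing order.
    void addPosition(int position) {
        positions.addLast(position);
    }

    // Returns the list of positions where the word occurs.
    LinkedList<Integer> getPositionList() {
        return positions;
    }
}
```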
Index
The Index class will contain two data members: an array of WordEntry
objects and an int to represent the number of entries currently
contained in the array. The class will also support the following
methods:
addPosition(word, position)
- This method
will take as input a word and a position and will add the entry to the
index. To accomplish this, the method will first call the find method
to determine whether the word is already in the array of WordEntry
objects. If so, it will simply add the new position to the linked list
contained in the appropriate WordEntry object. If not, it will create a
new WordEntry object containing the word and position and will insert
it in the appropriate location in the array of WordEntry objects.
find(word)
- This method will take as
input a word and will perform a binary search
to find the position of the WordEntry object containing this word. It
will return the position found or -1 if the word does not appear in the
array of WordEntry objects. Hint: you will likely need to write a
helper method to assist with the binary search.
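One possible shape for the binary search and its helper is sketched below. To keep the example short it searches a plain sorted array of Strings rather than WordEntry objects; the class name FindSketch and the parameter names are made up for illustration.

```java
class FindSketch {
    // Returns the index of word within sorted[0..count-1], or -1 if absent.
    // Only the first count slots are assumed to hold entries.
    static int find(String[] sorted, int count, String word) {
        return find(sorted, word, 0, count - 1);
    }

    // Recursive helper: searches the inclusive range low..high.
    private static int find(String[] sorted, String word, int low, int high) {
        if (low > high) {
            return -1; // empty range, so the word is not present
        }
        int mid = low + (high - low) / 2;
        int cmp = word.compareTo(sorted[mid]);
        if (cmp == 0) {
            return mid;
        } else if (cmp < 0) {
            return find(sorted, word, low, mid - 1); // look in the left half
        } else {
            return find(sorted, word, mid + 1, high); // look in the right half
        }
    }
}
```

In the Index class the comparison would use the word stored in each WordEntry (for example, via getWord()) instead of the array element itself.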
insert(wordentry)
- This method will take as input a WordEntry object and will insert it
in the array at the correct position. The array will be sorted
alphabetically by the word contained in each WordEntry object. Super
important hint: in the constructor of Index, you should create an array
of some default size, say 1000. As you insert objects, you may discover
that you need to allocate more space. If the array is full and you wish
to insert another object, you will first allocate an array of twice the
size of the current array, copy all elements from the original array
to the new array, set the array variable of the Index object to point
to the new array, then perform the insertion procedure described above.
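The grow-then-insert idea from the hint might look something like the sketch below. It stores plain Strings instead of WordEntry objects, uses a tiny default capacity so the doubling is easy to observe, and shifts larger elements to the right to open a slot; the class name SortedWords and the shifting strategy are assumptions for illustration, not the required implementation.

```java
class SortedWords {
    private String[] words = new String[2]; // tiny default so growth is easy to see
    private int count = 0;

    // Inserts word so that words[0..count-1] stays alphabetically sorted.
    void insert(String word) {
        if (count == words.length) {
            // Array is full: allocate twice the space and copy everything over.
            String[] bigger = new String[words.length * 2];
            for (int i = 0; i < count; i++) {
                bigger[i] = words[i];
            }
            words = bigger;
        }
        // Shift larger entries one slot right to open a gap for the new word.
        int i = count - 1;
        while (i >= 0 && words[i].compareTo(word) > 0) {
            words[i + 1] = words[i];
            i--;
        }
        words[i + 1] = word;
        count++;
    }

    int size() { return count; }
    int capacity() { return words.length; }
    String get(int i) { return words[i]; }
}
```

Inserting "the", "and", and "cat" in that order triggers one doubling on the third insert, since the assumed starting capacity is 2.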
search(word)
- This method will take as input a query word and will return a
LinkedList containing the positions where the word occurs in the
document.
search(word1, word2)
- This method
will take as input two query words and will return a LinkedList
containing the positions where the phrase occurs in the document. You
will first perform a search for word1 to retrieve the list of positions
where it occurs. You will then perform a search for word2 to retrieve
the list of positions where it occurs. You will then create (and
return) a third LinkedList that contains the positions where the phrase
occurs. To build the third list, you will essentially need to calculate
an offset intersection of the other two lists: a position p belongs in
the third list when p appears in the word1 list and p + 1 appears in the
word2 list. For example, if word1 appears at position 34 and word2
appears at position 35, the phrase occurs starting at position 34 and 34
should be added to the third list. Keep in mind that each position list
is already sorted, because new positions are always added at the tail of
the list.
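Because both position lists are sorted, the phrase positions can be found in a single pass over the two lists. The sketch below shows one way to do that; it uses java.util.LinkedList in place of your own LinkedList class, assumes each list is strictly increasing, and the names PhraseSketch and phrasePositions are hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedList;

class PhraseSketch {
    // Returns every position a in first such that a + 1 occurs in second.
    // Assumes both lists are sorted in strictly increasing order.
    static LinkedList<Integer> phrasePositions(LinkedList<Integer> first,
                                               LinkedList<Integer> second) {
        LinkedList<Integer> result = new LinkedList<>();
        Iterator<Integer> itA = first.iterator();
        Iterator<Integer> itB = second.iterator();
        if (!itA.hasNext() || !itB.hasNext()) {
            return result; // one of the words never occurs
        }
        int a = itA.next();
        int b = itB.next();
        while (true) {
            if (b == a + 1) {
                // word2 directly follows word1: the phrase starts at a.
                result.addLast(a);
                if (!itA.hasNext() || !itB.hasNext()) break;
                a = itA.next();
                b = itB.next();
            } else if (b <= a) {
                // b is too small to follow a; advance in the second list.
                if (!itB.hasNext()) break;
                b = itB.next();
            } else {
                // b is beyond a + 1; advance in the first list.
                if (!itA.hasNext()) break;
                a = itA.next();
            }
        }
        return result;
    }
}
```

Each step advances at least one iterator, so the running time is proportional to the combined length of the two lists.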
FileProcessor
The FileProcessor class will open the text file and build the Index. It
will have one method:
buildIndex(file)
- This method will open the file, read in words one at a time, and add
each word and its corresponding position to an Index object. It will
then return the Index object.
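The word-reading loop could be sketched as below. The PositionSink interface is a hypothetical stand-in for the Index class so the example compiles on its own, and counting positions from 0 is an assumption; the assignment does not say where numbering starts.

```java
import java.util.Scanner;

// Hypothetical stand-in for Index; only addPosition is needed here.
interface PositionSink {
    void addPosition(String word, int position);
}

class BuildIndexSketch {
    // Reads whitespace-separated words and reports each one with its position.
    // Assumption: positions count words, starting at 0.
    static int readWords(Scanner in, PositionSink index) {
        int position = 0;
        while (in.hasNext()) {
            index.addPosition(in.next(), position);
            position++;
        }
        return position; // total number of words read
    }
}
```

To read from a file, the Scanner can be constructed from a java.io.File, which requires handling FileNotFoundException.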
SearchProcessor
The SearchProcessor class will open the file containing the query
terms, process the queries, and write the results to a new file. It
will have one method:
processQueries(file, index)
- This method
will open the file and read in queries one at a time. It will use the
index to look up the terms and will write the results to a new output
file.
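Formatting one result line in the required output format might look like the sketch below. The names QueryFormatSketch and formatResult are made up, java.util.LinkedList stands in for your own list class, and printing just the query and a colon when there are no matches is an assumption, since the assignment does not specify that case.

```java
import java.util.LinkedList;

class QueryFormatSketch {
    // Builds one output line: the query, a colon, then each position
    // preceded by a single space.
    static String formatResult(String query, LinkedList<Integer> positions) {
        StringBuilder line = new StringBuilder(query).append(':');
        for (int p : positions) {
            line.append(' ').append(p);
        }
        return line.toString();
    }
}
```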
Implementation Hints
- Visit Project Gutenberg to download some large sample text files.
- Punctuation can be a pain. For example, if a word appears
at the end of a sentence and is followed by a period (e.g., house.),
it will not match the same word without the punctuation. I will not
deduct points for projects that ignore punctuation. However, you can
also use the String replace method to strip standard punctuation.
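If you do decide to handle punctuation, a minimal sketch using String.replace is shown below. The set of marks removed here is an arbitrary choice for illustration, and the names PunctuationSketch and stripPunctuation are hypothetical.

```java
class PunctuationSketch {
    // Removes a small, arbitrary set of punctuation marks from a word.
    static String stripPunctuation(String word) {
        String[] marks = {".", ",", ";", ":", "!", "?", "\""};
        for (String mark : marks) {
            // String.replace treats both arguments as literal text, not regex.
            word = word.replace(mark, "");
        }
        return word;
    }
}
```

With this in place, "house." and "house" would be indexed as the same word.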
Due 9:40AM, Friday, November 14, 2008
- Complete and submit your working code. Turn in a hard copy
in class and place a copy of your .java files in /home/submit/cs112-f08/username.
Note: No portion of your code may be copied from any other
source, including a textbook, a web page, or another student
(current or former). You must provide citations for any sources you
have used in designing and implementing your program.
Sami Rollins