Project 4 - An Indexer
Due - Friday, April 20, 2007
The goal of this project is to give you more experience with
linked lists. For this project, you will create an indexer
similar to what might be used by a search engine. Your indexer
will process a large text file and create a sorted array of
all words occurring in the document. For each word, you will keep
a linked list of positions where the word occurs. You will
also provide two look-up operations. The first operation will
take as input a single word and will return all positions where
the word occurs. The second operation will take as input a
two-word sequence and will return all positions where the sequence
occurs.
Your program will operate in two steps. In step 1 you will
process the input text file and build the index: the array of
words, each with the positions at which it occurs. In step 2 you will
process a second file containing several one- and two-word queries.
For each query, you will perform a lookup, the result of which
will be a linked list containing all of the positions where the
word or phrase occurs. You will write the result to a text file
in the format "word1 word2: position1 position2". For
example, "and the: 34 78 356" would indicate that the phrase
"and the" appears at positions 34, 78, and 356 in the document.
Following is the design I expect you to implement. You may extend
this design, implementing additional classes and methods as
necessary. However, if you wish to change this design you must
first seek approval from me.
LinkedList
The LinkedList class will be a standard linked list. Note that you
will have to implement a Node class to support the LinkedList, and
you may not use the LinkedList class provided in java.util.
LinkedList will contain data members head and tail
and will support, at minimum, the methods insertAtHead and
insertAtTail.
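As a rough illustration of the structure described above, here is a minimal sketch of a singly linked list with head and tail references. It is one possible shape, not the required implementation: the use of generics, the field names data and next, and the toString helper are assumptions made for the example.

```java
// Sketch only: a singly linked list with head and tail references.
class Node<T> {
    T data;        // value stored in this node
    Node<T> next;  // following node, or null at the end of the list

    Node(T data) {
        this.data = data;
    }
}

class LinkedList<T> {
    Node<T> head; // first node, or null when the list is empty
    Node<T> tail; // last node, or null when the list is empty

    // Adds a value before the current first node.
    void insertAtHead(T data) {
        Node<T> node = new Node<>(data);
        node.next = head;
        head = node;
        if (tail == null) {
            tail = node; // list was empty, so the node is also the tail
        }
    }

    // Adds a value after the current last node.
    void insertAtTail(T data) {
        Node<T> node = new Node<>(data);
        if (tail == null) {
            head = node; // list was empty, so the node is also the head
        } else {
            tail.next = node;
        }
        tail = node;
    }

    // Hypothetical helper (not required by the assignment) for printing.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node<T> n = head; n != null; n = n.next) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(n.data);
        }
        return sb.toString();
    }
}
```

Keeping a tail reference is what makes insertAtTail a constant-time operation; without it, every insertion at the end would have to walk the whole list.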
WordEntry
The WordEntry class will contain two data members: a String representing a particular word and a LinkedList of Integers which represent the positions where the word occurs. The class will also support the following methods:
- getWord() - This method will return the word represented by this entry.
- addPosition(position) - This method will take as input an int representing a position and will add it to the tail of the linked list containing the positions where the word occurs.
- getPositionList() - This method will return the linked list of positions where the word occurs.
Index
The Index class will contain two data members: an array of WordEntry objects and an int to represent the number of entries currently contained in the array. The class will also support the following methods:
- addPosition(word, position) - This method will take as input a word and a position and will add the entry to the index. To accomplish this, the method will first call the find method to determine whether the word is already in the array of WordEntry objects. If so, it will simply add the new position to the linked list contained in the appropriate WordEntry object. If not, it will create a new WordEntry object containing the word and position and will insert it in the appropriate location in the array of WordEntry objects.
- find(word) - This method will take as input a word and will perform a binary search to find the array index of the WordEntry object containing this word. It will return the index found, or -1 if the word does not appear in the array of WordEntry objects. Hint: you will likely need to write a helper method to assist with the binary search.
- insert(wordentry) - This method will take as input a WordEntry object and will insert it in the array at the correct position. The array will be sorted alphabetically by the word contained in each WordEntry object. Super important hint: in the constructor of Index, you should create an array of some default size, say 1000. As you insert objects, you may discover that you need to allocate more space. If the array is full and you wish to insert another object, you will first allocate an array of twice the size of the current array, copy all elements from the original array to the new array, and set the array variable of the Index object to point to the new array. You will then perform the insertion procedure described above.
- search(word) - This method will take as input a query word and will return a LinkedList containing the positions where the word occurs in the document.
- search(word1, word2) - This method will take as input two query words and will return a LinkedList containing the positions where the phrase occurs in the document. You will first perform a search for word1 to retrieve the list of positions where it occurs. You will then perform a search for word2 to retrieve the list of positions where it occurs. You will then create (and return) a third LinkedList that contains the positions where the phrase occurs. To build the third list, you will essentially need to calculate an intersection of the other two lists, with the positions of word2 offset by one. In other words, if word1 appears at position 34 and word2 appears at position 35, the phrase occurs starting at position 34, and 34 should be added to the third list. Keep in mind that the position lists will already be sorted, because each new position is always added at the tail of its list.
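The find, insert, and two-word search steps above can be sketched as follows. This is only a sketch under simplifying assumptions, not the required design: a String array stands in for the array of WordEntry objects, int arrays stand in for the position lists, the class names SortedWords and PhraseMatch are hypothetical, and the default size is 2 rather than 1000 so that the doubling step is easy to observe.

```java
// Sketch only: a sorted, growable array of words standing in for the
// array of WordEntry objects. SortedWords is a hypothetical name.
class SortedWords {
    String[] words = new String[2]; // tiny default so growth is exercised
    int count = 0;                  // number of slots currently in use

    // Binary search; returns the array index of word, or -1 if absent.
    int find(String word) {
        int low = 0;
        int high = count - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            int cmp = word.compareTo(words[mid]);
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                high = mid - 1;
            } else {
                low = mid + 1;
            }
        }
        return -1;
    }

    // Inserts word at its sorted position, doubling the array when full.
    void insert(String word) {
        if (count == words.length) {
            String[] bigger = new String[words.length * 2];
            for (int i = 0; i < count; i++) {
                bigger[i] = words[i];
            }
            words = bigger;
        }
        int i = count - 1;
        // Shift larger words one slot to the right to open a gap.
        while (i >= 0 && words[i].compareTo(word) > 0) {
            words[i + 1] = words[i];
            i--;
        }
        words[i + 1] = word;
        count++;
    }
}

// Sketch only: phrase matching over two sorted position arrays.
class PhraseMatch {
    // Returns every position p in first such that p + 1 is in second.
    // Both inputs are assumed to be sorted in ascending order.
    static int[] phrasePositions(int[] first, int[] second) {
        int[] result = new int[Math.min(first.length, second.length)];
        int n = 0;
        int i = 0;
        int j = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + 1;
            if (second[j] == wanted) {
                result[n++] = first[i]; // phrase starts at first[i]
                i++;
                j++;
            } else if (second[j] < wanted) {
                j++; // second word occurs too early; advance it
            } else {
                i++; // no second word right after first[i]; advance it
            }
        }
        return java.util.Arrays.copyOf(result, n);
    }
}
```

Because both position lists are sorted, the phrase match needs only one pass with two cursors. The same walk applies to linked lists by advancing node references instead of array indices and calling insertAtTail for each match.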
FileProcessor
The FileProcessor class will open the text file and build the Index. It will have one method:
- buildIndex(file) - This method will open the file, read in words one at a time, and add each word and its corresponding position to an Index object. It will then return the Index object.
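Reading words one at a time while tracking positions might look like the sketch below. It makes several assumptions that the assignment does not fix: words are separated by whitespace, a position is a running word count, and the first word is at position 0. The class name WordReader and the describe method are hypothetical; a real buildIndex would call addPosition on an Index object where this sketch appends to a string.

```java
import java.util.Scanner;

// Sketch only: WordReader and describe are hypothetical names.
class WordReader {
    // Reads whitespace-separated words and records each with its position.
    static String describe(Scanner in) {
        StringBuilder sb = new StringBuilder();
        int position = 0; // assumption: the first word is at position 0
        while (in.hasNext()) {
            String word = in.next();
            // A real buildIndex would call index.addPosition(word, position).
            sb.append(word).append('@').append(position).append('\n');
            position++;
        }
        return sb.toString();
    }
}
```

To read from a file instead of a string, construct the scanner with new Scanner(new File(name)), which requires handling FileNotFoundException.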
SearchProcessor
The SearchProcessor class will open the file containing the query terms, process the queries, and write the results to a new file. It will have one method:
- processQueries(file, index) - This method will open the file and read in queries one at a time. It will use the index to look up the terms and will write the results to a new output file.
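Splitting a query line and producing the output format described earlier could be sketched as follows. The class name QueryFormat and both method names are hypothetical, an int array stands in for the LinkedList of positions, and the handling of a query with no matches (the words followed by a bare colon) is an assumption, since the assignment does not specify it.

```java
// Sketch only: QueryFormat, parse, and format are hypothetical names.
class QueryFormat {
    // Splits a query line into its one or two words.
    static String[] parse(String line) {
        return line.trim().split("\\s+");
    }

    // Builds one output line: the query words, a colon, then the positions.
    static String format(String[] words, int[] positions) {
        StringBuilder sb = new StringBuilder(String.join(" ", words));
        sb.append(':');
        for (int p : positions) {
            sb.append(' ').append(p);
        }
        return sb.toString();
    }
}
```

The length of the array returned by parse tells you whether to call the one-word or the two-word search.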
Implementation Hints
- Visit Project Gutenberg to download some sample large text files.
- Punctuation can be a pain. For example, if a word appears at the end of a sentence and is followed by a period (e.g., house.), it will not match the same word without the punctuation. I will not deduct points for projects that ignore punctuation. If you prefer, you can use the String replace method to strip standard punctuation.
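One way to strip punctuation, if you choose to handle it, is sketched below. It uses String.replaceAll with the \p{Punct} character class rather than a chain of replace calls. The class name Punctuation and the method name strip are hypothetical. Note that this simple approach also removes apostrophes and hyphens inside words.

```java
// Sketch only: Punctuation and strip are hypothetical names.
class Punctuation {
    // Removes every ASCII punctuation character from the word.
    static String strip(String word) {
        return word.replaceAll("\\p{Punct}", "");
    }
}
```

Apply the same cleanup to both the document words and the query words; otherwise a cleaned query will not match an uncleaned index entry, or vice versa.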
Due 9:40AM, Friday April 20, 2007
- Complete and submit your working code. Turn in a hard copy in class and place a copy of your .java files in /home/submit/cs112-s07/username.
- Make sure that each function is well documented. Your documentation should specify the type and function of the input parameters and output.
- Run your program on a variety of inputs, ensuring that all error conditions are handled correctly.
Note: No portion of your code may be copied from any other
source, including a textbook, a web page, or another
student (current or former). You must provide citations for any
sources you have used in designing and implementing your program.
Sami Rollins