CS 662: AI Programming
Assignment 3: Focused Crawling

Assigned: September 13
Due: September 25 (originally September 20)
30 points total.

What to turn in: hard copies of the source code, along with the table showing the accuracy of each scorer.

Also, please check your code into your subversion repository. Create a subdirectory called assigment3 for it. Everything necessary to run your code should be in this directory. If anything out of the ordinary is needed to run your code, please provide a README.

In this assignment, you will build a focused crawler. A focused crawler is a webcrawler that looks for pages that fit a particular description. We can implement this using our typical search paradigm. The pseudocode looks like this:
get a start page
while (not done)
   fetch that page and extract its outward links
   if page is "good enough", keep it
   enqueue each of the outward links according to an estimate of their quality
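
The "enqueue according to an estimate" step amounts to keeping a priority queue of URLs. The following is only a minimal sketch of that idea, not the code in the provided crawler.py: the helper names enqueue and dequeue, the example URLs, and the scores are all made up for illustration, and the use of heapq is an assumption.

```python
import heapq

def enqueue(queue, url, estimated_score):
    # heapq is a min-heap, so store the negated score to pop the best URL first.
    heapq.heappush(queue, (-estimated_score, url))

def dequeue(queue):
    # Return the URL with the highest estimated score, along with that score.
    neg_score, url = heapq.heappop(queue)
    return url, -neg_score

queue = []
enqueue(queue, "/wiki/Dog", 0.4)
enqueue(queue, "/wiki/Cat", 0.9)
enqueue(queue, "/wiki/Fish", 0.1)

# Pages come back out in order of decreasing estimated score.
order = [dequeue(queue)[0] for _ in range(3)]
```

Here order lists the Cat page first, then Dog, then Fish, regardless of the order in which they were enqueued.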

To make life simple for this assignment, I will be providing you with a great deal of the code that you need. In addition, we will not be crawling the entire Web. Instead, we will be working with a local copy of Wikipedia. (This means we don't need to worry about malformed pages, spider traps, servers timing out, or any number of other annoyances.)

NOTE: You are only to use the local copy of Wikipedia installed on scorpio.cs.usfca.edu for this assignment. In particular, do NOT crawl Wikipedia's main site (www.wikipedia.org). This is a violation of their terms of use, and could result in other users at USF being blocked from using Wikipedia. Anyone found to be crawling wikipedia.org (or any other non-local mirror) will receive an automatic zero for this assignment.

NOTE ALSO: This assignment may take you some time to complete. In particular, fetching several hundred pages from scorpio may take a few minutes (running locally) or longer (if running from a slow connection at home).

I have provided you with three classes that do the bulk of the work; your job will be to fill in the gaps.

To begin, wikipage.py contains a class for storing wikipedia pages. It's almost complete, but needs a little work from you.

(5 points) Fix wikipage.py to remove stopwords and non-words (anything containing punctuation). You MUST use a list comprehension to do this. The goal here is to get rid of the 'noise' in the document: words that don't help us determine its content.
(5 points) Currently, wikipage.py stores all hrefs in the outwardLinks variable. Use one or more (you only need one) list comprehensions to remove: external links to sites other than scorpio, navigation links (they begin with '#'), and links to edit pages (they contain the string "action=edit"). The goal here is to keep only links to content-containing Wikipedia pages.
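
To illustrate the shape of these two filters, here is a rough sketch written for Python 3. It is not the wikipage.py code: the function names clean_words and content_links, the tiny STOPWORDS set, and the example hrefs are all hypothetical, and treating any href with an empty host as local is an assumption.

```python
import string
from urllib.parse import urlparse

# Hypothetical stopword list; the real list used in the assignment may differ.
STOPWORDS = {"the", "a", "of", "and", "is"}

def clean_words(words):
    # Keep a word only if it is not a stopword and contains no punctuation.
    return [w for w in words
            if w.lower() not in STOPWORDS
            and not any(ch in string.punctuation for ch in w)]

def content_links(hrefs, host="scorpio.cs.usfca.edu"):
    # Drop navigation links, edit links, and links to any other host.
    # Relative links have an empty host and are assumed to be local.
    return [h for h in hrefs
            if not h.startswith("#")
            and "action=edit" not in h
            and urlparse(h).netloc in ("", host)]

words = clean_words(["The", "cat", "sat.", "of", "don't", "mat"])
links = content_links([
    "#top",
    "/wiki/Cat",
    "/w/index.php?title=Cat&action=edit",
    "http://www.example.com/",
    "http://scorpio.cs.usfca.edu/wiki/Dog",
])
```

Each filter is a single list comprehension, which is all the assignment asks for; the conditions are simply chained with "and".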

Next, crawler.py contains an almost-complete focused crawler. Once you've fixed wikipage, you can try it out like this:
import crawler
import scorer
c = crawler.crawler()
s = scorer.randomScorer()
c.crawl(10, 0.1, s)
The scorer module is in scorer.py. More on that in a minute.
crawl() takes six arguments, all of which have defaults: nGoalpages (the number of pages to find), nTotalPages (the total number of pages to crawl), threshold (how good a page's score must be for it to be kept), fractionLinksUsed (what fraction of a page's links should be followed), scorer (an object that scores a page), and startingURL (where to start the crawl).
fractionLinksUsed is used to provide some control over the depth of the search. Many wikipedia pages contain lots of links, and by exploring every one of them, you may not be able to quickly search very far away from your starting page. This gives some control over that.

(5 points) One thing the crawler is missing is a closedList. Add a closed list (implemented using a dictionary) to the crawler so that URLs that have already been added to the queue are not re-added.
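
As a rough illustration of the idea (not the structure of the provided crawler.py), a closed list can be a dictionary keyed by URL that is checked before anything is enqueued. The function name enqueue_new_links, the plain-list queue, and the example URLs are assumptions made for this sketch.

```python
def enqueue_new_links(links, queue, closed_list):
    """Append only those links that have never been queued before."""
    for url in links:
        if url not in closed_list:
            # Record the URL as soon as it is queued, not when it is fetched,
            # so that it cannot be queued a second time in the meantime.
            closed_list[url] = True
            queue.append(url)

queue = []
closed_list = {}
enqueue_new_links(["/wiki/Cat", "/wiki/Dog"], queue, closed_list)
enqueue_new_links(["/wiki/Dog", "/wiki/Bird"], queue, closed_list)
```

After both calls the queue holds Cat, Dog, and Bird exactly once each. Dictionary membership tests take roughly constant time, which matters once the crawl has seen thousands of URLs.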

At this point, we have a crawler, but it's not very clever. The reason for this is that we need a scorer. A scorer is an object that can tell us how "good" a page is. We'll build two scorers for this assignment: one that uses keywords, and one that uses a vector model.

scorer.py contains an abstract base class called scorer, and a sample subclass called randomScorer. Each of these has one method, called score() that takes one input - a wikipage.

(5 points) To begin, we'll implement a KeywordScorer. The KeywordScorer takes a keyword query as input and sets the page's score to a value between 0 and 1.
keywordScorer should have an __init__ method that takes one argument: a search string. We will represent a keyword query as a string of the following form:

"word1 word2 (word3 word4 word5) word6 word7 (word8 word9) ..."

Where words inside parentheses are treated as ORed together, and all other words are ANDed. So, if we have the query

"cat dog bird (bunny snake) fish (horse lion)"

we would like to find documents that contain the words cat, dog, bird and fish, either bunny or snake (or both), and either horse or lion (or both).

To score a document, count the number of ANDed words that are contained within the document, along with the number of ORed clauses that have any word in the document. To normalize this, we then divide by the length of the query, which is the number of ANDed terms plus the number of ORed clauses.

For example, if we used the above query on a document containing the words 'cat cat bunny snake horse fish', we would have a score of 4 / 6: two of the four ANDed words (cat and fish) are present, and both ORed clauses are satisfied. Notice that multiple occurrences of a word count as a single match. Your scorer should always return a value between 0 and 1.
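
The scoring rule can be checked against this example with a short sketch. This is one possible way to parse and score a query, not a required design: the function names parse_query and keyword_score are hypothetical, and the document is assumed to arrive as a list of already-cleaned words.

```python
import re

def parse_query(query):
    """Split a query into ANDed words and ORed clauses (lists of words)."""
    or_clauses = [group.split() for group in re.findall(r"\(([^)]*)\)", query)]
    # Whatever remains after removing the parenthesized groups is ANDed.
    and_words = re.sub(r"\([^)]*\)", " ", query).split()
    return and_words, or_clauses

def keyword_score(query, document_words):
    and_words, or_clauses = parse_query(query)
    present = set(document_words)  # repeated words count only once
    matches = sum(1 for w in and_words if w in present)
    matches += sum(1 for clause in or_clauses
                   if any(w in present for w in clause))
    length = len(and_words) + len(or_clauses)
    return matches / float(length) if length else 0.0

query = "cat dog bird (bunny snake) fish (horse lion)"
score = keyword_score(query, "cat cat bunny snake horse fish".split())
```

For the query and document from the example, this yields 4 matches out of a query length of 6.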

Extend your main to add the option to use the keyword scorer:
python ./crawler.py -k "cat dog (fish bunny) frog"

Run your crawler on several different search queries. Try to fetch at least 10 matching pages. You will probably do best starting from specific pages that have some similarity to your queries.

(5 points) Vector modeling.

Our KeywordScorer has a few problems. For one thing, it doesn't take word frequency into account. If I do a keyword search with the query 'aardvark blue', documents containing one of these words are all scored equally, even though aardvark is a much less common word than blue.

Also, the KeywordScorer provides a particular mode of interaction between the user and the agent. For some tasks, using keywords may work fine, but in other cases, they may be cumbersome for users. For example, users often have a difficult time building complex queries: the average Google query is 2 or 3 words long. Also, a user may have a hard time coming up with keywords that specifically describe her interests. Instead, she might want to say "find more documents like these." To address this, we'll provide the FocusedCrawler with a set of pages (which we'll call a similarity set ) that represents the topics the user is interested in. The FocusedCrawler will use the similarity set to find other pages that are similar.

For this part, you'll implement a class called TFIDFScorer. It should derive from scorer and have two methods, __init__() and score(). __init__ should take as input a list of URLs which represent the similarity set. It should then use the wikipage class to fetch each of these. You should then combine all of the pages in the similarity set into a single wikipage - this will make them easier to work with.

Term Frequency

The first thing you'll need to add to wikipage is a dictionary that maps each word in a document to the number of times it occurs. (You probably already have this code from assignment 1). We will call this dictionary the term frequencies. Our intuition will be that words that occur frequently in the similarity set are useful words to look for.
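
A term-frequency dictionary can be built in a few lines. This sketch assumes the page's words are available as a list with stopwords already removed; the function name term_frequencies is made up for illustration.

```python
def term_frequencies(words):
    """Map each word to the number of times it occurs in the list."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

tf = term_frequencies("cat dog cat bird cat".split())
```

Here tf maps 'cat' to 3 and each of 'dog' and 'bird' to 1.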

But we don't want just any frequently-occurring word; we want words that occur frequently in the similarity set, but less frequently in general. For example, 'aardvark' appearing 10 times in the similarity set seems more relevant than 'date' appearing 10 times.

Document Frequency

To address this, we'll count how frequently words occur in a random set of documents. We'll refer to the frequency of words in documents in general as the document frequency.

To compute the document frequency, you will need to select 500 randomly-chosen Wikipedia pages. There is a method in crawler.py that will do this for you. (Note: this takes 10-15 minutes. Do not wait until the last minute to do this!) We will refer to this body of text as a corpus. It's meant to provide a statistically significant estimate of the actual frequencies of words in English.

You will need to extend getRandomPages to do the following: for each word in all of the pages, count the number of documents it appears in. This is the document frequency of the word. For example, if 'cat' occurs in 3 documents, it has a document frequency of 3 (multiple occurrences in a document do not count extra).
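
The counting rule can be sketched as follows. This is not the body of getRandomPages; it assumes each page has already been reduced to a list of words, and the function name document_frequencies is hypothetical.

```python
def document_frequencies(documents):
    """Map each word to the number of documents that contain it."""
    df = {}
    for words in documents:
        # set() ensures a word counts at most once per document.
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    return df

corpus = [
    "cat cat dog".split(),
    "cat bird".split(),
    "fish cat cat cat".split(),
]
df = document_frequencies(corpus)
```

In this toy corpus 'cat' has a document frequency of 3, even though it occurs six times in total.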


Term frequency and document frequency allow us to assign a score to each word in the similarity set indicating its relative strength in characterizing the documents in the set. We compute this score using a formula known as TFIDF (term frequency * inverse document frequency).

Each word in the similarity set should be assigned the following score:

termFrequency(word) * log(sizeOfCorpus / documentFrequency(word))

(If a word appears in the similarity set but not in the corpus, you should use 1 for its document frequency)
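
To see how the formula plays out, here is a small sketch. The function name tfidf_vector and the counts below are invented for illustration; only the formula and the "use 1 if missing" rule come from the text above.

```python
import math

def tfidf_vector(term_freqs, doc_freqs, corpus_size):
    """Weight each word by termFrequency * log(corpusSize / documentFrequency)."""
    return {w: tf * math.log(corpus_size / float(doc_freqs.get(w, 1)))
            for w, tf in term_freqs.items()}

# Made-up counts: both words occur 10 times in the similarity set,
# but 'date' appears in far more corpus documents than 'aardvark'.
term_freqs = {"aardvark": 10, "date": 10, "zyzzyva": 1}
doc_freqs = {"aardvark": 2, "date": 400}
weights = tfidf_vector(term_freqs, doc_freqs, 500)
```

With these counts 'aardvark' ends up with a much larger weight than 'date', and 'zyzzyva', which is missing from the corpus, is weighted as if its document frequency were 1.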

Documents as Vectors

Why are we doing this? It allows us to construct a model of a text document as an n-dimensional vector, where n is the number of words in the corpus, and the value of the ith element in the vector is the ith word's TFIDF score with respect to that document.

We can now construct a vector that represents the TFIDF scores of all words in the similarity set.

Scoring a page

Your TFIDFScorer should have a method called score() that takes a wikipage as input. This is the page that our crawler is exploring; we'll call it the query page.

Your scorer should construct a vector representing the query page in the same way the vector was constructed for the similarity set: remove stop words etc., count term frequencies for each word, and compute TFIDF using the document frequencies from the corpus.

At this point, we have two vectors to compare. To do this, we will measure the angle between the vectors. Documents that are identical will have identical vectors, and an angle of 0. Documents that are completely dissimilar will have an angle of 90 degrees (or pi/2, in radians). In fact, rather than measuring the angle itself, we'll measure the cosine of the angle. This is easier to compute, and also has the advantage of being in the range [0, 1], like our other scorer. An angle of 0 degrees has a cosine of 1, and an angle of 90 degrees has a cosine of 0. (Higher cosine == more similar.)

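To compute the cosine of the two vectors, use the following formula:

cosine(q, s) = (sum over words w of q[w] * s[w]) / (sqrt(sum over w of q[w]^2) * sqrt(sum over w of s[w]^2))

where q is the query page's vector and s is the similarity set's vector.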

In words, the numerator tells us how similar the documents are. For each word in the query page's vector, multiply its weight by its weight in the similarity set's vector. (If a word is absent from either vector, its weight is 0.) Sum up all of these products.

The denominator normalizes this score by dividing by the product of the lengths of the two vectors. (This is Pythagoras' rule.) For each word in the similarity vector, square its weight. Add up these squares and take the square root of the sum; this is the length of the similarity vector. Do the same for the query vector. The product of those two square roots is the denominator.

This cosine is the page's score. You're done! Your TFIDFScorer should return this so that your focused crawler can use it to enqueue pages.
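
The cosine computation described above might look roughly like this, with vectors stored as dictionaries from word to weight. The function name cosine and the choice to return 0.0 for an empty vector are assumptions of this sketch, not requirements of the assignment.

```python
import math

def cosine(query_vec, similarity_vec):
    # Numerator: sum of products of weights; absent words contribute 0.
    dot = sum(weight * similarity_vec.get(word, 0.0)
              for word, weight in query_vec.items())
    # Denominator: product of the two vector lengths.
    query_len = math.sqrt(sum(w * w for w in query_vec.values()))
    sim_len = math.sqrt(sum(w * w for w in similarity_vec.values()))
    if query_len == 0 or sim_len == 0:
        return 0.0  # assumed convention: an empty page matches nothing
    return dot / (query_len * sim_len)

same = cosine({"cat": 1.0, "dog": 2.0}, {"cat": 1.0, "dog": 2.0})
disjoint = cosine({"cat": 1.0}, {"dog": 1.0})
partial = cosine({"cat": 3.0, "dog": 4.0}, {"cat": 3.0})
```

Identical vectors give a cosine of 1 (up to rounding), vectors with no words in common give 0, and the partial overlap above gives 9 / (5 * 3) = 0.6.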

Modify your main to take as input a file that contains a list of URLs that comprise the similarity set.
python ./crawler.py -t similaritySetFile
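
One way to handle the -k and -t options is with the standard argparse module. This is only a sketch: the names build_parser and read_url_list are hypothetical, and the "one URL per line" format for the similarity set file is an assumption, since the file format is not specified here.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Focused crawler")
    # A run uses either the keyword scorer or the TFIDF scorer, not both.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-k", dest="keyword_query",
                       help="keyword query string for the keyword scorer")
    group.add_argument("-t", dest="similarity_file",
                       help="file listing the URLs of the similarity set")
    return parser

def read_url_list(lines):
    # Assumed format: one URL per line; blank lines are ignored.
    return [line.strip() for line in lines if line.strip()]

args = build_parser().parse_args(["-k", "cat dog (fish bunny) frog"])
urls = read_url_list(["http://scorpio.cs.usfca.edu/a\n", "\n",
                      " http://scorpio.cs.usfca.edu/b \n"])
```

In a real main you would call parse_args() with no arguments so that it reads sys.argv, and pass an open file object to read_url_list.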

Run your focused crawler with the TFIDFScorer. You will have better performance if you start with a well-focused similarity set, and begin the crawl at a page somewhat "close" to the similarity set.

(5 points) Finally, run each scorer on a larger crawl. Choose an interesting query for each one and collect 50 pages that match the query (you may choose a degree of matching that you think makes sense). Examine each of the resulting pages by hand and determine how many of them you would consider to actually match the query. (In other words, if you put this query into Google and got this page back, you'd call it a success.) The number of successes divided by the total number of pages is the accuracy. Compute the accuracy for each scorer and prepare a table with this information.


List comprehensions are your friend. They make it very easy to (for example) compute a list of weights for a long list of words.

Vectors should be stored in dictionaries, as should frequency counts.

You only need to compute the document frequencies once. Don't do it every time you run your crawler (it takes a while). Compute them once and write the object out to a file. Next time you run your crawler, read it back in. To do this, use the pickle module. dump() and load() are the methods you'll be most interested in. (Just read it in once each time you run the crawler)
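
A minimal round trip with pickle might look like the following. The function names, the file name docfreqs.pickle, and the tiny dictionary are placeholders; the sketch writes to a temporary directory so it can be run anywhere.

```python
import os
import pickle
import tempfile

def save_frequencies(doc_freqs, path):
    """Write the document-frequency dictionary to disk."""
    with open(path, "wb") as f:
        pickle.dump(doc_freqs, f)

def load_frequencies(path):
    """Read the dictionary back; do this once per crawler run."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip through a temporary file (the file name is hypothetical).
path = os.path.join(tempfile.mkdtemp(), "docfreqs.pickle")
save_frequencies({"cat": 3, "dog": 1}, path)
restored = load_frequencies(path)
```

Opening the file in binary mode ("wb" and "rb") works in every Python version and is required in Python 3.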

sqrt and log are in the math module, which must be imported. reduce() is a very nice way to apply a function to a list of numbers and get a scalar back. It works very nicely in conjunction with list comprehensions. For example:

>>> from functools import reduce   # required in Python 3; harmless in Python 2.6+
>>> l = [1,2,3,4,5]
>>> def add(x,y) :
... 	return x + y
...
>>> reduce(add, l)
15
If all you need to do is add up the elements of a list, you can also use sum().

If you want to know more about TFIDF, and vector models for comparing documents in general, you may be interested in the following documents:

(these are not required reading; I'm just providing them in case you want more information)