CS 662: AI Programming
Assignment 3: Focused Crawling
Assigned: September 13
Due: September 25 (originally September 20)
30 points total.
Turn in hard copies of your source code, along with the table showing
the accuracy of each scorer.
Also, please check your code into your subversion repository. Create a
subdirectory called assigment3 for it. Everything necessary to run
your code should be in this directory. If anything out of the
ordinary is needed to run your code, please provide a README.
In this assignment, you will build a focused crawler. A focused
crawler is a webcrawler that looks for pages that fit a particular
description. We can implement this using our typical search
paradigm. The pseudocode looks like this:
get a start page
while (not done)
    fetch that page and extract its outward links
    if page is "good enough", keep it
    enqueue each of the outward links according to an estimate of
    their value
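As a rough illustration of the pseudocode above, here is a minimal sketch of a best-first crawl over a toy in-memory "web". It is not the provided crawler.py: the function name focused_crawl, its parameters, and the toy data are all hypothetical, and the "fetch" step is just a dictionary lookup.

```python
import heapq

def focused_crawl(start, get_links, score, threshold, max_pages):
    """Best-first crawl: always expand the highest-scoring queued page."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(start), start)]
    seen = {start}
    kept = []
    fetched = 0
    while frontier and fetched < max_pages:
        neg_score, page = heapq.heappop(frontier)
        fetched += 1
        if -neg_score >= threshold:       # page is "good enough", keep it
            kept.append(page)
        for link in get_links(page):      # enqueue links by estimated value
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return kept

# Toy "web" (hypothetical data): page name -> outward links, and page scores.
toy_web = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
toy_scores = {"A": 0.2, "B": 0.9, "C": 0.4, "D": 0.7}

kept_pages = focused_crawl("A", toy_web.__getitem__, toy_scores.__getitem__,
                           threshold=0.5, max_pages=10)
print(kept_pages)  # ['B', 'D']
```

Page B is expanded before C because it scores higher, which is what makes the crawl "focused" rather than breadth-first.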
To make life simple for this assignment, I will be providing you with
a great deal of the code that you need. In addition, we will not be
crawling the entire Web. Instead, we will be working with a local copy
of Wikipedia. (This means we don't need to worry about malformed
pages, spider traps, servers timing out, or any number of other
annoyances.)
NOTE: you are only to use the local copy of Wikipedia installed on
scorpio.cs.usfca.edu for this assignment. In particular, do NOT crawl
Wikipedia's main site (www.wikipedia.org). This is a violation of
their terms of use, and could result in other users at USF being
blocked from using Wikipedia. Anyone found to be crawling
wikipedia.org (or any other non-local mirror) will receive an
automatic zero for this assignment.
NOTE ALSO: this assignment may take you some time to complete. In
particular, fetching several hundred pages from scorpio may take a few
minutes (running locally) or longer (if running from a slow connection
at home). Some advice:
- Test your code carefully before doing a large crawl.
- Run your code locally on the lab machines if at all
possible. This will reduce network latency.
- Learn how to set up a crawl to run in the background.
- DO NOT wait until the last minute to do this. "scorpio was
really slow because everyone was pounding it with requests" will not
be accepted as a reason for turning in your code late.
I have provided you with three classes that do the bulk of the work;
your job will be to fill in the gaps.
To begin, wikipage.py contains a
class for storing Wikipedia pages. It's almost complete, but needs a
little work from you.
(5 points) Fix wikipage.py to remove stopwords and non-words
(anything containing punctuation). You MUST use a list comprehension
to do this. The goal here is to get rid of the 'noise' in the
document: words that don't help us determine its content.
(5 points) Currently, wikipage.py stores all hrefs in the
outwardLinks variable. Use one or more (you only need one) list
comprehensions to remove: external links to sites other than
scorpio, navigation links (they begin with '#'), and links to edit
pages (they contain the string "action=edit"). The goal here is
to keep only links to content-containing Wikipedia pages.
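Both wikipage.py fixes above come down to a filtering list comprehension. The sketch below shows the general shape only. The names clean_words, content_links, STOPWORDS, and LOCAL_PREFIX are hypothetical and do not come from the provided code; the stopword list is a tiny stand-in; and it assumes links are either absolute URLs or site-relative paths.

```python
import string

# Hypothetical stopword list and host prefix, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in"}
LOCAL_PREFIX = "http://scorpio.cs.usfca.edu/"

def clean_words(words):
    """Drop stopwords and any token that contains punctuation."""
    return [w.lower() for w in words
            if w.lower() not in STOPWORDS
            and not any(ch in string.punctuation for ch in w)]

def content_links(hrefs):
    """Keep only links to content pages on the local mirror."""
    return [h for h in hrefs
            if not h.startswith("#")                 # navigation link
            and "action=edit" not in h               # edit page
            and (h.startswith(LOCAL_PREFIX) or "://" not in h)]  # local or site-relative

words = clean_words(["The", "aardvark", "is", "an", "animal", "e.g.", "ant-eater"])
links = content_links(["#top", "/wiki/Cat", LOCAL_PREFIX + "wiki/Dog",
                       "http://example.com/", "/index.php?title=Cat&action=edit"])
print(words)  # ['aardvark', 'animal']
print(links)  # ['/wiki/Cat', 'http://scorpio.cs.usfca.edu/wiki/Dog']
```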
Next, crawler.py contains an
almost-complete focused crawler. Once you've fixed wikipage, you can
try it out like this:
import crawler
import scorer
c = crawler.crawler()
s = scorer.randomScorer()
c.crawl(10, 0.1, s)
The scorer module is in scorer.py. More on that in a minute.
crawl() takes six arguments, all of which have defaults: nGoalpages
(the number of pages to find), nTotalPages (the total number of
pages to crawl), threshold (how good a page's score
must be for it to be kept), fractionLinksUsed (what fraction of a
page's links should be followed), scorer (an object that scores a page),
and startingURL (where to start the crawl).
fractionLinksUsed is used to provide some control over the depth of
the search. Many Wikipedia pages contain lots of links, and by
exploring every one of them, you may not be able to quickly search
very far away from your starting page. This gives some control over
that.
(5 points) One thing the crawler is missing is a closedList. Add
a closed list (implemented using a dictionary) to the crawler so
that URLs that have already been added to the queue are not
re-added.
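The closed-list idea can be sketched independently of the crawler. In this minimal example the function enqueue_new and the plain list used as a queue are hypothetical; the real crawler's queue and method names may differ.

```python
def enqueue_new(queue, closed_list, urls):
    """Append to queue only the URLs that have never been queued before."""
    for url in urls:
        if url not in closed_list:     # dictionary lookup, constant time on average
            closed_list[url] = True    # remember that this URL has been queued
            queue.append(url)

queue, closed_list = [], {}
enqueue_new(queue, closed_list, ["/wiki/Cat", "/wiki/Dog"])
enqueue_new(queue, closed_list, ["/wiki/Dog", "/wiki/Bird"])
print(queue)  # ['/wiki/Cat', '/wiki/Dog', '/wiki/Bird']
```

Note that a URL is marked when it is enqueued, not when it is fetched, so it cannot be queued twice while it is still waiting.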
At this point, we have a crawler, but it's not very clever. The reason
for this is that we need a scorer. A scorer is an object that can
tell us how "good" a page is. We'll build two scorers for this
assignment: one that uses keywords, and one that uses a vector
model.
scorer.py contains an abstract base
class called scorer, and a sample subclass called randomScorer. Each
of these has one method, called score() that takes one input - a
wikipage.
(5 points) To begin, we'll implement a KeywordScorer. The
KeywordScorer will be designed to take as input a keyword query and
set the page's score to a value between 0 and 1.
KeywordScorer should have an __init__ method that takes one argument:
a search string. We will represent a keyword query as a string of the
following form:
"word1 word2 (word3 word4 word5) word6 word7 (word8 word9) ..."
Where words inside parentheses are treated as ORed together, and all
other words are ANDed. So, if we have the query
"cat dog bird (bunny snake) fish (horse lion)"
we would like to find documents that contain the words cat, dog, bird
and fish, either bunny or snake (or both), and either horse or lion
(or both).
To score a document, count the number of ANDed words that are
contained within the document, along with the number of ORed clauses
that have any word in the document. To normalize this, we then divide
by the length of the query, which is the number of ANDed terms plus
the number of ORed clauses.
For example, if we used the above query on a document containing the
words 'cat cat bunny snake horse fish' we would have a score of 4 / 6:
cat and fish match as ANDed words, and both ORed clauses are satisfied.
Notice that multiple occurrences of a word count as a single
match. Your scorer should always return a value between 0 and 1.
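The scoring rule above can be checked with a small standalone sketch. This is one possible way to parse and score a query, not the required design: parse_query and keyword_score are hypothetical names, the document is passed as a plain list of words rather than a wikipage, and the query is assumed to have balanced, non-nested parentheses.

```python
import re

def parse_query(query):
    """Split a query into ANDed words and ORed clauses (lists of words)."""
    or_clauses = [clause.split() for clause in re.findall(r"\(([^)]*)\)", query)]
    and_words = re.sub(r"\([^)]*\)", " ", query).split()
    return and_words, or_clauses

def keyword_score(query, doc_words):
    """Fraction of ANDed words and ORed clauses matched by the document."""
    and_words, or_clauses = parse_query(query)
    present = set(doc_words)              # repeated words count only once
    hits = sum(1 for w in and_words if w in present)
    hits += sum(1 for clause in or_clauses if any(w in present for w in clause))
    total = len(and_words) + len(or_clauses)
    return float(hits) / total if total else 0.0

example_score = keyword_score("cat dog bird (bunny snake) fish (horse lion)",
                              "cat cat bunny snake horse fish".split())
print(example_score)  # 4 / 6, about 0.667
```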
Extend your main to add the option to use the keyword scorer:
python ./crawler.py -k "cat dog (fish bunny) frog"
Run your crawler on several different search queries. Try to fetch at
least 10 matching pages. You will probably do best starting from
specific pages that have some similarity to your queries.
(5 points) Vector modeling.
Our KeywordScorer has a few problems. For one thing, it doesn't
take word frequency into account. If I do a keyword search with the
query 'aardvark blue', documents containing one of these words are
all scored equally, even though aardvark is a much less common word
than blue.
Also, the KeywordScorer provides a particular mode of
interaction between the user and the agent. For some tasks, using
keywords may work fine, but in other cases, they may be cumbersome for
users. For example, users often have a difficult time building complex
queries: the average Google query is 2 or 3 words long. Also, a user
may have a hard time coming up with keywords that specifically
describe her interests. Instead, she might want to say "find more
documents like these." To address this, we'll provide
the FocusedCrawler with a set of pages (which we'll call a
similarity set) that represents the topics the user is interested
in. The FocusedCrawler will use the similarity set to find other pages
that are similar.
For this part, you'll implement a class called TFIDFScorer. It should
derive from scorer and have two methods, __init__() and
score(). __init__ should take as input a list of URLs which represent
the similarity set. It should then use the wikipage class to fetch
each of these. You should then combine all of the pages in the
similarity set into a single wikipage - this will make them easier to
work with.
Term Frequency
The first thing you'll need to add to wikipage is a dictionary that
maps each word in a document to the number of times it occurs. (You
probably already have this code from assignment 1). We will call this
dictionary the term frequencies. Our intuition will be that
words that occur frequently in the similarity set are useful words to
look for.
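A term-frequency dictionary is just a word count. As a minimal sketch (term_frequencies is a hypothetical helper name, and the input is assumed to be a list of already-cleaned words):

```python
from collections import Counter

def term_frequencies(words):
    """Map each word to the number of times it occurs in the word list."""
    return dict(Counter(words))

tf = term_frequencies(["aardvark", "ant", "aardvark", "burrow"])
print(tf["aardvark"])  # 2
```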
But we don't want just any frequently-occurring word; we want words that
occur frequently in the similarity set, but less frequently in
general. For example, 'aardvark' appearing 10 times in the similarity
set seems more relevant than 'date' appearing 10 times.
Document Frequency
To address this, we'll count how frequently words occur in a random
set of documents. We'll refer to the frequency of words in documents
in general as the document frequency.
To compute the document frequency, you will need to select 500
randomly-chosen Wikipedia pages. There is a method in crawler.py that
will do this for you. (Note: this takes 10-15 minutes. Do not wait
until the last minute to do this!) We will refer to this body of text
as a corpus. It's meant to be a statistically significant
estimate of the actual frequencies of words in English.
You will need to extend getRandomPages to do the following: for each
word in all of the pages, count the number of
documents it appears in. This is the document frequency of the
word. For example, if 'cat' occurs in 3 documents, it has a document
frequency of 3 (multiple occurrences in a document do not count extra).
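The counting rule above can be sketched as follows. This is an illustration only: document_frequencies is a hypothetical function, and each document is assumed to be a list of cleaned words rather than a wikipage object.

```python
def document_frequencies(documents):
    """Map each word to the number of documents that contain it at least once."""
    df = {}
    for words in documents:
        for word in set(words):        # set(): repeats within a document count once
            df[word] = df.get(word, 0) + 1
    return df

toy_corpus = [["cat", "cat", "dog"], ["cat", "fish"], ["bird", "cat"], ["dog"]]
df = document_frequencies(toy_corpus)
print(df["cat"])  # 3
```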
TFIDF
Term frequency and document frequency allow us to assign a score to
each word in the similarity set indicating its relative strength in
characterizing the documents in the set. We compute this score using a
formula known as TFIDF (term frequency * inverse document frequency).
Each word in the similarity set should be assigned the following
score:
termFrequency(word) * log(sizeOfCorpus / documentFrequency(word))
(If a word appears in the similarity set but not in the corpus, you
should use 1 for its document frequency)
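To make the formula concrete, here is a minimal sketch that applies it to every word in a term-frequency dictionary. The name tfidf_vector and the toy numbers are hypothetical; the natural logarithm is assumed, since the text does not specify a base.

```python
import math

def tfidf_vector(term_freqs, doc_freqs, corpus_size):
    """Weight each word by termFrequency * log(sizeOfCorpus / documentFrequency)."""
    return {word: tf * math.log(float(corpus_size) / doc_freqs.get(word, 1))
            for word, tf in term_freqs.items()}   # unseen words get document frequency 1

weights = tfidf_vector({"aardvark": 10, "date": 10, "zyzzyva": 1},
                       {"aardvark": 2, "date": 250}, corpus_size=500)
print(weights["aardvark"] > weights["date"])  # True
```

With equal term frequencies, the rarer word ('aardvark', in 2 of 500 documents) gets a much larger weight than the common one ('date', in 250 of 500).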
Documents as Vectors
Why are we doing this? It allows us to construct a model of a text
document as an n-dimensional vector, where n is the
number of words in the corpus, and the value of the ith element
in the vector is the ith word's TFIDF score with respect to that
document.
We can now construct a vector that represents the TFIDF scores of all
words in the similarity set.
Scoring a page
Your TFIDFScorer should have a method called score() that takes a
wikipage as input. This is the page that our crawler is exploring;
we'll call it the query page.
Your scorer should construct a vector representing the query page in
the same way the vector was constructed for the similarity set: remove
stop words etc., count term frequencies for each word, and compute
TFIDF using the document frequencies from the corpus.
At this point, we have two vectors to compare. To do this, we will
measure the angle between the vectors. Documents that are identical
will have identical vectors, and an angle of 0. Documents that are
completely dissimilar will have an angle of 90 degrees (or pi/2, in
radians). In fact, rather than measuring the angle itself, we'll
measure the cosine of the angle. This is easier to compute, and also
has the advantage of being in the range [0,1], like our other scorer.
An angle of 0 degrees has a cosine of 1, and an angle of 90 degrees has
a cosine of 0. (Higher cosine == more similar.)
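To compute the cosine of the two vectors, use the following formula:
cosine(q, s) = sum(q[w] * s[w]) / (sqrt(sum(q[w]^2)) * sqrt(sum(s[w]^2)))
where q is the query page's vector, s is the similarity set's vector,
and each sum runs over the words w.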
In words, the numerator tells us how similar the documents are. For
each word in the query page's vector, multiply its weight by its weight in
the similarity set's vector. (If a word is absent from either vector, its
weight is 0.) Sum up all of these products.
The denominator normalizes this score by dividing by the lengths of
the two vectors. (This is Pythagoras' rule.) For each word in the
similarity vector, square its weight. Add up these squares and take the
square root of the sum; this is the similarity vector's length. Do the
same for the query vector. The denominator is the product of those two
lengths.
This cosine is the page's score. You're done! Your TFIDFScorer should
return this so that your focused crawler can use it to enqueue pages.
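The numerator and denominator described above can be sketched for vectors stored as dictionaries. The function name cosine is hypothetical, and returning 0 for an empty vector is an assumption, since the text does not say how to handle that case.

```python
import math

def cosine(query_vec, sim_vec):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Numerator: sum of products; a word missing from sim_vec has weight 0.
    dot = sum(w * sim_vec.get(word, 0.0) for word, w in query_vec.items())
    # Denominator: product of the two vector lengths.
    q_len = math.sqrt(sum(w * w for w in query_vec.values()))
    s_len = math.sqrt(sum(w * w for w in sim_vec.values()))
    if q_len == 0 or s_len == 0:
        return 0.0                     # assumption: an empty vector matches nothing
    return dot / (q_len * s_len)

print(cosine({"a": 3.0, "b": 4.0}, {"a": 3.0}))  # 9 / (5 * 3), about 0.6
```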
Modify your main to take as input a file that contains a list of URLs that comprise the similarity set.
python ./crawler.py -t similaritySetFile
Run your focused crawler with the TFIDFScorer. You will have better
performance if you start with a well-focused similarity set, and start
the crawl somewhat "close" to the similarity set.
(5 points) Finally, run each scorer on a larger crawl. Choose
an interesting query for each one and collect 50 pages that match the
query (you may choose a degree of matching that you think makes
sense). Examine each of the resulting pages by hand and determine how
many of them you would consider to actually match the query. (In other
words, if you put this query into Google and got this page back, you'd
call it a success.) The number of successes divided by the total
number of pages is the accuracy. Compute the accuracy for each
scorer and prepare a table with this information.
Hints:
List comprehensions are your friend. They make it very easy to (for example)
compute a list of weights for a long list of words.
Vectors should be stored in dictionaries, as should frequency counts.
You only need to compute the document frequencies once. Don't
do it every time you run your crawler (it takes a while). Compute
them once and write the object out to a file. Next time you run your
crawler, read it back in. To do this, use the pickle module. dump()
and load() are the methods you'll be most interested in. (Just read it
in once each time you run the crawler)
sqrt and log are in the math module, which must be imported.
reduce() is a very nice way to apply a function to a list of numbers and get
a scalar back. It works very nicely in conjunction with list
comprehensions. For example:
>>> from functools import reduce   # required in Python 3
>>> l = [1,2,3,4,5]
>>> def add(x,y) :
...     return x + y
...
>>> reduce(add, l)
15
If all you need to do is add up the elements of a list, you can also use sum().
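The pickle hint above can be sketched as a save/load round trip. The helper names and the file name docfreq.pickle are hypothetical, and a temporary directory is used here only so the example does not clutter your working directory.

```python
import os
import pickle
import tempfile

def save_frequencies(doc_freqs, path):
    """Write the document-frequency dictionary to disk."""
    with open(path, "wb") as f:        # pickle files are opened in binary mode
        pickle.dump(doc_freqs, f)

def load_frequencies(path):
    """Read the document-frequency dictionary back in."""
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "docfreq.pickle")
save_frequencies({"cat": 3, "dog": 2}, path)
restored = load_frequencies(path)
print(restored == {"cat": 3, "dog": 2})  # True
```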
If you want to know more about TFIDF, and vector models for comparing
documents in general, you may be interested in the following
documents:
(these are not required reading; I'm just providing them in case you
want more information)