CS 662: AI Programming
Assigned: September 12
Assignment 3: Search
Due: September 21. (note - date extended!)
60 points total.
What to turn in: written answers for question 1, plus hard copies of
the source code, plus typed answers to the two questions marked
"Answer this question!" below.
Also, please put a copy of your code in the submit directory for this
class: /home/submit/cs662/(yourname)/assignment3. Everything necessary
to run your code should be in this directory. If anything out of the
ordinary is needed to run your code, please provide a README.
Question 1. Search. (20 points total, 5 points each)
Trace the execution of BFS, DFS, greedy search, and A* search on the
Romania graph, starting at Fagaras and ending at Drobeta. You may skip
cities that have already been visited. At each step, show the node
being visited, and the queue. For A*, also indicate f, g, and h for
each node in the queue.
For example, if we were doing BFS from Arad to Bucharest, your output
should look like:
Arad [Sibiu, Timisoara, Zerind]
Sibiu [Timisoara, Zerind, Oradea, Fagaras, Rimnicu Vilcea]
Timisoara [Zerind, Oradea, Fagaras, Rimnicu Vilcea, Lugoj]
Update! Please use the following values as the heuristic from each city to Drobeta:
Arad 240 Mehadia 75
Bucharest 315 Neamt 450
Craiova 120 Oradea 350
Drobeta 0 Pitesti 210
Eforie 460 Rimnicu Vilcea 180
Fagaras 300 Sibiu 270
Giurgiu 305 Timisoara 175
Hirsova 440 Urziceni 370
Iasi 410 Vaslui 385
Lugoj 140 Zerind 290
Question 2: Focused crawler (40 points total)
For this part of the assignment, you will build a focused
crawler. A focused crawler is a web crawler that looks for pages
that fit a particular description. We can implement this using our
typical search paradigm. The pseudocode looks like this:
get a start page
while (not done)
fetch that page and extract its outward links
if page is "good enough", keep it
enqueue each of the outward links according to an estimate of its value
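Here is a minimal sketch of that loop in Python, just to make the control flow concrete. The fetch and score callables and the outwardLinks attribute are stand-ins for what wikipage.py and scorer.py actually provide, and using the parent page's score as each link's estimate is only one possible heuristic:

import heapq

def focusedCrawl(startURL, fetch, score, nGoalPages, threshold):
    queue = [(0.0, startURL)]            # heap of (negated score, URL)
    kept = []
    while queue and len(kept) < nGoalPages:
        _, url = heapq.heappop(queue)    # best-scoring page first
        page = fetch(url)
        s = score(page)
        if s >= threshold:               # "good enough"? keep it
            kept.append(page)
        for link in page.outwardLinks:   # enqueue by estimated value
            heapq.heappush(queue, (-s, link))
    return kept

(heapq pops the smallest element, so scores are negated to get best-first order.)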
To make life simple for this assignment, I will be providing you with
a great deal of the code that you need. In addition, we will not be
crawling the entire Web. Instead, we will be working with a local copy
of Wikipedia. (This means we don't need to worry about malformed
pages, spider traps, servers timing out, or any number of other
real-world crawling hazards.)
NOTE: you are only to use the local copy of wikipedia installed on
scorpio.cs.usfca.edu for this assignment. In particular, do NOT crawl
wikipedia's main site (www.wikipedia.org). This is a violation of
Wikipedia's crawling policies, and could get the whole department
blocked from using wikipedia. Anyone found to be crawling
wikipedia.org (or any other non-local mirror) will receive an
automatic zero for this assignment.
NOTE ALSO: this assignment may take you some time to complete. In
particular, fetching several hundred pages from scorpio may take several
minutes (running locally) or longer (if running from a slow connection
at home). Some advice:
- Test your code carefully before doing a large crawl.
- Run your code locally on the lab machines if at all
possible. This will reduce network latency.
- Learn how to set up a crawl to run in the background.
- DO NOT wait until the last minute to do this. "scorpio was
really slow because everyone was pounding it with requests" will not
be accepted as a reason for turning in your code late.
I have provided you with three classes that do the bulk of the work;
your job will be to fill in the gaps.
To begin, wikipage.py contains a
class for storing wikipedia pages. It's almost complete, but needs a
little work from you.
(5 points) Fix wikipage.py to remove stopwords and non-words
(anything containing punctuation). You MUST use a list comprehension
to do this. The goal here is to get rid of the 'noise' in the
document: words that don't help us determine its content. You should
err on the side of deleting real words, rather than leaving
non-words in the document.
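As a sketch, a single list comprehension can apply both filters at once. The stopword list here is abbreviated, and the name of the word-list variable is an assumption about wikipage.py's internals:

import string

stopwords = set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'is'])  # etc.

# Keep a word only if it is not a stopword and contains no punctuation.
words = [w for w in words
         if w not in stopwords
         and not any(c in string.punctuation for c in w)]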
(5 points) Currently, wikipage.py stores all hrefs in the
outwardLinks variable. Use one or more (you only need one) list
comprehensions to remove: external links to sites other than
scorpio, navigation links (they begin with '#'), and links to edit
pages (they contain the string "action=edit"). The goal here is
to keep only links to content-containing Wikipedia pages.
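One possible shape for that comprehension, assuming links are stored as strings (the hostname test is an assumption about how scorpio URLs appear):

# Drop in-page anchors, edit links, and external links to other sites.
outwardLinks = [link for link in outwardLinks
                if not link.startswith('#')
                and 'action=edit' not in link
                and (not link.startswith('http')
                     or 'scorpio.cs.usfca.edu' in link)]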
Next, crawler.py contains an
almost-complete focused crawler. Once you've fixed wikipage, you can
try it out like this:
import crawler, scorer

c = crawler.crawler()
s = scorer.randomScorer()
c.crawl(nGoalpages=10, nTotalPages=100, threshold=0.1, scorer=s)
The scorer classes can be found in scorer.py. More on that
in a minute.
crawl() takes six arguments, all of which have defaults: nGoalpages
(the number of pages to find), nTotalPages (the total number of
pages to crawl), threshold (how good a page's score
must be for it to be kept), fractionLinksUsed (what fraction of a
page's links should be followed), scorer (an object that scores a page),
and startingURL (where to start the crawl).
fractionLinksUsed is used to provide some control over the depth of
the search. Many wikipedia pages contain lots of links, and by
exploring every one of them, you may not be able to quickly search
very far away from your starting page. This parameter gives some
control over that trade-off.
The crawler starts at the start URL (a random page if none is given),
fetches it, scores it, and extracts the outward links. Each of those
outward links is then fetched and scored, and the resulting pages are
placed in the search queue (implemented using a heap). The frontmost
node is then dequeued, and the process continues. Once nTotalPages
have been fetched, the crawler will quit following external links. It
will continue to evaluate pages already in the queue until the queue
is empty.
(5 points) One thing the crawler is missing is a closed list. Add
a closed list (implemented using a dictionary) to the crawler so
that URLs that have already been added to the queue are never
enqueued a second time.
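A sketch of the idea, with a hypothetical addToQueue helper standing in for wherever crawler.py actually enqueues links:

import heapq

closedList = {}

def addToQueue(queue, url, estimate):
    # Only enqueue URLs we have never seen before.
    if url in closedList:
        return
    closedList[url] = True
    heapq.heappush(queue, (-estimate, url))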
At this point, we have a crawler, but it's not very clever. The reason
for this is that we need a scorer. A scorer is an object that can
tell us how "good" a page is. We'll build two scorers for this
assignment: one that uses keywords, and one that uses a vector model.
scorer.py contains an abstract base
class called scorer, and a sample subclass called randomScorer. Each
of these has one method, score(), which takes a single input: a
wikipage object.
(10 points) To begin, we'll implement a KeywordScorer. It will
be designed to take as input a keyword query (like what you'd give to
Google) and set the page's score to a value between 0 and 1.
KeywordScorer should have an __init__ method that takes one argument:
a search string. Legal search strings are of the form:
"word1 word2 (word3 word4 word5) word6 word7 (word8 word9) ..."
Where words inside parentheses are treated as ORed together, and all
other words are ANDed. So, if we have the query
"cat dog bird (bunny snake) fish (horse lion)"
we would like to find documents that contain the words cat, dog, bird
and fish, either bunny or snake (or both), and either horse or lion
(or both).
To score a document, count the number of ANDed words that are
contained within the document, along with the number of ORed clauses
that have any word in the document. To normalize this, we then divide
by the length of the query, which is the number of ANDed terms plus
the number of ORed clauses.
For example, if we used the above query on a document containing the
words 'cat cat bunny snake horse fish', we would have a score of 4/6:
cat, fish, and both ORed clauses match, but dog and bird do not.
Notice that multiple occurrences of a word count as a single
match. Your scorer should always return a value between 0 and 1.
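A sketch of one way to structure this, assuming the wikipage object exposes its cleaned word list as page.words and that the base class in scorer.py is named scorer:

import re
from scorer import scorer

class KeywordScorer(scorer):
    def __init__(self, query):
        # Parenthesized groups are OR clauses; remaining words are ANDed.
        self.orClauses = [set(group.split())
                          for group in re.findall(r'\(([^)]*)\)', query)]
        self.andWords = re.sub(r'\([^)]*\)', ' ', query).split()

    def score(self, page):
        words = set(page.words)    # assumed wikipage attribute
        hits = sum(1 for w in self.andWords if w in words)
        hits += sum(1 for clause in self.orClauses if clause & words)
        page.score = float(hits) / (len(self.andWords) + len(self.orClauses))
        return page.score

On the example above, andWords would be ['cat', 'dog', 'bird', 'fish'] and orClauses would be [{'bunny', 'snake'}, {'horse', 'lion'}], giving the 4/6 score computed earlier.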
You should also write a main for crawler.py that allows it to be run
from the command line. Provide a command-line argument that allows the
user to specify a scorer. For example:
python ./crawler.py -k "cat dog (fish bunny) frog"
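A minimal main might look like the following; the -k flag matches the example above, but everything else (argparse versus another option parser, the default scorer, having KeywordScorer and randomScorer in scope inside crawler.py) is a suggestion, not a requirement:

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='focused crawler')
    parser.add_argument('-k', dest='keywords',
                        help='keyword query, e.g. "cat dog (fish bunny)"')
    args = parser.parse_args()
    s = KeywordScorer(args.keywords) if args.keywords else randomScorer()
    c = crawler()
    c.crawl(scorer=s)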
Run your crawler on several different search queries. Try to fetch at
least 10 matching pages. You will probably do best starting from
specific pages that have some similarity to your queries. I would also
suggest limiting the size of your crawl to small values (say 100) at
first until you are sure it works. Then you might want to try a larger
crawl (say 500) to see how your crawler does.
Answer this question!! Do a crawl of at least 100 pages with
a query using at least 3 terms. Examine the pages that are found by
hand. How does your crawler do? If there are erroneous pages, why do
you think they were discovered? What could be done to correct this?
(15 points) Vector modeling.
Our KeywordScorer has a few problems. For one thing, it doesn't
take word frequency into account. If I do a keyword search with the
query 'aardvark blue', documents containing one of these words are
all scored equally, even though 'aardvark' is a much less common, and
therefore more informative, word than 'blue'.
Also, the KeywordScorer provides a particular mode of
interaction between the user and the agent. For some tasks, using
keywords may work fine, but in other cases, they may be cumbersome for
users. For example, users often have a difficult time building complex
queries: the average Google query is 2 or 3 words long. Also, a user
may have a hard time coming up with keywords that specifically
describe her interests. Instead, she might want to say "find more
documents like these." To address this, we'll provide
the FocusedCrawler with a set of pages (which we'll call a
similarity set) that represents the topics the user is interested
in. The FocusedCrawler will use the similarity set to find other pages
that are similar.
For this part, you'll implement a class called TFIDFScorer. It should
derive from scorer and have two methods, __init__() and
score(). __init__ should take as input a list of URLs which represent
the similarity set. It should then use the wikipage class to fetch
each of these. You should then combine all of the pages in the
similarity set into a single wikipage; this will make them easier to
work with.
The first thing you'll need to add to wikipage is a dictionary that
maps each word in a document to the number of times it occurs. (You
probably already have this code from assignment 1). We will call this
dictionary the term frequencies. Our intuition will be that
words that occur frequently in the similarity set are useful words to
look for in candidate pages.
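A couple of lines suffice for the counting; 'words' stands for the page's cleaned word list:

# Map each word to the number of times it occurs in this page.
termFrequencies = {}
for word in words:
    termFrequencies[word] = termFrequencies.get(word, 0) + 1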
But we don't want just any frequently occurring word; we want words
that occur frequently in the similarity set, but less frequently in
general. For example, 'aardvark' appearing 10 times in the similarity
set seems more relevant than 'date' appearing 10 times.
To address this, we'll count how frequently words occur in a random
set of documents. We'll refer to the frequency of a word across
documents in general as the document frequency.
To compute the document frequency, you will need to select 500
randomly-chosen wikipedia pages. There is a method in crawler.py that
will do this for you. (note: this takes 10-15 minutes. Do not wait
until the last minute to do this!) We will refer to this body of text
as a corpus. It's meant to give a statistically reasonable
estimate of the actual frequencies of words in English.
You will need to extend getRandomPages to do the following: for each
word in all of the pages, count the number of
documents it appears in. This is the document frequency of the
word. For example, if 'cat' occurs in 3 documents, it has a document
frequency of 3 (multiple occurrences in a document do not count extra).
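In code, the only subtlety is collapsing each page to a set first, so repeat occurrences within a page don't inflate the count. 'corpusPages' stands in for whatever getRandomPages returns:

# Map each word to the number of corpus pages containing it.
documentFrequency = {}
for page in corpusPages:
    for word in set(page.words):      # set(): each page counts once
        documentFrequency[word] = documentFrequency.get(word, 0) + 1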
Term frequency and document frequency allow us to assign a score to
each word in the similarity set indicating its relative strength in
characterizing the documents in the set. We compute this score using a
formula known as TFIDF (term frequency * inverse document frequency).
Each word in the similarity set should be assigned the following
score:
termFrequency(word) * log(sizeOfCorpus / documentFrequency(word))
(If a word appears in the similarity set but not in the corpus, you
should use 1 for its document frequency.)
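Putting the formula into code is direct; the dictionary names follow the earlier snippets, and sizeOfCorpus is just the number of random pages fetched:

from math import log

sizeOfCorpus = 500                         # or len(corpusPages)
tfidf = {}
for word, tf in termFrequencies.items():
    df = documentFrequency.get(word, 1)    # unseen word: df = 1
    tfidf[word] = tf * log(float(sizeOfCorpus) / df)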
Documents as Vectors
Why are we doing this? It allows us to construct a model of a text
document as an n-dimensional vector, where n is the
number of words in the corpus, and the value of the nth element
in the vector is that word's TFIDF score with respect to that
document.
We can now construct a vector that represents the TFIDF scores of all
words in the similarity set.
Scoring a page
Your TFIDFScorer should have a method called score() that takes a
wikipage as input. This is the page that our crawler is exploring;
we'll call it the query page.
Your scorer should construct a vector representing the query page in
the same way the vector was constructed for the similarity set: remove
stop words etc., count term frequencies for each word, and compute
TFIDF using the document frequencies from the corpus.
At this point, we have two vectors to compare. To do this, we will
measure the angle between the vectors. Documents that are identical
will have identical vectors, and an angle of 0. Documents that are
completely dissimilar will have an angle of 90 degrees (or pi/2, in
radians). In fact, rather than measuring the angle itself, we'll
measure the cosine of the angle. This is easier to compute, and also
has the advantage of lying in the range [0, 1], like our other scorer.
An angle of 0 degrees has a cosine of 1, and an angle of 90 degrees
has a cosine of 0 (higher cosine means more similar).
To compute the cosine of the two vectors, use the following formula:
cosine(q, s) = sum(q[w] * s[w] over all words w) /
               (sqrt(sum(q[w]^2)) * sqrt(sum(s[w]^2)))
where q is the query-page vector and s is the similarity-set vector.
In words, the numerator tells us how similar the documents are. For
each word in the query page's vector, multiply its weight by its
weight in the similarity set's vector (if a word is absent from either
vector, its weight is 0), and sum up all of these products.
The denominator normalizes this score by dividing by the lengths of
the two vectors (this is Pythagoras' rule). For each word in the
similarity vector, square its weight; add up these squares and take
the square root of the sum. Do the same for the query vector. The
product of those two square roots is the denominator.
This cosine is the page's score. You're done! Your TFIDFScorer should
use this value to set the page's score so that your focused crawler
can use it to enqueue pages.
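As a sketch, with both vectors stored as word-to-weight dictionaries (as suggested in the hints below):

from math import sqrt

def cosine(q, s):
    # Words missing from a vector implicitly have weight 0, so only
    # words shared by both vectors contribute to the numerator.
    numerator = sum(q[w] * s[w] for w in q if w in s)
    qLen = sqrt(sum(weight ** 2 for weight in q.values()))
    sLen = sqrt(sum(weight ** 2 for weight in s.values()))
    if qLen == 0 or sLen == 0:
        return 0.0          # an empty vector matches nothing
    return numerator / (qLen * sLen)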
Run your focused crawler with the TFIDFScorer. You will have better
performance if you start with a well-focused similarity set and begin
the crawl somewhere "close" to the similarity set.
Answer this question!!! Do a run of the same size as you did
for the KeywordScorer. How do the two compare? Are there some cases
where one does better than the other? Why do you think that is?
Hints:
List comprehensions are your friend. They make it very easy to (for example)
compute a list of weights for a long list of words.
Vectors should be stored in dictionaries, as should frequency counts.
You only need to compute the document frequencies once. Don't
do it every time you run your crawler (it takes a while). Compute
them once and write the object out to a file. Next time you run your
crawler, read it back in. To do this, use the pickle module. dump()
and load() are the methods you'll be most interested in. (Just read it
in once each time you run the crawler)
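For example (the filename is arbitrary):

import pickle

# After the one-time corpus crawl:
with open('docfreq.pkl', 'wb') as f:
    pickle.dump(documentFrequency, f)

# At the start of each later run:
with open('docfreq.pkl', 'rb') as f:
    documentFrequency = pickle.load(f)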
sqrt and log are in the math module, which must be imported.
reduce() is a very nice way to apply a function to a list of numbers and get
a scalar back. It works very nicely in conjunction with list
comprehensions. For example:
>>> l = [1, 2, 3, 4, 5]
>>> def add(x, y):
...     return x + y
...
>>> reduce(add, l)
15
(In Python 3, reduce must be imported from the functools module.)
If all you need to do is add up the elements of a list, you can also use sum().
If you want to know more about TFIDF, and vector models for comparing
documents in general, the information retrieval literature has plenty
of introductions to the vector space model. None of this is required
reading; it's mentioned only in case you want more information.