CS 662: AI Programming
Assigned: September 22
Project #1: Focused Web Crawling
Due: Wed., October 12, 5 pm
Total points: 100
A web crawler is a program that 'crawls the web' - it fetches a web
page, extracts all the references to other documents, and fetches each
of those documents. For each of those documents, it extracts their
references to other documents, and so on. In this way, it can explore
the web on an automated basis.
A focused web crawler is a web crawler that looks for pages
about a given topic, or that match some externally provided criteria.
For example, it might only retrieve pages about basketball. In this
project, you'll build a focused web crawler and experiment with two
different methods for telling it what the user is interested in.
We can describe a web crawler as a search agent, and use any of the
algorithms we've studied so far. For example, an unfocused web crawler
might choose to use breadth-first search. Our focused crawler will
instead use a form of best-first search, where we put a
priority on exploring hyperlinks that look to be promising.
In a nutshell, the focused crawler will work as follows:
- Begin with a 'seed page' provided by the user.
- Enqueue this page.
- While not done:
- Dequeue a page and fetch each of the pages it links to.
- For each newly fetched page:
- Extract its text and determine how good a match it is.
- Enqueue the new pages according to priority; better matches will
go at the front of the queue.
Each of the subtasks is described below, along with the points that
task is worth. Within a task, we'll assign 75% of the points
based on functionality (how well does your code work? Does it do what
it's supposed to?) and 25% of the points based on style. (Is your code
commented? Is it readable? Is it efficient? etc.) For parts 3 and 4,
answering the questions will count as part of the 'style' grade.
You should turn in the following:
- A hard copy of your code.
- Place a copy of your code in the submit directory.
- Turn in a 1-2 page document explaining how to run your agent and
answering the questions at the end of parts 3 and 4. (note: we should
be able to run your agent for all configurations and scorers
without modifying your code at all. "Comment out these lines
and uncomment these other lines" is not a sufficient user interface.)
Late penalties will be subtracted from the final score. Since the
project is worth 100 points, one day late is a loss of 10 points.
So, if you had a score of 85 and were one day late, you would get a
75. (Hint: turn things in on time!)
For this project, you are welcome (and encouraged) to use any of the
built-in Python modules. I'll make specific suggestions during the
description for each part. If you use third-party code, please
document it properly. As always, everyone is expected to do their own
work. Code that is plagiarized from other students or from the web
will result in a zero on this project.
In the remainder of this document, we'll step through the pieces that
you'll need to build a focused web crawler.
The Document class (35 points)
In this part, you'll build a Document class. This will serve as a
representation of the information of interest in a web page. It should
have (at least) the following data members:
- url - The URL for this document.
- outwardLinks - a list containing all URLs found in this page.
- text - a list of all significant words in this page.
- score - the 'value' of this page.
- title - the title of the page.
- headings - a list of all significant words found within
<h1> or <h2> tags.
You should be able to create a Document object as follows:
>>> import project1
>>> d = project1.Document('http://www.usfca.edu')
>>> d.outwardLinks
[ ..., 'http://www.usfca.edu/online/about_USF/', ... ]
>>> d.text
['University', 'San', 'Francisco', 'About', 'USF', 'Academic',
'Programs', 'Prospective', 'Students', 'Current', 'Students',
'Faculty', 'Staff', 'Alumni', 'Giving', 'News', 'Events', 'Athletics',
'Site', 'Index', 'Search', 'USF', 'Contact', 'Us']
How to do this?
You'll want to take advantage of Python's SGMLParser class. SGML is a
general family of markup languages, of which HTML is one instance. The
SGMLParser can be subclassed to build an HTML parser that suits our needs.
Let's say you call it DocumentProcessor.
In particular, you can define methods that are called when a
particular tag is encountered. For outward links, we'll be
particularly interested in 'a' (or anchor) tags, so you'll want to
override start_a. Similarly, for title, h1, and h2, you'll want to
override start_title, start_h1, and start_h2.
Dive Into Python has a really nice set of examples that will help
you see how to do this.
To extract the text, you'll want the DocumentProcessor to override
SGMLParser's handle_data method. You'll also want to filter the text
that's in the page: much of it is not very helpful for determining
what a page is about.
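To make this concrete, here is a minimal sketch of what such a
subclass might look like. The attribute names and the word-splitting
are just one possible design of my own; your Document class would
assemble its data members from lists like these.

import sgmllib

class DocumentProcessor(sgmllib.SGMLParser):
    def reset(self):
        # SGMLParser.__init__ calls reset(), so initialize state here
        sgmllib.SGMLParser.reset(self)
        self.links = []
        self.words = []
        self.headingWords = []
        self.titleWords = []
        self.inTitle = 0
        self.inHeading = 0

    def start_a(self, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        for name, value in attrs:
            if name == 'href':
                self.links.append(value)

    def start_title(self, attrs):
        self.inTitle = 1

    def end_title(self):
        self.inTitle = 0

    def start_h1(self, attrs):
        self.inHeading = 1

    def end_h1(self):
        self.inHeading = 0

    # h2 tags can share the h1 handlers
    start_h2 = start_h1
    end_h2 = end_h1

    def handle_data(self, data):
        # called with each run of plain text between tags
        if self.inTitle:
            self.titleWords.extend(data.split())
        if self.inHeading:
            self.headingWords.extend(data.split())
        self.words.extend(data.split())

You would then feed() the raw HTML to an instance and call close();
the lists above hold the results.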
To begin with, filter out any words that contain a non-alphabet
character. This includes dates, numbers, and markup elements. You may
find the string method isalpha() to be of use in this.
Next, we'll remove stop words. These are words such as a, an,
he, she, and the that have little or no information in them. You can
find a list of stop words here.
So, your Document class should use the DocumentProcessor to extract the
URLs from a document and store them in the outwardLinks variable. It
should take all text contained within h1 and h2 tags and store them in
the respective variables in the Document class. It should also take
the stream of text in the page from the DocumentProcessor, split it into
words, and remove anything that either contains a non-alpha character
or is a stop word.
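As a sketch, assuming rawWords is the word list produced by the
DocumentProcessor and stopWords is a dictionary of stop words you've
loaded (both names are placeholders), the filtering step can be a
single list comprehension:

words = [w for w in rawWords if w.isalpha() and w.lower() not in stopWords]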
One wrinkle that you'll need to deal with: Many web pages contain
href's that are relative URLs ; that is, they don't begin with
"http://" and a hostname. These refer to other documents on the same
server. You'll need to catch these and prepend the 'http://' and
hostname yourself.
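The standard urlparse module's urljoin() function handles this; a
minimal sketch (the base URL here is made up):

import urlparse

# base is the URL of the page the relative href appeared on
base = 'http://www.usfca.edu/online/index.html'
print urlparse.urljoin(base, 'about_USF/')
# prints: http://www.usfca.edu/online/about_USF/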
- Your program is only required to deal with plain text and HTML. You
may discard URLs referring to all other document types.
- The Web is full of malformed pages. Your program will need to
process large numbers of pages, some of which are not going to be
valid HTML. Your program will need to be able to recover gracefully if
the DocumentProcessor is unable to parse a page.
- The Web is also full of flaky servers. You may send an http
request to a server and never get a reply. Your program should deal
with this by setting a timer. I recommend using socket.settimeout() to
set a time limit on each connection; a sketch appears after these notes.
- You may also run into https:// URLs that expect you to provide a
username and password. The HTTPBasicAuthHandler object in the urllib2
module can be used to automatically provide a username and password
(which will likely be incorrect, but it will keep your program from
hanging while it waits for input).
- Take advantage of Python's features; if you find yourself doing
this in a C-style manner, there's probably an easier way.
- Remember Python's motto: "Batteries Included." Many common tasks
have built-in functions or modules.
- s.strip(chars) will return a new string with leading and trailing
characters found in chars removed.
Without an argument, it will strip whitespace.
- List comprehensions are a very effective way to quickly filter
and transform a list. For example, [item.lower() for item in ['Cat',
'Dog', 'Bird'] if len(item) < 4] returns ['cat', 'dog']
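Picking up the timeout note above: because urllib opens its sockets
internally, the easiest way to apply the timer is the module-level
socket.setdefaulttimeout(), which affects every socket created
afterwards. A minimal sketch (the 10-second value is arbitrary):

import socket

# any fetch that takes longer than 10 seconds will now raise an
# exception instead of hanging forever
socket.setdefaulttimeout(10)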
The FocusedCrawler class (25 points)
The next step is to build an agent, which we'll call a FocusedCrawler,
that can collect a set of pages for us.
FocusedCrawler should have (at least) the following data members:
- root - the page where the crawl begins.
- scorer - an object that can score a page.
- ignoreInternal - a flag that tells whether to crawl internal
links (links whose URL has the same hostname; see below).
FocusedCrawler should have a crawl() method that takes two arguments, both
with default values: nresults, which indicates how many web pages
we want the crawler to find, and threshold, which indicates 'how
similar' a web page must be for the crawler to keep it.
The crawler should use the heapq module to build a priority
queue. It should begin by creating a Document object using the root
URL. It should then insert (score, URL) tuples representing each of
the outward links into the priority queue.
As a (score, URL) tuple is dequeued, you should use the URL to create
a new Document object. If the score of that Document is above the
user-provided threshold, then save the document. Extract all outward
links from the Document and enqueue them as (score, URL) tuples, where
score is the score of the document that contained the outward link.
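Sketched in code, the core loop might look roughly like this.
Document and the scorer come from the other parts of this project;
the priority() helper, the default argument values, and the omission
of the robots.txt check, internal-link filtering, and error handling
are placeholders of my own:

import heapq

def priority(score):
    # placeholder: heapq is a min-heap, so pushing the raw score will
    # not give you best-first order -- see the note below
    return score

class FocusedCrawler:
    def __init__(self, root, scorer, ignoreInternal=False):
        self.root = root
        self.scorer = scorer
        self.ignoreInternal = ignoreInternal

    def crawl(self, nresults=10, threshold=0.5):
        results = []
        closed = {self.root: True}   # URLs we've already visited
        queue = []                   # (priority, URL) tuples
        rootDoc = Document(self.root)
        rootScore = self.scorer.score(rootDoc)
        for url in rootDoc.outwardLinks:
            heapq.heappush(queue, (priority(rootScore), url))
        while queue and len(results) < nresults:
            p, url = heapq.heappop(queue)
            if url in closed:
                continue
            closed[url] = True
            doc = Document(url)      # wrap in try/except in real code
            score = self.scorer.score(doc)
            if score > threshold:
                results.append(doc)
            for link in doc.outwardLinks:
                heapq.heappush(queue, (priority(score), link))
        return results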
One thing we need to be aware of when crawling the web is not
overloading remote servers. In order to help direct crawlers and
spiders, many servers will use a file known as robots.txt to indicate
which directories a crawler is allowed to visit.
Python has a module called robotparser that automates much of this
process. You can use it to fetch a robots.txt file (if one exists) and
then use the can_fetch method to determine whether a URL is 'safe'
to visit.
Your crawler should use the robotparser to check for robots.txt
permissions before visiting any sites.
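A sketch of how this might look (the specific URLs are made up):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.usfca.edu/robots.txt')
rp.read()    # fetch and parse the robots.txt file
# True if a crawler identifying itself as '*' may fetch the URL
print rp.can_fetch('*', 'http://www.usfca.edu/courses/index.html')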
We'll also want to make sure that we don't revisit pages that we've
seen before by maintaining a closed list. For this project we'll keep
it simple and just maintain a 'closed dictionary' of URLs that have been
previously visited. Before creating a Document, check your closed
dictionary to see if the URL has already been visited. If so, ignore
it. If not, visit it and add it to the closed dictionary.
Once you've found enough pages with a score higher than the threshold,
you may terminate the crawl and return all results.
If ignoreInternal is True, then your FocusedCrawler should ignore
internal links, which are links to other pages whose URL has the
same hostname. For example, if we were crawling www.cs.usfca.edu with
ignoreInternal set to True, we would ignore links to
www.cs.usfca.edu/~brooks. (but not to www.usfca.edu, since
that's a different hostname.)
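The urlparse module can pull out the hostname for this comparison; a
sketch (the helper name is my own):

import urlparse

def sameHost(url1, url2):
    # element 1 of the parse tuple is the network location (hostname)
    return urlparse.urlparse(url1)[1] == urlparse.urlparse(url2)[1]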
But how to score a page?
That will be the focus of parts 3 and 4. For the moment, you can
assign pages a random score between 0.0 and 1.0. (Use the random
module for this.) For this part, focus on getting the search
algorithm working.
One issue: the heapq module implements a min-heap; the smallest object
is always at the front. But we need a max-heap; the highest priority
page is at the front. I'll let you figure out how to solve this.
Query-driven search (15 points)
Our initial stab at finding webpages that match a user's interests
will be by using a keyword query. This is a set of keywords
(or disjunctions of keywords) that should appear in documents. I'm
sure you've used this sort of interface; it's what almost all search
engines use. In general, keyword queries are great if we know specific
words that are in documents of interest.
We will represent a boolean query as a string of the following form:
"word1 word2 (word3 word4 word5) word6 word7 (word8 word9) ..."
Where words inside parentheses are treated as ORed together, and all
other words are ANDed. So, if we have the query
"cat dog bird (bunny snake) fish (horse lion)"
we would like to find documents that contain the words cat, dog, bird
and fish, either bunny or snake (or both), and either horse or lion
(or both).
To score a document, count the number of ANDed words that are
contained within the document, along with the number of ORed clauses
that have any word in the document. To normalize this, we then divide
by the length of the query, which is the number of ANDed terms plus
the number of ORed clauses.
For example, if we used the above query on a document containing the
words 'cat cat bunny snake horse fish' we would have a score of 4/6.
Notice that multiple occurrences of a word count as a single match.
Your scorer should always return a value between 0 and 1.
Your crawler should use the KeywordQueryScorer to assign a priority to
a web page. All the outward links (references to other pages) in this
page receive this priority when they are inserted into the priority
queue. (Remember, highest priority at the front!) KeywordQueryScorer
should have a score() method that takes a page object as input and
returns the score as defined above.
The re (regular expression) module is very useful for parsing queries.
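As one possible sketch (the parsing details are up to you; the
attribute names here are my own):

import re

class KeywordQueryScorer:
    def __init__(self, query):
        # parenthesized groups are OR clauses; everything else is ANDed
        self.orClauses = [c.split() for c in
                          re.findall(r'\(([^)]*)\)', query)]
        self.andWords = re.sub(r'\([^)]*\)', ' ', query).split()

    def score(self, doc):
        # note which words appear; multiple occurrences count once
        present = {}
        for w in doc.text:
            present[w.lower()] = True
        matches = len([w for w in self.andWords if w.lower() in present])
        for clause in self.orClauses:
            if [w for w in clause if w.lower() in present]:
                matches += 1
        return float(matches) / (len(self.andWords) + len(self.orClauses))

On the example query and document above, this returns 4/6.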
Answer this part!
You should now be able to feed your crawler
a root URL and a search query and let it try to find documents that
match the query above the threshold that you provide. Try several
queries, some with a small number of terms and some with a larger
number. Try both common and rare words. Look at the results that your
crawler returns. Experiment with different thresholds as well (0.75 is
a good starting point). What threshold seems to find documents that
fit your information needs (we'll call that precision) without missing
too many useful documents (we'll call that recall)?
Write a paragraph characterizing the performance of your crawler at
this point. I'm particularly interested in your explanation of why
you think you are seeing the behavior you are seeing. Be
specific; for example, how accurate do you think your crawler is? Does
it find documents that seem to match what you were looking for? How
many documents does it look at in order to find enough matches? If it
performs well, explain why that is. If you feel that the
performance is not as good as it could be, explain why that is.
Topic-driven search (25 points)
Our KeywordQueryScorer has a few problems. For one thing, it doesn't
take word frequency into account. If I do a keyword search with the
query 'aardvark blue', documents containing one of these words are
all scored equally, even though aardvark is a much less common word.
Also, the KeywordQueryScorer provides a particular mode of
interaction between the user and the agent. For some tasks, using
keywords may work fine, but in other cases, they may be cumbersome for
users. For example, users often have a difficult time building complex
queries: the average Google query is 2 or 3 words long. Also, a user
may have a hard time coming up with keywords that specifically
describe her interests. Instead, she might want to say "find more
documents like these." To address this, we'll provide
the FocusedCrawler with a set of pages (which we'll call a
similarity set) that represents the topics the user is interested
in. The FocusedCrawler will use the similarity set to find other pages
that are similar.
For this part, you'll implement a class called TFIDFScorer. It should
have two methods, __init__() and score(). (You may find it helpful, or
at least aesthetically appealing, to make an abstract DocumentScorer class
and derive both scorers from it.)
__init__() should read a list of URLs in from a file and construct the
similarity set. To do this, create a Document object for each URL, and
then pass the text from each of these pages into a frequency
counter. This frequency counter should be implemented as a dictionary
that maps words to their number of occurrences in all documents in the
similarity set. We will call this dictionary the term
frequencies. Our intuition will be that words that occur
frequently in the similarity set are useful words to look for.
But we don't want any frequently-occurring word; we want words that
occur frequently in the similarity set, but less frequently in
general. For example, 'aardvark' appearing 10 times in the similarity
set seems more relevant than 'date'.
To address this, we'll count how frequently words occur in a random
set of documents. We'll refer to the frequency of words in documents
in general as the document frequency.
To compute the document frequency, select 1000 randomly-chosen web
pages. We will refer to this body of text
as a corpus. It's meant to be a statistically significant
estimation of the actual frequencies of words in English.
There are 1000 pages in /home/public/cs662/corpus that you may use for
this. Alternatively, you may build your own. There are several engines
on the web that will return random web pages to you. (Put 'random URL'
into Google to find some.) I used http://random.yahoo.com/fast/ryl.
For each word in all of the pages (after removing stop words, tags,
and non-alphabetic words as above) , count the number of
documents it appears in. This is the document frequency of the
word. For example, if 'cat' occurs in 3 documents, it has a document
frequency of 3.
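A sketch of the counting, assuming corpusDocs is a list of Document
objects built from the corpus pages (the function name is my own):

def documentFrequencies(corpusDocs):
    df = {}
    for doc in corpusDocs:
        # count each word at most once per document
        seen = {}
        for word in doc.text:
            seen[word] = True
        for word in seen:
            df[word] = df.get(word, 0) + 1
    return df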
Term frequency and document frequency allow us to assign a score to
each word in the similarity set indicating its relative strength in
characterizing the documents in the set. We compute this score using a
formula known as TFIDF (term frequency * inverse document frequency).
Each word in the similarity set should be assigned the following
score:
termFrequency(word) * log(sizeOfCorpus / documentFrequency(word))
(If a word appears in the similarity set but not in the corpus, you
should use 1 for its document frequency.)
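In code, the weight of a single word might be computed like this
sketch, mirroring the formula above (the dictionary arguments are the
term and document frequencies built earlier):

import math

def tfidf(word, termFrequency, documentFrequency, sizeOfCorpus):
    tf = termFrequency.get(word, 0)
    df = documentFrequency.get(word, 1)  # 1 if absent from the corpus
    return tf * math.log(float(sizeOfCorpus) / df)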
Documents as Vectors
Why are we doing this? It allows us to construct a model of a text
document as an n-dimensional vector, where n is the
number of words in the corpus, and the value of the nth element
in the vector is that word's TFIDF score with respect to that
document.
We can now construct a vector that represents the TFIDF scores of all
words in the similarity set.
Let's add one more wrinkle. Back in part 1, we collected all of the
words in the title and h1 and h2 headers. Presumably those words are
particularly important in classifying the document. To account for
this, add a multiplier of 2 to the TFIDF score of each word in an h2,
of 3 for each word in an h1, and of 5 for each word in the title.
Scoring a page
Your TFIDFScorer should have a method called score() that takes a
Document as input. This is the page that our crawler is exploring;
we'll call it the query page.
Your scorer should construct a vector representing the query page in
the same way the vector was constructed for the similarity set: remove
stop words etc., count term frequencies for each word, and compute
TFIDF using the document frequencies from the corpus.
At this point, we have two vectors to compare. To do this, we will
measure the angle between the vectors. Documents that are identical
will have identical vectors, and an angle of 0. Documents that are
completely dissimilar will have an angle of 90 degrees (or pi/2, in
radians). In fact, rather than measuring the angle itself, we'll
measure the cosine of the angle. This is easier to compute, and also
has the advantage of being in the range [0,1], like our other scorer.
An angle of 0 degrees has a cosine of 1, and an angle of 90 degrees
has a cosine of 0. (Higher cosine means more similar.)
To compute the cosine of the two vectors, use the following formula,
where the sums run over all words w:
sum(queryWeight(w) * simWeight(w)) /
(sqrt(sum(queryWeight(w)^2)) * sqrt(sum(simWeight(w)^2)))
In words, the numerator tells us how similar the documents are. For
each word in the query page's vector, multiply its weight by its weight in
the similarity page's vector. (If a word is absent from either vector, its
weight is 0.) Sum up all of these products.
The denominator normalizes this score by dividing by the total length
of the vector. (This is Pythagoras' rule). For each word in the
similarity vector, square its weight. Add up these weights and take the
square root of the sum. Do the same for the query vector. The product
of those two lengths is the denominator.
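A sketch of the computation, with both vectors stored as
word-to-weight dictionaries (the function and parameter names are my
own):

import math

def cosine(queryVec, simVec):
    # numerator: products of matching weights (0 if a word is absent)
    dot = sum([queryVec[w] * simVec.get(w, 0.0) for w in queryVec])
    # denominator: product of the two vector lengths
    qLen = math.sqrt(sum([wt ** 2 for wt in queryVec.values()]))
    sLen = math.sqrt(sum([wt ** 2 for wt in simVec.values()]))
    if qLen == 0.0 or sLen == 0.0:
        return 0.0
    return dot / (qLen * sLen)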
This cosine is the page's score. You're done! Your TFIDFScorer should
return this so that your FocusedCrawler can use it to enqueue pages.
List comprehensions are your friend. They make it very easy to (for example)
compute a list of weights for a long list of words.
Vectors should be stored in dictionaries, as should frequency counts.
You only need to compute the document frequencies once. Don't
do it every time you run your crawler (it takes a while). Compute
them once and write the object out to a file. Next time you run your
crawler, read it back in. To do this, use the pickle module. dump()
and load() are the methods you'll be most interested in. (Just read it
in once each time you run the crawler)
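For example (the filename is arbitrary):

import pickle

# after building the document-frequency dictionary the first time:
pickle.dump(docFreqs, open('docfreqs.dat', 'w'))

# on later runs, skip the corpus pass and just reload it:
docFreqs = pickle.load(open('docfreqs.dat'))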
sqrt and log are in the math module, which must be imported.
reduce() is a very nice way to apply a function to a list of numbers and get
a scalar back. It works very nicely in conjunction with list
comprehensions. For example:
>>> l = [1,2,3,4,5]
>>> def add(x,y):
...     return x + y
...
>>> reduce(add, l)
15
If all you need to do is add up the elements of a list, you can also use sum().
(for those of you who've used Lisp or Scheme, Python also has a lambda
operator, which lets you avoid defining add separately. If you have no idea
what lambda is, don't worry about it.)
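For instance, the sum above collapses to a one-liner:

>>> reduce(lambda x, y: x + y, [1,2,3,4,5])
15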
If you want to know more about TFIDF, and vector models for comparing
documents in general, you may be interested in the following
(these are not required reading; I'm just providing them in case you
want more information)
Answer this part!
Now run your crawler with the new TFIDFScorer. Build a similarity set
of at least 10 documents to characterize a user's interests. The
more specific they are, the better - a page about the rules of
baseball is better than the front page of cnn.com. Use a threshold of
0.75 as a metric for 'similar'. Look at the documents that your scorer
finds - would you consider them acceptably similar?
Write a paragraph comparing the performance of your crawler using the
TFIDFScorer to the crawler with the KeywordQueryScorer. Be specific;
for example, how many matches does each scorer find? What sorts of
documents do they do well on? What do they do poorly on? What sorts of
problems do both scorers face?
Extra credit: Below are four possible tasks that you can complete
for extra credit. Each of these will require that you learn or explore
some topic beyond that covered in the project. They're intended to
encourage you to learn more on your own about topics that interest you.
Extra Credit Rules:
- You may only do one of the extra credit problems for a
grade. (You're welcome to do them all for your own benefit, of
course.) If you do more than one and don't indicate which one you wish
to be graded on, we'll pick one at random.
- You must make a good-faith effort to complete all of the required
portions of the assignment first. If one or more of the required
sections are not done or barely done, we won't grade the extra
credit. (In other words, you can't do the extra credit instead of one
of the required parts.)
1. 5-10 points. Search efficiency. There are a number of practical
things that can be done to speed up our crawler. For example, much of
the execution time is spent waiting for a response from remote web
servers. The obvious solution is to fetch several pages at the same
time. There are two approaches to doing this. One is to use multiple
threads, and the other is to use select(). Extend the FocusedCrawler
to use either of these approaches for 5 points each.
2. 5-10 points. Add a graphical front end to your client. The GUI
should have a way for the user to enter search queries and also
specify similar documents. It should also provide hypertext links for
each of the web pages returned. You may also provide additional
features, such as visualization of the search, integration with
Firefox, etc. Be creative! This is worth 5-10 points, depending on the
functionality, usability, and features provided by your GUI.
There are a number of different graphical toolkits available for
Python. These include:
- Tkinter. This is an interface to the Tk toolkit. If you
use Tkinter, you will probably need to code the GUI by hand. (It's
similar to using the AWT in Java.) You can
find more information on using Tkinter here,
or in O'Reilly's Programming Python book.
- Qt is a third-party cross-platform GUI toolkit developed by
Trolltech. PyQt is a Python interface to Qt (which is written in
C++). If you use Qt, you may use the qtdesigner tool to lay out
your GUI. Our beloved sysadmin Alex has a tutorial on using qtdesigner
to create Python GUIs here. Qt is available under an open source
license from Trolltech if you want to install it on your own machine.
- wxPython. This is the toolkit that Boa Constructor uses. Boa
Constructor has a drag-and-drop GUI builder, like qtdesigner, but
targeted to Python. (Note: Boa Constructor is still very much in
development, and there seems to be an incompatibility between the
0.4.4 version and Python 2.4 under Linux.)
- Cocoa/PyObjC. If you're a Mac person (we have a shiny new Mac
lab!) you can use Interface Builder to build a GUI and PyObjC to
provide the glue. Jean Bovet has written a tutorial on how to do this.
3. 5 points. Corpus construction. The accuracy of TFIDF depends on the
quality of the corpus that's used to construct document
frequencies. By customizing the corpus, we can hopefully improve
performance when searching in particular domains. For example, if we
know that all of our users are interested in Java, we can collect a
corpus of Java pages; this will provide a better model of word usage
for our task.
Construct a corpus for a domain of your choosing. Experiment with the
size of the corpus. Test your focused crawler with these corpora and
evaluate the results. In what ways does the crawler perform better or
worse? What seems to be the best size for the corpus?
4. 1-5 points. Extend your crawler to deal with additional file
formats, such as PDF, Word, or PowerPoint. 1 point per format, up to
a maximum of 5 points.