CS 662: AI Programming
Project #1: Focused Web Crawling

Assigned: September 22
Due: Wed., October 12, 5 pm

Total points: 100

Introduction

A web crawler is a program that 'crawls the web': it fetches a web page, extracts all the references to other documents, and fetches each of those documents. For each of those documents, it extracts the references to other documents, and so on. In this way, it can explore the web on an automated basis.

A focused web crawler is a web crawler that looks for pages about a given topic, or that match some externally provided criteria. For example, it might only retrieve pages about basketball. In this project, you'll build a focused web crawler and experiment with two different methods for telling it what the user is interested in.


We can describe a web crawler as a search agent, and use any of the algorithms we've studied so far. For example, an unfocused web crawler might choose to use breadth-first search. Our focused crawler will instead use a form of best-first search, where we put a priority on exploring hyperlinks that look to be promising.

In a nutshell, the focused crawler will work as follows:
  1. Begin with a 'seed page' provided by the user.
  2. Enqueue this page.
  3. While not done:
       a. Dequeue a page and fetch each of the pages it links to.
       b. For each newly fetched page:
            i. Extract its text and determine how good a match it is.
           ii. Enqueue the new page according to its priority; better matches go at the front of the queue.

Grading

Each of the subtasks is described below, along with the points that task is worth. Within a task, we'll assign 75% of the points based on functionality (how well does your code work? Does it do what it's supposed to?) and 25% of the points based on style (is your code commented? Is it readable? Is it efficient?). For parts 3 and 4, answering the questions will count as part of the 'style' grade.

You should turn in your code (the project1 module shown below) and your written answers to the questions in parts 3 and 4.

Late penalties will be subtracted from the final score. Since the project is worth 100 points, one day late is a loss of 10 points. So, if you had a score of 85 and were one day late, you would get a 75. (Hint: turn things in on time!)


For this project, you are welcome (and encouraged) to use any of the built-in Python modules. I'll make specific suggestions during the description for each part. If you use third-party code, please document it properly. As always, everyone is expected to do their own work. Code that is plagiarized from other students or from the web will result in a zero on this project.

Project Details

In the remainder of this document, we'll step through the pieces that you'll need to build a focused web crawler.

The Document class (35 points)

In this part, you'll build a Document class. This will serve as a representation of the information of interest in a web page. It should have (at least) the following data members: the page's URL, the list of outward links, the filtered text of the page, and the text contained in the title, h1, and h2 tags. You should be able to create a Document object as follows:
>>> import project1
>>> d = project1.Document('http://www.usfca.edu')
>>> d.url
'http://www.usfca.edu'
>>> d.outwardLinks
['http://www.usfca.edu/online/about_USF/',
 'http://www.usfca.edu/online/about_USF/', ... ]
>>> d.text
['University', 'San', 'Francisco', 'About', 'USF', 'Academic',
 'Programs', 'Prospective', 'Students', 'Current', 'Students',
 'Faculty', 'Staff', 'Alumni', 'Giving', 'News', 'Events', 'Athletics',
 'Site', 'Index', 'Search', 'USF', 'Contact', 'Us']
How to do this?

You'll want to take advantage of Python's SGMLParser class (found in the sgmllib module). SGML is a general family of markup languages, of which HTML is one instance. SGMLParser can be subclassed to build an HTML parser that suits our needs. Let's say you call your subclass DocumentProcessor.

In particular, you can define methods that are called when a particular tag is encountered. For outward links, we'll be particularly interested in 'a' (or anchor) tags, so you'll want to override start_a. Similarly, for the title, h1, and h2 text, you'll want to override start_title, start_h1, and start_h2.

Dive Into Python has a really nice set of examples that will help you see how to do this.

To extract the text, you'll want the DocumentProcessor to override SGMLParser's handle_data method. You'll also want to filter the text that's in the page: much of it is not very helpful for determining what a page is about.
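Here's a minimal sketch of what a DocumentProcessor might look like, assuming Python 2's sgmllib module. The member names (links, titleWords, and so on) are just placeholders, and the filtering described below still needs to be applied to the collected words.

from sgmllib import SGMLParser

class DocumentProcessor(SGMLParser):
    """Collects hrefs, title/header text, and body text from an HTML page."""

    def reset(self):
        SGMLParser.reset(self)
        self.links = []        # raw href values from 'a' tags
        self.titleWords = []   # text appearing inside the title tag
        self.h1Words = []
        self.h2Words = []
        self.words = []        # all text on the page
        self.context = None    # which special tag we're inside, if any

    def start_a(self, attributes):
        # attributes is a list of (name, value) pairs
        for name, value in attributes:
            if name == 'href':
                self.links.append(value)

    def start_title(self, attributes):
        self.context = 'title'

    def end_title(self):
        self.context = None

    def start_h1(self, attributes):
        self.context = 'h1'

    def end_h1(self):
        self.context = None

    def start_h2(self, attributes):
        self.context = 'h2'

    def end_h2(self):
        self.context = None

    def handle_data(self, text):
        words = text.split()
        if self.context == 'title':
            self.titleWords.extend(words)
        elif self.context == 'h1':
            self.h1Words.extend(words)
        elif self.context == 'h2':
            self.h2Words.extend(words)
        self.words.extend(words)

You would then feed() the page's HTML to an instance of this class, close() it, and read the results out of its data members.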

To begin with, filter out any words that contain a non-alphabetic character. This includes dates, numbers, and markup elements. You may find the isalpha() string method to be of use here.

Next, we'll remove stop words. These are words such as a, an, he, she, and the that have little or no information in them. You can find a list of stop words here.
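A minimal sketch of the filtering step (filterWords and stopWords are made-up names; stopWords is assumed to be a set built from the stop word list):

def filterWords(words, stopWords):
    """Keep only purely alphabetic words that aren't stop words."""
    return [w for w in words if w.isalpha() and w.lower() not in stopWords]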

So, your Document class should use the DocumentProcessor to extract the URLs from a document and store them in the outwardLinks variable. It should take all text contained within the title, h1, and h2 tags and store it in the corresponding variables of the Document class. It should also take the stream of text in the page from the DocumentProcessor, split it into words, and remove anything that either contains a non-alphabetic character or is a stop word.

One wrinkle that you'll need to deal with: many web pages contain hrefs that are relative URLs; that is, they don't begin with "http://" and a hostname. These refer to other documents on the same server. You'll need to catch these and add the 'http://' and hostname back on.
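One way to do this is with the urljoin function in Python's urlparse module, which resolves a relative href against the URL of the page it appeared on (the relative link below is made up for illustration):

import urlparse

# Resolve a relative href against the URL of the page that contained it.
absolute = urlparse.urljoin('http://www.usfca.edu/online/about_USF/', '../admissions/')
# absolute is now 'http://www.usfca.edu/online/admissions/'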

Your program is only required to deal with plain text and HTML. You may discard URLs referring to all other document types.

Some hints:

The FocusedCrawler class (25 points)

The next step is to build an agent, which we'll call a FocusedCrawler, that can collect a set of pages for us.
FocusedCrawler should have (at least) the following data members: the root URL to begin crawling from, a scorer object used to assign priorities to pages, and an ignoreInternal flag (described below).
FocusedCrawler should have a crawl() method that takes two arguments, both with default values: nresults, which indicates how many web pages we want the crawler to find, and threshold, which indicates 'how similar' a web page must be for the crawler to keep it.
The crawler should use the heapq module to build a priority queue. It should begin by creating a Document object using the root URL. It should then insert (score, URL) tuples representing each of the outward links into the priority queue.

As a (score, URL) tuple is dequeued, you should use the URL to create a new Document object. If the score of that Document is above the user-provided threshold, then save the document. Extract all outward links from the Document and enqueue them as (score, URL) tuples, where score is the score of the document that contained the outward link.
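A sketch of those queue mechanics, using made-up names like frontier, results, and scorer, and omitting the robots.txt check and closed dictionary described below. Note that heapq pops the smallest score first; adapting it to behave like a max-heap is discussed below and left to you.

import heapq

frontier = []                  # the priority queue of (score, URL) tuples
results = []                   # Documents that scored above the threshold

root = Document(rootUrl)       # rootUrl is whatever the user supplied
rootScore = scorer.score(root)
for link in root.outwardLinks:
    heapq.heappush(frontier, (rootScore, link))

while frontier and len(results) < nresults:
    score, url = heapq.heappop(frontier)    # heappop returns the *smallest* score
    doc = Document(url)
    docScore = scorer.score(doc)
    if docScore > threshold:
        results.append(doc)
    for link in doc.outwardLinks:
        heapq.heappush(frontier, (docScore, link))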

One thing we need to be aware of when crawling the web is not overloading remote servers. In order to help direct crawlers and spiders, many servers will use a file known as robots.txt to indicate which directories a crawler is allowed to visit.
Python has a module called robotparser that automates much of this process. You can use it to fetch a robots.txt file (if one exists) and then use the can_fetch method to determine whether a URL is 'safe' to fetch.

Your crawler should use the robotparser to check for robots.txt permissions before visiting any sites.
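A sketch of how that check might look; allowed() and robotCache are made-up names, and caching one RobotFileParser per host keeps you from re-fetching robots.txt for every URL:

import robotparser
import urlparse

robotCache = {}    # maps 'scheme://hostname' to a RobotFileParser

def allowed(url):
    """Return True if robots.txt permits us to fetch this URL."""
    parts = urlparse.urlparse(url)
    base = parts[0] + '://' + parts[1]
    if base not in robotCache:
        rp = robotparser.RobotFileParser()
        rp.set_url(base + '/robots.txt')
        rp.read()
        robotCache[base] = rp
    return robotCache[base].can_fetch('*', url)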

We'll also want to make sure that we don't revisit pages that we've seen before by maintaining a closed list. For this project we'll keep it simple and just maintain a 'closed dictionary' of URLs that have been previously visited. Before creating a Document, check your closed dictionary to see if the URL has already been visited. If so, ignore it. If not, visit it and add it to the closed dictionary.

Once you've found enough pages with a score higher than the threshold, you may terminate the crawl and return all results.

If ignoreInternal is True, then your FocusedCrawler should ignore internal links, which are links to other pages whose URL has the same hostname. For example, if we were crawling www.cs.usfca.edu with ignoreInternal set to True, we would ignore links to www.cs.usfca.edu/~brooks (but not to www.usfca.edu, since that's a different hostname).
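A small sketch of the hostname comparison, using the urlparse module (sameHost is a made-up helper name):

import urlparse

def sameHost(url1, url2):
    """True if both URLs point at the same hostname."""
    return urlparse.urlparse(url1)[1] == urlparse.urlparse(url2)[1]

# In the crawl loop, when ignoreInternal is True:
#     if ignoreInternal and sameHost(doc.url, link): skip the link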

But how to score a page?

That will be the focus of parts 3 and 4. For the moment, you can assign pages a random score between 0.0 and 1.0 (use the random module for this). For this part, focus on getting the search algorithm working correctly.

One issue: the heapq module implements a min-heap; the smallest object is always at the front. But we need a max-heap; the highest priority page is at the front. I'll let you figure out how to solve this.

Query-driven search (15 points)

Our initial stab at finding web pages that match a user's interests will be a keyword query. This is a set of keywords (or disjunctions of keywords) that should appear in documents. I'm sure you've used this sort of interface; it's what almost all search engines use. In general, keyword queries are great if we know specific words that appear in documents of interest.

We will represent a boolean query as a string of the following form:

"word1 word2 (word3 word4 word5) word6 word7 (word8 word9) ..."

Where words inside parentheses are treated as ORed together, and all other words are ANDed. So, if we have the query

"cat dog bird (bunny snake) fish (horse lion)"

we would like to find documents that contain the words cat, dog, bird and fish, either bunny or snake (or both), and either horse or lion (or both).

To score a document, count the number of ANDed words that are contained within the document, along with the number of ORed clauses that have any word in the document. To normalize this, we then divide by the length of the query, which is the number of ANDed terms plus the number of ORed clauses.

For example, if we used the above query on a document containing the words 'cat cat bunny snake horse fish', we would have a score of 4/6. Notice that multiple occurrences of a word count as a single match. Your scorer should always return a value between 0 and 1.
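Here's one possible sketch of that scoring logic; it assumes the Document's word list lives in a text attribute (as in part 1) and leans on the re module mentioned in the hints below.

import re

class KeywordQueryScorer:
    def __init__(self, query):
        # OR clauses are the parenthesized groups; AND words are what's left.
        self.orClauses = [group.split() for group in re.findall(r'\(([^)]*)\)', query)]
        self.andWords = re.sub(r'\([^)]*\)', ' ', query).split()

    def score(self, page):
        words = set(page.text)    # multiple occurrences count once
        matches = len([w for w in self.andWords if w in words])
        matches += len([c for c in self.orClauses if [w for w in c if w in words]])
        return float(matches) / (len(self.andWords) + len(self.orClauses))

For the query and document in the example above, this returns 4/6.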

Your crawler should use the KeywordQueryScorer to assign a priority to a web page. All the outward links (references to other pages) on this page receive this priority when they are inserted into the priority queue. (Remember, highest priority at the front!) KeywordQueryScorer should have a score() method that takes a page object as input and returns the score defined above.

Some hints:

The re (regular expression) module is very useful for parsing queries.


Answer this part!

You should now be able to feed your crawler a root URL and a search query and let it try to find documents that match the query above the threshold that you provide. Try several queries, some with a small number of terms and some with a larger number. Try both common and rare words. Look at the results that your crawler returns. Experiment with different thresholds as well (0.75 is a good starting point). What threshold seems to find documents that fit your information needs (we'll call that precision) without missing too many useful documents (we'll call that recall)?

Write a paragraph characterizing the performance of your crawler at this point. I'm particularly interested in your explanation of why you think you are seeing the behavior you are seeing. Be specific; for example, how accurate do you think your crawler is? Does it find documents that seem to match what you were looking for? How many documents does it look at in order to find enough matches? If it performs well, explain why that is. If you feel that the performance is not as good as it could be, explain why as well.

Topic-driven search (25 points)

Our KeywordQueryScorer has a few problems. For one thing, it doesn't take word frequency into account. If I do a keyword search with the query 'aardvark blue', documents containing one of these words are all scored equally, even though aardvark is a much less common word than blue.

Also, the KeywordQueryScorer provides a particular mode of interaction between the user and the agent. For some tasks, using keywords may work fine, but in other cases, they may be cumbersome for users. For example, users often have a difficult time building complex queries: the average Google query is 2 or 3 words long. Also, a user may have a hard time coming up with keywords that specifically describe her interests. Instead, she might want to say "find more documents like these." To address this, we'll provide the FocusedCrawler with a set of pages (which we'll call a similarity set) that represents the topics the user is interested in. The FocusedCrawler will use the similarity set to find other pages that are similar.

For this part, you'll implement a class called TFIDFScorer. It should have two methods, __init__() and score(). (You may find it helpful, or at least aesthetically appealing, to make an abstract DocumentScorer class and derive both scorers from it.)

Term Frequency

__init__() should read a list of URLs in from a file and construct the similarity set. To do this, create a Document object for each URL, and then pass the text from each of these pages into a frequency counter. This frequency counter should be implemented as a dictionary that maps words to their number of occurrences across all documents in the similarity set. We will call this dictionary the term frequencies. Our intuition will be that words that occur frequently in the similarity set are useful words to look for.
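A sketch of the term-frequency counter (termFrequencies is a made-up helper; similarityDocs would be the Document objects built from the URL file):

def termFrequencies(similarityDocs):
    """Map each word to its total number of occurrences in the similarity set."""
    tf = {}
    for doc in similarityDocs:
        for word in doc.text:
            tf[word] = tf.get(word, 0) + 1
    return tf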

But we don't want just any frequently-occurring word; we want words that occur frequently in the similarity set, but less frequently in general. For example, 'aardvark' appearing 10 times in the similarity set seems more telling than 'date' appearing 10 times.

Document Frequency

To address this, we'll count how frequently words occur in a random set of documents. We'll refer to the frequency of words in documents in general as the document frequency.

To compute the document frequency, select 1000 randomly-chosen web pages. We will refer to this body of text as a corpus. It's meant to be a statistically significant estimation of the actual frequencies of words in English.

There are 1000 pages in /home/public/cs662/corpus that you may use for this. Alternatively, you may build your own. There are several engines on the web that will return random web pages to you. (Put 'random URL' into Google to find some.) I used http://random.yahoo.com/fast/ryl.

For each word in all of the pages (after removing stop words, tags, and non-alphabetic words, as above), count the number of documents it appears in. This is the document frequency of the word. For example, if 'cat' occurs in 3 documents, it has a document frequency of 3.
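A sketch of the document-frequency count (documentFrequencies is a made-up helper; corpusDocs would be Document objects built from the corpus pages):

def documentFrequencies(corpusDocs):
    """Map each word to the number of corpus documents it appears in."""
    df = {}
    for doc in corpusDocs:
        for word in set(doc.text):    # each document counts a word at most once
            df[word] = df.get(word, 0) + 1
    return df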

TFIDF

Term frequency and document frequency allow us to assign a score to each word in the similarity set indicating its relative strength in characterizing the documents in the set. We compute this score using a formula known as TFIDF (term frequency * inverse document frequency).

Each word in the similarity set should be assigned the following score:

termFrequency(word) * log(sizeOfCorpus / documentFrequency(word))

(If a word appears in the similarity set but not in the corpus, you should use 1 for its document frequency)
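As a sketch, the formula above translates directly into a small function (the argument names here are just placeholders):

from math import log

def tfidf(word, termFreq, docFreq, corpusSize):
    """TFIDF weight for one word in the similarity set."""
    df = docFreq.get(word, 1)    # words missing from the corpus get a DF of 1
    return termFreq[word] * log(float(corpusSize) / df)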

Documents as Vectors

Why are we doing this? It allows us to construct a model of a text document as an n-dimensional vector, where n is the number of words in the corpus and the value of each element in the vector is the corresponding word's TFIDF score with respect to that document.

We can now construct a vector that represents the TFIDF scores of all words in the similarity set.

Let's add one more wrinkle. Back in part 1, we collected all of the words in the title and the h1 and h2 headers. Presumably those words are particularly important in classifying the document. To account for this, add a multiplier of 2 to the TFIDF score of each word in an h2, 3 for each word in an h1, and 5 for each word in the title.
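One way to sketch the multiplier, assuming the Document stores its title, h1, and h2 words in attributes with those names (your names may differ); if a word appears in more than one of them, this sketch simply uses the largest multiplier:

def headerMultiplier(word, doc):
    """Boost for words that appear in the page's title or headers."""
    if word in doc.title:
        return 5
    if word in doc.h1:
        return 3
    if word in doc.h2:
        return 2
    return 1

# weight of a word = headerMultiplier(word, doc) * its TFIDF score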

Scoring a page

Your TFIDFScorer should have a method called score() that takes a Document as input. This is the page that our crawler is exploring; we'll call it the query page.

Your scorer should construct a vector representing the query page in the same way the vector was constructed for the similarity set: remove stop words etc., count term frequencies for each word, and compute TFIDF using the document frequencies from the corpus.

At this point, we have two vectors to compare. To do this, we will measure the angle between the vectors. Documents that are identical will have identical vectors, and an angle of 0. Documents that are completely dissimilar will have an angle of 90 degrees (or pi/2 radians). In fact, rather than measuring the angle itself, we'll measure the cosine of the angle. This is easier to compute, and also has the advantage of being in the range [0, 1], like our other scorer. An angle of 0 degrees has a cosine of 1, and an angle of 90 degrees has a cosine of 0. (Higher cosine means more similar.)

To compute the cosine of the angle between the two vectors, use the following formula:

cosine = sum(queryWeight(word) * similarityWeight(word)) / (sqrt(sum(queryWeight(word)^2)) * sqrt(sum(similarityWeight(word)^2)))

In words, the numerator tells us how similar the documents are. For each word in the query page's vector, multiply its weight by its weight in the similarity set's vector. (If a word is absent from either vector, its weight is 0.) Sum up all of these products.

The denominator normalizes this score by dividing by the lengths of the two vectors (this is just the Pythagorean rule). For each word in the similarity vector, square its weight; add up these squares and take the square root of the sum. Do the same for the query vector. The product of those two square roots is the denominator.

This cosine is the page's score. You're done! Your TFIDFScorer should return this so that your FocusedCrawler can use it to enqueue pages.
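A sketch of the cosine computation, assuming both vectors are dictionaries mapping words to TFIDF weights (and taking advantage of the list comprehensions suggested in the hints below):

from math import sqrt

def cosine(queryVec, similarityVec):
    """Cosine of the angle between two TFIDF vectors stored as dictionaries."""
    numerator = sum([weight * similarityVec.get(word, 0)
                     for word, weight in queryVec.items()])
    queryLength = sqrt(sum([w * w for w in queryVec.values()]))
    similarityLength = sqrt(sum([w * w for w in similarityVec.values()]))
    if queryLength == 0 or similarityLength == 0:
        return 0.0
    return numerator / (queryLength * similarityLength)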

Hints:

List comprehensions are your friend. They make it very easy to (for example) compute a list of weights for a long list of words.

Vectors should be stored in dictionaries, as should frequency counts.

You only need to compute the document frequencies once. Don't do it every time you run your crawler (it takes a while). Compute them once and write the object out to a file. Next time you run your crawler, read it back in. To do this, use the pickle module. dump() and load() are the methods you'll be most interested in. (Just read it in once each time you run the crawler)
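A minimal sketch of the pickling step (the filename docfreq.pkl is just an example):

import pickle

# After computing the document frequencies the first time:
f = open('docfreq.pkl', 'w')
pickle.dump(docFreq, f)
f.close()

# On later runs, just read them back in:
f = open('docfreq.pkl')
docFreq = pickle.load(f)
f.close()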

sqrt and log are in the math module, which must be imported. reduce() is a very nice way to apply a function to a list of numbers and get a scalar back. It works very nicely in conjunction with list comprehensions. For example:

>>> l = [1,2,3,4,5]
>>> def add(x,y) :
... 	return x + y
... 
>>> reduce(add, l)
15
If all you need to do is add up the elements of a list, you can also use sum().

(for those of you who've used Lisp or Scheme, Python also has a lambda operator, which lets you avoid defining add separately. If you have no idea what lambda is, don't worry about it.)

If you want to know more about TFIDF, and vector models for comparing documents in general, you may be interested in the following documents:

(these are not required reading; I'm just providing them in case you want more information)


Answer this part!

Now run your crawler with the new TFIDFScorer. Build a similarity set of at least 10 documents to characterize a user's interests. The more specific they are, the better: a page about the rules of baseball is better than the front page of cnn.com. Use a threshold of 0.75 as a metric for 'similar'. Look at the documents that your scorer finds. Would you consider them acceptably similar?

Write a paragraph comparing the performance of your crawler using the TFIDFScorer to the crawler with the KeywordQueryScorer. Be specific; for example, how many matches does each scorer find? What sorts of documents do they do well on? What do they do poorly on? What sorts of problems do both scorers face?


Extra credit

Below are four possible tasks that you can complete for extra credit. Each of these will require that you learn or explore some topic beyond what's covered in the project. They're intended to encourage you to learn more on your own about topics that interest you.

Extra Credit Rules:

1. 5-10 points. Search efficiency. There are a number of practical things that can be done to speed up our crawler. For example, much of the execution time is spent waiting for a response from remote web servers. The obvious solution is to fetch several pages at the same time.

There are two approaches to doing this. One is to use multiple threads, and the other is to use select(). Extend the FocusedCrawler to use either of these approaches for 5 pts each.

2. 5-10 points. Add a graphical front end to your client. The GUI should have a way for the user to enter search queries and also specify similar documents. It should also provide hypertext links for each of the web pages returned. You may also provide additional features, such as visualization of the search, integration with Firefox, etc. Be creative! This is worth 5-10 points, depending on the functionality, usability, and features provided by your GUI.

There are a number of different graphical toolkits available for Python.

3. 5 points. Corpus construction. The accuracy of TFIDF depends on the quality of the corpus that's used to construct document frequencies. By customizing the corpus, we can hopefully improve performance when searching in particular domains. For example, if we know that all of our users are interested in Java, we can collect a corpus of Java pages; this will provide a better model of word usage for our task.

Construct a corpus for a domain of your choosing. Experiment with the size of the corpus. Test your focused crawler with these corpora and evaluate the results. In what ways does the crawler perform better or worse? What seems to be the best size for the corpus?

4. 1-5 points. Extend your crawler to deal with additional file formats, such as PDF, Word, or PowerPoint. 1 point per format, max 5 points.