Personal Web Neighborhood
Project Outline
Purpose
1. Build a system that allows a user to easily create a personal web
neighborhood, that is, a search area consisting of web pages near the
documents and bookmarks in her personal storage area.
2. Allow such neighborhoods to be shared amongst individuals.
3. Perform tests to evaluate the usefulness of a personal web neighborhood.
Issues
We need formal definitions of personal web and personal web
neighborhood.
The WebTop system analyzes a user's hard disk to create a personal web. One
way to implement the system would be to extend WebTop so that it crawls the web
using the documents in the personal web as seeds. Note that WebTop's HTML
document class already contains code to open a web document and extract its
terms and links, and this can be reused. What is missing is code to drive the
crawl, and timeouts and similar failures do not appear to be handled very well.
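As a starting point for the missing crawl driver, here is a minimal sketch in Python. It is not WebTop code: `crawl`, `fetch_links`, `max_depth`, and `max_pages` are hypothetical names chosen for illustration. `fetch_links` stands in for whatever opens a page and returns its outgoing links (the role the existing HTML document class would play), and is assumed to raise `OSError` on a network failure or timeout.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          max_depth: int = 2,
          max_pages: int = 1000) -> Dict[str, int]:
    """Breadth-first crawl outward from the seed documents.

    Returns {url: depth} for every page that was fetched successfully,
    where depth is the number of links between the page and a seed.
    """
    depth_of: Dict[str, int] = {}   # every URL ever queued, with its depth
    queue = deque()
    for url in seeds:
        if url not in depth_of:
            depth_of[url] = 0
            queue.append(url)

    visited: Dict[str, int] = {}    # URLs fetched without error
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        depth = depth_of[url]
        try:
            links = fetch_links(url)
        except OSError:
            # Covers TimeoutError and urllib's URLError: skip the page
            # instead of letting one bad link stop the whole crawl.
            continue
        visited[url] = depth
        if depth == max_depth:
            continue                # edge of the neighborhood: do not expand
        for link in links:
            if link not in depth_of:
                depth_of[link] = depth + 1
                queue.append(link)
    return visited
```

A real `fetch_links` could pass a `timeout` argument to `urllib.request.urlopen` so that a slow server raises an error rather than hanging. `max_depth` is one possible answer to the "how far to crawl" question below.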
How far do we crawl? Do we follow both inward and outward links?
Do we perform a focused crawl where we follow links that fit some formula (e.g., follow links that contain terms similar to those in the personal web)?
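One candidate for such a formula is cosine similarity between the terms around a link and the terms in the personal web. The sketch below is only an illustration of that idea, not a chosen design: the names `terms`, `cosine`, and `should_follow` and the threshold of 0.1 are all assumptions.

```python
import math
import re
from collections import Counter


def terms(text: str) -> Counter:
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def should_follow(link_context: str, profile: Counter,
                  threshold: float = 0.1) -> bool:
    """Follow a link only if its anchor text or surrounding text
    is similar enough to the personal web's term profile."""
    return cosine(terms(link_context), profile) >= threshold
```

Here `profile` would be built once from the terms of the personal web documents, and `should_follow` would be checked before a link is added to the crawl queue. The threshold would need tuning against real data.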
Do we seed from all personal web documents, or only from those that are used more often (e.g., to let the crawler continue the user's work while the user is sleeping)?
Does the crawler archive documents, or just extract keywords to build an inverted index that allows the document to be found by a search?
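To make the second option concrete, here is a minimal sketch of an inverted index that stores only terms and URLs, not the documents themselves. The class name `InvertedIndex` and its methods are hypothetical, and the tokenization is deliberately naive (no stemming or stop words).

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercased alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of URLs whose text contains it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, url: str, text: str) -> None:
        """Index a crawled page; the page text itself is not kept."""
        for term in set(tokenize(text)):
            self.postings[term].add(url)

    def search(self, query: str) -> Set[str]:
        """Return the URLs that contain every term in the query."""
        words = tokenize(query)
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result
```

The crawler would call `add` for each page it fetches. Because only the URL is stored, a search hit still depends on the page being reachable later, which is the trade-off against archiving.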
For evaluation, we need to test how useful searching the personal web neighborhood is. Do we compare it to a Google search?
Related Work
Accelerated focused crawling through online relevance feedback
S Chakrabarti, K Punera, M Subramanyam
Personalized and Focused Web Spiders
M Chau, H Chen
Sphinx, below, comes with Java code that you can download...
SPHINX: a framework for creating personal, site-specific web crawlers
RC Miller, K Bharat
Also see the Chronica work by USF students (Stevens, Endo, Fraschetti, Dennis)
and the Internet Archive's Heritrix project.