Slashpack: An integrated tool for collecting, processing,
indexing, clustering and classifying hypertextual collections.
Slashpack (which is short for Semi-large scale hypertext package) is
meant to be an integrated, all-in-one, one-stop-shopping tool for the
collection, storage, analysis and retrieval of collections of
hypertext documents. The goal is to make this toolkit as general,
extensible and flexible as possible, so as to allow it to be used for
a variety of tasks. The tool will consist of six basic components: a
collector, a processor, a storage and indexing system, a clustering
engine, a suite of classifiers, and visualization component. Each of
these pieces will be described in further detail below.
The tool will initially be written in Python to take advantage of
Python's built-in libraries and facility for rapid application
development. As we enter the optimization and performance analysis
phase, portions of the tool may need to be rewritten in C/C++ and
linked in using SWIG.
There are a number of other projects (some open-source, some not) that
provide some of this functionality. We should strive not to ignore
them, but to learn from both their successes and their failures. These
include:
- Heretrix, an
open-source web crawler written in Java.
-
Mercator A crawler built by Compaq, now sold commercially by
AltaVista.
- Lucene, A
java-based engine for indexing and retrieval of documents.
- Harvest Harvest is
an open-source system that can collect and index hypertext documents.
- Lemur is a toolkit
for performing language modeling, summarization, classification and
clustering on a corpus of documents, written in C/C++.
- Weka is a
Java-based toolkit for machine learning, data mining, clustering and
text classification.
- There are many others as well ...
Given all of these other toolkits, why build another?
First off, because we can, and it will be an interesting and
educational project.
More importantly, Slashpack is intended to serve not just as a
research tool in its own right, but also as a foundation or specific
applications we're interested in. This means that it will be very
helpful for us to have direct control over the contents and
functionality of Slashpack.
Currently, there are two potential projects that plan to use
Slashpack. they are:
- Analysis of the effectiveness of tags and auto-tagging mechanisms
in blogs and social bookmarking systems. This is thesis work currently
being undertaken by Nancy
Montanez.
- Peer-to-peer personal information sharing. Slashpack will also be
used to develop a peer-to-peer tool sharing personal webs,
matchmaking, and community formation. This will serve as a
next-generation version of the personal web ideas originally developed
in Webtop, with Slashpack
providing the infrastructure needed for user and community modeling.
Design principles
We would like for Slashback to be a real-world, ongoing project that
can both be of service to the research community and also serve as a
source of student projects. Consequently, all components should be
developed with the following software engineering principles in mind:
- Maintainability
- Ease of use
- Extensibility
- Efficiency
Slashpack components
This section will discuss each of the components of Slashpack in more
detail, flesh out some of the functionality that's required, and
provide links to related reading. As the project is in its early
stages, this is not meant to be comprehensive or exhaustive, but to
give a flavor of what each component should provide.
Collector
The collector will take as input a description of a set of documents
to be gathered. Based on this description, it will gather all possible
documents, robustly handling errors, restarts, and failures.
This 'description' may be quite general. Some possible descriptions
include: a tarball or set of tarballs, the root index of a site to be
collected, a set of URLs from which a general web crawl is to be
performed, or a set of keywords to be forwarded on to search engines
such as Google, Yahoo!, and AltaVista. A user may also want to specify
the number of documents to collect, a restriction on document sizes
(only get documents less than 4K, only get the first 1000 words, etc),
or a specification on document types (only HTML, only PDF, etc).
A robust, large-scale collector will require the following features:
- The ability to stop and restart a collection. This will require
the collector to periodically save both documents collected so far and
the state of its collection. (Pages previously visited, number of
pages collected, etc.)
- The ability to recover gracefully from network errors. These
might include network failure, the failure of a remote http server,
inaccurate or incomplete http headers, servers that don't follow http
properly, or other transient failures. In general, the collector
should be able to recognize that a particular document cannot be
accessed and move on accordingly.
- the ability to collect documents quickly and efficiently. If
collections are being gathered from a number of sources, the collector
should seek to maximize the documents collected per second. This will
require:
- Careful handling of DNS, possibly including DNS prefetching and
resolution on the client side.
- The fetching of multiple documents simultaneously, via the use of
wither multiple threads or of select.
- Detection of duplicate pages. The simplest mechanism for this is
computation of an MD5 sum of the page, but this will not catch
"nearly-identical" pages.
Shingling is a potential technique for identifying
near-duplicates.
- Proper handling of robots.txt.
- Recognition of `spider traps', or pages that (intentionally or
not) contain an infinitely-deep set of links designed to trap
crawlers.
It will, of course, also need to be integrated with the processor and
the storage and indexing system.
Some reading:
Processor
The processor will be a set of modular components that can be 'hooked
together' to allow a use to specify the pieces of a document that
should be retained and the pieces that should be discarded. For
example, a user should be able to say "keep all outward links, nouns,
and dates" or "remove all tags and suffixes, and replace common
synonyms". Each of the components (or filters ) should be able
to operate independently of the others; they should take as input a
document (most likely tokenized by a prebuilt lexer) and produce as
output a new document (possibly also as a sequence of tokens).
Some filters the processor should have include:
- Tag stripper. Remove all html tags specified by the
user. Removing all tags is a special case of this.
- Proper Name filter. Extract proper names.
- Date Filter. Extract dates.
- Stemming, using the Porter stemming
algorithm.
- removal of stopwords.
- Part-of-speech filter. Extract (for example) nouns, verbs, or
adjectives. This will require the use of WordNet to identify possible
parts of speech for words, along with a chart parser to help resolve
ambiguity.
- Language filter. remove non-English words. A first cut of this
can be done with WordNet; more accurate results may require the
construction and training of a classifier.
- Synonym replacement. Allow the user to provide a thesaurus that
gives canonical representations for common words (e.g. cat, kitten,
kitty all map to 'cat'). We should provide some useful thesauri, for
example for dealing with computer jargon ('mac', 'macintosh' map to
'Mac', 'Win', 'WindowsXP', 'Windows' all map to Windows.). Some of
these can be constructed by hand, and some can be learned via Latent Semantic Analysis
- Frequency weighting of terms. This will most likely include
variants of TFIDF.
Some reading:
Storage and Indexing
The storage and indexing component will be responsible for cataloging
pages that have been filtered and processed, and also for updating and
maintaining metadata about these pages.
The storage and indexing component will need to be built with two
concerns in mind: space-efficient storage and quick, flexible
retrieval. Unfortunately, these concerns are often in opposition: for
example, an easy way to conserve space is to compress stored pages (as
the Internet Archive does), but this adds a delay to retrieval, as the
page must be uncompressed before it can be delivered to the user.
Finding the appropriate balance between these issues will be an
ongoing problem.
The storage and indexing component should ideally provide the
following features:
- Robust storage of collected and processed pages. This may include
RAID-style replication of data, or the use of checksums to allow
corrupted documents to be reconstructed.
- Efficient use of space, along with fast disk-level
retrieval. This may necessitate the use of something along the lines
of the 'ARC' files used at the Internet Archive, which maximize disk
space usage (files are the same size as disk blocks) and provide
spatial locality.
- A rich search and querying mechanism. Users should be able to
locate documents that satisfy their information needs according to a
rich, heterogeneous set of criteria. These may include:
- Date collected
- URL or domain
- keywords
- regular expressions
- PageRank
- inward links
- topic (see classification below)
Facilitating these sorts of search will require the maintenance of
auxiliary data structures to store the metadata about each document,
including an inverted index, along with a secondary structure (perhaps
a relational database) that contains relevant metadata about each
component.
- The ability to export a collection into a single package, for
example for distribution as a canonical testbed.
Some reading:
Clustering engine
Clustering is the process of grouping documents together according to
some criteria (usually topic similarity) without any prior labeling of
the documents or (often) decision of what the categories are that will
describe documents. For this reason, it is usually described as a form
of unsupervised or semi-supervised learning.
The first thing that clustering requires is a metric for assessing the
similarity of two or more documents. The most commonly-used metric is
cosine similarity, but we should provide the user with the ability to
specify other metrics, as well as provide their own.
There are several well-known clustering algorithms that can be
implemented, including:
- Self-organizing maps. This is a neural approach that is used to
discover patterns in high-dimensional data. Multidimensional scaling
is a similar statistical technique.
- Agglomerative clustering is an example of an unsupervised
learning approach known as k-means clustering.
- Hierarchical clustering is an extension of agglomerative
clustering that seems to produce more accurate results.
- Expectation Maximization is a semi-supervised learning technique
that assigns probabilities to categories for documents. It can be
combined with k-means to simultaneously learn topics and groupings.
Some readings:
Classifier Suite
Classifiers are machine learning algorithms that are able to assign
documents to one or more categories based on some feature of the
document. (for example, spam/not-spam, interesting/not interesting,
topic, etc). The primary difference between classifiers and clustering
algorithms is that classifiers are supervised learning
algorithms. That is, the possible categories or classifications are
known in advance, and the algorithm is provided a set of labeled
training data (documents that have already been categorized).
The slashpack classifier suite should provide two things for users: 1)
a variety of tunable classification algorithms and 2) a means to
easily and flexibly specify learning regimens, including training and
test sets, validation, and desired recall and precision.
Classification algorithms to be implemented include:
- Naive Bayes, including the optimizations suggested in this
paper.
- Support Vector Machines. These are currently considered the state
of the art in algorithms for text classification. Here's a nice collection
of resources on SVMs.
SVMlight is one of the most well-known SVM packages. Here
is the original paper describing text classification with SVMs.
- Decision trees. Decision trees have also been a popular technique
for text classification, primarily due to the availability of
high-performance DT software, such as C4.5 and Weka.
- Rule induction. Users may want to take advantage of the fact that
their data is actually hypertext, as opposed to just a bag of
words. One technique for doing this is through the induction of
if-then rules that examine the presence and/or values assigned to tags
or combinations of tags. FOIL and
RIPPER are commonly-used packages for learning rules. Here
is a paper describing the application of rule learning to text
classification (spam, in this case).
- Relevance feedback (the Rocchio algorithm is an example) is the
process of asking the user to provide feedback as to the quality of
the classifications done so far, and using this additional information
to improve future classifications. See the paper above for more details.
A related issue to text classification is feature selection. The
performance of text classifiers can vary drastically depending on the
features that are used. there are lots of approaches to this,
including IR methods such as TFIDF, use of anchor tags and other
hypertext markup ( here
is a paper describing the use of anchor tags to improve
classification performance), and statistical methods such as the
chi-square test and mutual information.
Some readings:
See the papers linked above, plus:
User interface and visualization
Slashpack will also require a user interface (preferably web-driven)
that allows a user to easily set up an experiment from start to
finish. Even more importantly, this user interface will need to
automatically process and filter the results of experiments, construct
graphs and visualizations that summarize the experiments, and export
those graphs to a common, easily used format (most likely Encapsulated
Postscript).
the Heretrix interface may be a useful starting point, although
something more user-friendly may eventually be needed. for
automatically constructing graphs, I've found Ploticus to be very
useful.