Slashpack: An integrated tool for collecting, processing, indexing, clustering and classifying hypertextual collections.

Slashpack (which is short for Semi-large scale hypertext package) is meant to be an integrated, all-in-one, one-stop-shopping tool for the collection, storage, analysis and retrieval of collections of hypertext documents. The goal is to make this toolkit as general, extensible and flexible as possible, so as to allow it to be used for a variety of tasks. The tool will consist of six basic components: a collector, a processor, a storage and indexing system, a clustering engine, a suite of classifiers, and a visualization component. Each of these pieces is described in further detail below.

The tool will initially be written in Python to take advantage of Python's built-in libraries and facility for rapid application development. As we enter the optimization and performance analysis phase, portions of the tool may need to be rewritten in C/C++ and linked in using SWIG.

There are a number of other projects (some open-source, some not) that provide some of this functionality. We should strive not to ignore them, but to learn from both their successes and their failures. These include:

Given all of these other toolkits, why build another? First off, because we can, and it will be an interesting and educational project.

More importantly, Slashpack is intended to serve not just as a research tool in its own right, but also as a foundation for specific applications we're interested in. This means that it will be very helpful for us to have direct control over the contents and functionality of Slashpack.

Currently, there are two potential projects that plan to use Slashpack. They are:

Design principles

We would like for Slashpack to be a real-world, ongoing project that can both be of service to the research community and also serve as a source of student projects. Consequently, all components should be developed with the following software engineering principles in mind:

Slashpack components

This section will discuss each of the components of Slashpack in more detail, flesh out some of the functionality that's required, and provide links to related reading. As the project is in its early stages, this is not meant to be comprehensive or exhaustive, but to give a flavor of what each component should provide.

Collector

The collector will take as input a description of a set of documents to be gathered. Based on this description, it will gather all possible documents, robustly handling errors, restarts, and failures.

This 'description' may be quite general. Some possible descriptions include: a tarball or set of tarballs, the root index of a site to be collected, a set of URLs from which a general web crawl is to be performed, or a set of keywords to be forwarded on to search engines such as Google, Yahoo!, and AltaVista. A user may also want to specify the number of documents to collect, a restriction on document sizes (only get documents less than 4K, only get the first 1000 words, etc), or a specification on document types (only HTML, only PDF, etc).
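As a rough sketch of what such a description might look like in code (all class and field names here are hypothetical, not a committed interface), a collection job could be captured as a small spec object that the collector interprets:

    class CollectionSpec:
        """Hypothetical description of a collection job; field names are illustrative only."""
        def __init__(self, seeds, max_docs=None, max_bytes=None,
                     allowed_types=None, query_terms=None):
            self.seeds = seeds                    # root URLs, site indexes, or tarball paths
            self.max_docs = max_docs              # stop after this many documents
            self.max_bytes = max_bytes            # skip documents larger than this (e.g. 4096)
            self.allowed_types = allowed_types or ["text/html"]  # restrict by document type
            self.query_terms = query_terms        # keywords to forward to search engines

    # e.g. "crawl from this root, collecting at most 1000 HTML documents under 4K each"
    spec = CollectionSpec(seeds=["http://example.org/"], max_docs=1000, max_bytes=4096)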

A robust, large-scale collector will require the following features:

Processor

The processor will be a set of modular components that can be 'hooked together' to allow a user to specify the pieces of a document that should be retained and the pieces that should be discarded. For example, a user should be able to say "keep all outward links, nouns, and dates" or "remove all tags and suffixes, and replace common synonyms". Each of the components (or filters) should be able to operate independently of the others; they should take as input a document (most likely tokenized by a prebuilt lexer) and produce as output a new document (possibly also as a sequence of tokens). A minimal sketch of this interface appears below.
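One way this could look in Python (the filter names here are purely illustrative) is to treat each filter as a callable from a token sequence to a token sequence, with a pipeline that simply composes them:

    def lowercase_filter(tokens):
        """Example filter: normalize every token to lower case."""
        return [t.lower() for t in tokens]

    def stopword_filter(tokens, stopwords=frozenset(["the", "a", "an", "of"])):
        """Example filter: discard common stopwords."""
        return [t for t in tokens if t not in stopwords]

    def run_pipeline(tokens, filters):
        """Apply each filter in order; every filter maps a token list to a new token list."""
        for f in filters:
            tokens = f(tokens)
        return tokens

    tokens = run_pipeline("The Anatomy of a Large-Scale Search Engine".split(),
                          [lowercase_filter, stopword_filter])

Keeping every filter to this single signature is what would make them independently reusable and reorderable.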

Some filters the processor should have include:

Some reading:

Storage and Indexing

The storage and indexing component will be responsible for cataloging pages that have been filtered and processed, and also for updating and maintaining metadata about these pages.

The storage and indexing component will need to be built with two concerns in mind: space-efficient storage and quick, flexible retrieval. Unfortunately, these concerns are often in opposition: for example, an easy way to conserve space is to compress stored pages (as the Internet Archive does), but this adds a delay to retrieval, as the page must be decompressed before it can be delivered to the user. Finding the appropriate balance between these issues will be an ongoing problem.
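To make the tradeoff concrete, a toy version of such a store (purely illustrative, not a proposed design) might compress page bodies with zlib while keeping an uncompressed inverted index for lookup; the space saved on page text is paid for by a decompression step at retrieval time:

    import zlib
    from collections import defaultdict

    class PageStore:
        """Toy store: compressed page bodies plus an uncompressed inverted index."""
        def __init__(self):
            self.pages = {}                   # url -> compressed page bytes
            self.index = defaultdict(set)     # term -> set of urls containing it

        def add(self, url, text):
            self.pages[url] = zlib.compress(text.encode("utf-8"))
            for term in text.lower().split():
                self.index[term].add(url)

        def lookup(self, term):
            return self.index.get(term, set())

        def retrieve(self, url):
            # decompression is the retrieval-time cost of the space savings
            return zlib.decompress(self.pages[url]).decode("utf-8")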

The storage and indexing component should ideally provide the following features:

Some reading:

Clustering engine

Clustering is the process of grouping documents together according to some criterion (usually topic similarity) without any prior labeling of the documents and, often, without deciding in advance what categories will describe them. For this reason, it is usually described as a form of unsupervised or semi-supervised learning.

The first thing that clustering requires is a metric for assessing the similarity of two or more documents. The most commonly-used metric is cosine similarity, but we should allow the user to choose among other metrics or supply their own.
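As a sketch of what a pluggable metric might look like (assuming documents arrive as token lists, which is not yet a settled interface), cosine similarity over raw term-frequency vectors is only a few lines of Python:

    import math
    from collections import Counter

    def cosine_similarity(doc_a, doc_b):
        """Cosine similarity between two token lists, using raw term-frequency vectors."""
        a, b = Counter(doc_a), Counter(doc_b)
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

A user-supplied metric would just be any function with this same two-document signature.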

There are several well-known clustering algorithms that can be implemented, including:

Some readings:

Classifier Suite

Classifiers are machine learning algorithms that are able to assign documents to one or more categories based on some features of the document (for example, spam/not-spam, interesting/not-interesting, topic, etc.). The primary difference between classifiers and clustering algorithms is that classifiers are supervised learning algorithms. That is, the possible categories or classifications are known in advance, and the algorithm is provided a set of labeled training data (documents that have already been categorized).

The Slashpack classifier suite should provide two things for users: 1) a variety of tunable classification algorithms and 2) a means to easily and flexibly specify learning regimens, including training and test sets, validation, and desired recall and precision.
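As an illustration of the learning-regimen half of this (the fit/predict interface and binary labels here are assumptions, not a fixed design), the suite might accept any classifier object, train it on labeled documents, and report precision and recall on a held-out test set:

    def evaluate(classifier, train, test):
        """Train on labeled (document, label) pairs and report precision/recall
        for the positive class, assuming labels are 0/1."""
        classifier.fit([d for d, _ in train], [y for _, y in train])
        predictions = classifier.predict([d for d, _ in test])
        labels = [y for _, y in test]
        tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall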

Classification algorithms to be implemented include:

A related issue to text classification is feature selection. The performance of text classifiers can vary drastically depending on the features that are used. There are lots of approaches to this, including IR methods such as TFIDF, use of anchor tags and other hypertext markup (here is a paper describing the use of anchor tags to improve classification performance), and statistical methods such as the chi-square test and mutual information.
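As one concrete example of a feature-weighting scheme the suite could offer (a sketch only, assuming documents are already tokenized), TFIDF scores each term by how often it appears in a document, discounted by how many documents in the collection contain it:

    import math
    from collections import Counter

    def tfidf(documents):
        """documents: list of token lists. Returns one {term: weight} dict per document."""
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(doc))          # count each term once per document
        n = len(documents)
        weighted = []
        for doc in documents:
            tf = Counter(doc)
            weighted.append({t: tf[t] * math.log(n / doc_freq[t]) for t in tf})
        return weighted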

Some readings: See the papers linked above, plus:

User interface and visualization

Slashpack will also require a user interface (preferably web-driven) that allows a user to easily set up an experiment from start to finish. Even more importantly, this user interface will need to automatically process and filter the results of experiments, construct graphs and visualizations that summarize the experiments, and export those graphs to a common, easily used format (most likely Encapsulated PostScript).

The Heritrix interface may be a useful starting point, although something more user-friendly may eventually be needed. For automatically constructing graphs, I've found Ploticus to be very useful.