Below is an outline of the projected timeline for the Chronica project, including the goals and components of each deliverable.
Chronica Project: Internet Archive Search Engine for Archive Files.
Overview:
Create a search engine that works in conjunction with a web crawler,
building a Lucene index of the archive (ARC) files the crawler
creates. Also develop a web interface for querying the ARC file index
by keyword and content. In addition, create a time-based search and
display for queries against the index. The initial application will be
composed from components of the Nutch and Heritrix web crawlers.
Goals:
Deliverable 1:
Adapt either Nutch or Heritrix to do the following:
Output ARC files composed of the web data collected, and
index those ARC files using Lucene to create an index that allows
searching by keyword. For now, the content of the index
will be limited to the HTML and plain-text content of the ARC files.
Deliverable 2:
Create a web browser interface for searching the Lucene index of
ARC files. Add PDF, MS Word document, and image search support to the
indexing of ARC files.
Deliverable 3:
Create a time- and date-based search query for the web interface
developed above that displays a text description of search term
significance and trends on the results page.
Deliverable 4:
Further develop the time- and date-based search query for the web interface
developed above to display a graph of search term significance and trends
on the results page, in addition to keyword links.
Deliverable 5 (Final):
The complete system described above, including the search engine/crawler
and the web interface.
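As a rough illustration of what Deliverables 1 and 3 ask for, here is a minimal sketch of a keyword index with a per-month trend count. It is a toy in-memory stand-in written in plain Python, not Lucene and not Nutch or Heritrix code. The ToyIndex class, its method names, and the example.org URLs are all hypothetical, and it assumes each capture carries a 14-digit timestamp of the form YYYYMMDDhhmmss.

```python
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class ToyIndex:
    """Illustrative in-memory inverted index (hypothetical; not the Lucene API)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> (url, timestamp)

    def add(self, url, timestamp, text):
        """Index one captured page under its URL and capture timestamp."""
        doc_id = len(self.docs)
        self.docs[doc_id] = (url, timestamp)
        for term in TOKEN.findall(text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return sorted (url, timestamp) pairs that contain every query term."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(self.docs[i] for i in ids)

    def trend(self, query):
        """Count matching captures per month, keyed by YYYYMM."""
        return Counter(ts[:6] for _, ts in self.search(query))


index = ToyIndex()
index.add("http://example.org/a", "20040115120000", "Web archive search")
index.add("http://example.org/b", "20040220093000", "Archive of crawled pages")
index.add("http://example.org/c", "20040105080000", "Search the archive")
print(index.search("archive search"))
print(index.trend("archive"))
```

The per-month counts from trend() are the kind of data a text summary (Deliverable 3) or a graph (Deliverable 4) could be built from. A real implementation would store the capture date as a field in the Lucene index instead.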
To help you on your way, the following might be useful:
+ Here is a proxy handler for Jetty: http://cvs.sourceforge.net/viewcvs.py/archive-access/archive-access/projects/arc-collection-proxy/. If a request is for a machine other than the local machine, the handler attempts to satisfy the request from a local ARC collection. Of interest is the way it queries a Lucene index to find which ARC file a URL resides in, and at what offset. Also of interest is how ARCReader is used to pull the record from the ARC file. See the src/scripts directory for the Groovy scripts used to build the Lucene index (they use ARCReader). The proxy code is in src/java.
+ Here is a Nutch instance that has been populated by running its fetcher against the Wayback Machine: http://crawlprojects.archive.org. It needs work but it might give you some ideas. I have a bunch of notes on what was done to set this up and I can pass them to ye if ye want them. Just ask.
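The lookup idea in the proxy handler (find which ARC file a URL lives in and at what offset, then pull the record from there) can be sketched roughly as below. This is not the proxy's code and does not use ARCReader or Lucene; a plain dictionary stands in for the index. It assumes an uncompressed file in which each record is a space-separated header line ending in the body length, followed by that many bytes and a newline. The function names, the sample.arc file name, and the sample records are made up for illustration.

```python
import io


def build_offset_index(stream, arc_name):
    """Scan records in order, mapping each URL to (arc_name, byte offset)."""
    index = {}
    while True:
        offset = stream.tell()
        header = stream.readline()
        if not header:
            break                      # end of file
        if not header.strip():
            continue                   # blank separator line between records
        fields = header.decode("ascii").split()
        url, length = fields[0], int(fields[-1])
        index[url] = (arc_name, offset)
        stream.seek(length, io.SEEK_CUR)  # skip the body without reading it
    return index


def read_record(stream, offset):
    """Seek to a record's offset and return (url, body bytes)."""
    stream.seek(offset)
    fields = stream.readline().decode("ascii").split()
    return fields[0], stream.read(int(fields[-1]))


def make_record(url, body):
    """Build one sample record (hypothetical data, assumed layout)."""
    header = f"{url} 0.0.0.0 20040115120000 text/plain {len(body)}\n"
    return header.encode("ascii") + body + b"\n"


data = (make_record("http://example.org/a", b"first page")
        + make_record("http://example.org/b", b"second page"))
arc = io.BytesIO(data)
index = build_offset_index(arc, "sample.arc")
name, offset = index["http://example.org/b"]
print(name, offset, read_record(arc, offset))
```

Storing the offset alongside the file name means a single record can be fetched with one seek instead of rescanning the whole file. Compressed ARC files would need their decompression handled on top of this.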