Here are a few questions that we need to answer: See expanded section...
-> How to return a page from an ARC file to the user, i.e. extraction and display. How does IA do it? Is the page cached and then displayed or is it dynamically created straight from the ARC file?
-> (Thinking ahead to D2) What language should the Web interface be written in? Java? PHP?
-> Should we create a completely separate application to index the ARC files after they are created, or should Chronica be linked to Heritrix (i.e. as an extra filter/processor that indexes each ARC file after Heritrix creates it)?
-> Should the links (including offsets into the ARC file) or the actual ARC records be passed back to the user interface when the Lucene results are collected?
+ See my posting on the proxy server at http://cs.usfca.edu/~rstevens/archiveproject/archives/000432.html#more; it pulls an ARC Record from an ARC file.
+ The engine should be Java if it's going to use ARCReader.
+ I'd suggest keeping your application separate from Heritrix, but structuring it so that pieces of it could be used elsewhere, e.g. as an index writer hooked into Heritrix as you suggest above.
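
To make the "offsets into the ARC file" question concrete, here is a rough Java sketch of pulling a single record out of a compressed ARC file when only the file and a byte offset are known. It is not the proxy server's code and it does not use ARCReader. It relies only on java.util.zip and rests on two assumptions: that each record is stored as its own gzip member, and that the record starts with an ARC v1 style header line (URL, IP address, archive date, content type, length, separated by spaces). The class name ArcRecordSketch and the methods readRecordBody and appendRecord are made up for illustration. appendRecord exists only to build a small demo file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Heritrix or Chronica.
class ArcRecordSketch {

    /** Returns the body of the record whose gzip member starts at byte {@code offset}. */
    static byte[] readRecordBody(Path arcFile, long offset) throws IOException {
        try (FileChannel channel = FileChannel.open(arcFile, StandardOpenOption.READ)) {
            // Jump straight to the record; nothing before the offset is decompressed.
            channel.position(offset);
            DataInputStream in = new DataInputStream(
                    new GZIPInputStream(Channels.newInputStream(channel)));

            // Assumed header layout: "URL IP date content-type length\n".
            ByteArrayOutputStream headerBytes = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                headerBytes.write(b);
            }
            String header = new String(headerBytes.toByteArray(), StandardCharsets.ISO_8859_1);
            String[] fields = header.trim().split(" ");
            int length = Integer.parseInt(fields[fields.length - 1]);

            // Read exactly `length` bytes so we never run into the next record.
            byte[] body = new byte[length];
            in.readFully(body);
            return body;
        }
    }

    /** Demo only: appends one record as its own gzip member and returns its start offset. */
    static long appendRecord(Path arcFile, String url, byte[] body) throws IOException {
        long offset = Files.exists(arcFile) ? Files.size(arcFile) : 0L;
        // Placeholder IP address and date, just to fill the assumed header fields.
        String header = url + " 0.0.0.0 20050101000000 text/plain " + body.length + "\n";
        try (OutputStream out = Files.newOutputStream(
                     arcFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
             GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(header.getBytes(StandardCharsets.ISO_8859_1));
            gz.write(body);
            gz.write('\n'); // records are followed by a newline separator
        }
        return offset;
    }

    public static void main(String[] args) throws IOException {
        Path demo = Files.createTempFile("demo", ".arc.gz");
        appendRecord(demo, "http://example.org/a", "first page".getBytes(StandardCharsets.UTF_8));
        long second = appendRecord(
                demo, "http://example.org/b", "second page".getBytes(StandardCharsets.UTF_8));
        // Only the (file, offset) pair is needed to get the second record back.
        System.out.println(new String(readRecordBody(demo, second), StandardCharsets.UTF_8));
        Files.delete(demo);
    }
}
```

If these assumptions hold for our files, the Lucene index would only need to store the ARC file name and the record offset for each hit, and the user interface could fetch the record on demand instead of receiving whole ARC records with the search results. For a crawled page, the body read this way would presumably still contain the HTTP response headers ahead of the page content, so the display step would have to split those off.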