Twice-cooked Data

Goal

Your goal in this project is to learn how to store and retrieve data using 2 databases: MySQL, MongoDB. We will also compare their performance just for fun. I have posted two zip files from Digital Corpora, into the S3 storage system at Amazon. There is about 1.2G of uncompressed data. This isn't a lot, but gives us enough to play with without having to use expensive virtual machines at Amazon. There are many different data file types within the zip files, but you only need to worry about PDF, HTML, and text files. Our goal is to extract all of the text in a file and then insert that text as a single unit into the databases as a row in a table or documents in a collection. For n files, there will be n records in each database.

Once you have all of the data in the databases, we can search for keywords with the search page for each database. For example,

/search?db=mysql&q=us%20embassy
would look in the mysql database for all records containing words us and embassy.

You will also provide a REST interface for each database to extract a single document. We will use the filenames as unique IDs since they are numbered. For example, one of the files might be called 001000.txt. The URI

/mysql/001000.txt
would return that document via the REST interface. Remember, that we are pulling from a database not a directory on the disk (which we could do simply with Apache).

Obviously this functionality could be cobbled together without writing any software using existing packages. The purpose of this exercise is to learn how all of those work.

Data

I have uploaded to zip files from the digital Digital Corpora library and put them into the S3 storage Amazon. From your box at Amazon, you can pull those files with the following commands:

# DO THESE ON AWS MACHINES
curl -o 001.zip https://s3-us-west-1.amazonaws.com/parrt-test/corpus/001.zip
curl -o 002.zip https://s3-us-west-1.amazonaws.com/parrt-test/corpus/002.zip

If you plan on testing and building your software on a local machine before deploying on Amazon's machine, PLEASE could not download from Amazon with those commands to your laptop or home machine because it costs money for bandwidth. Instead, you can pull directly from the corpora site:

# DO THESE ON YOUR LAPTOP OR HOME MACHINE (NON AWS MACHINES)
curl -o 001.zip https://domex.nps.edu/corp/files/govdocs1/zipfiles/001.zip
curl -o 002.zip https://domex.nps.edu/corp/files/govdocs1/zipfiles/002.zip

Transferring from S3 to your computer in the elastic cloud Amazon takes about 1:15s. You do not have infinite space on these boxes, only about 8G:

Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/xvda1                8064       853      7129  11% /
and the files are about .5G compressed:
	001.zip	18-Mar-2009 18:43	279M	 
	002.zip	03-May-2009 23:44	272M	 

Once you have unzipped those files, you might need to delete them to save room. Our databases will essentially be copies of the uncompressed files. A word of warning, the directory that you unzip is not writable, so use the following command to make it so:

$ chmod +w .

Once you jump into directory 001, for example, you will see that we have to deal with about 2 thirds of the files:

$ cd 001
$ ls *.pdf *.html *.txt|wc -l
664 
$ ls |wc -l
     977
$ du -m
623	.
Also note that the timestamp on the file is the crawl datetime for that file:
...
-r--r--r-- 1 ec2-user ec2-user    311808 Feb  9  2006 001962.doc
-r--r--r-- 1 ec2-user ec2-user      7995 Apr 11  2008 001963.rtf
...
You can use the timestamp of the converted file for pdf and html files.

Extracting text

To extract text from PDF files, we can use pdftotext, which you must install via:

$ sudo yum install poppler-utils

Then you can use it to extract text for a particular file:

$ pdftotext 001664.pdf
$ ls -l 001664*
-r--r--r--@ 1 parrt  parrt  30677 Feb  5  2009 001664.pdf
-rw-r--r--  1 root   parrt  10927 Sep 25 17:18 001664.txt

To extract text from HTML, we can simply use lynx, which you can install with:

$ sudo yum install lynx

Then, do something like the following to extract text:

lynx --dump 001396.html > 001396.txt

Naturally, you will need to use a shell script or Python program to extract all of the text from the nontext files. Do not do this manually for all files!

At this point, you should have about 1300 text files in 2 different directories, ready for us to shove into the databases.

You can ignore any files that cannot be converted by pdftotext or lynx. Some files cause errors during conversion.

Loading the databases

In class, we discussed how to use Python to insert and query for data in the 3 databases we are using for this project. Because only the persistence layer is different between the 3 databases, create a persistence layer that isolates the data insertion and querying from the underlying database implementation details. This amounts to creating objects that answer methods such as:
insert(filename, crawldate, data)
get(filename)
query(terms)
You can create whatever internal structure you want for your program; that is merely a suggestion. These objects can be reused with mod_wsgi to implement the RESET interface.

Once you have a persistent object, you can loop around insert() pushing in records from the corpus. Since MySQL is not really document based, you will have to decide whether or not to use the VARCHAR or BLOB column type for big strings pulled out of text files.

We want to be do case insensitive searching of the text, but we also want to be able to retrieve the original text, so don't tolower everything before you stored in the database. This complicates the queries significantly for MySQL, depending on how you implement the queries.

You can ignore files bigger than 10M or whatever the mongodb max file size is for mongodb Python.

Queries

Queries in MySQL use the '%' wildcard operator. For example,
select * from MYTABLE where text like '%us%embassy%'
finds all records with the words us and embassy in column text. This becomes unwieldy when you want to search for more than a single word, because you need to have the reverse of that string as well to check the other order. What about finding whole words instead of just those strings? Note that mysql does a case insensitive search for like, if I can remember. Alternatively, you could read all of the records into memory and use a string compare in Python, but I would expect this to be much slower. There are also Full-Text Searches in MySQL, but I don't think they work with the innodb engine used commonly in MySQL installations. Play around to see if you can get good performance.

For mongodb, you can use regular expressions to do searches, but you might also consider pulling in everything to Python and doing comparisons there. Again, experiment to see what you can learn about performance.

Yes, this will be very slow to search.

As output, just list the file names that have the keywords from the search query. I.e., send the list back to the browser.

Deliverables

Data loader

Create a loader program called dbload.py that takes two arguments from the command line (database name and a list of files):
$ python dbload.py (msysql|mongodb) filenames

URIs

Make the following REST URIs available via mod_wsgi and Python for pulling files back out of the database:
/mysql/filename
/mongodb/filename
In order to map, those URI to the appropriate WSGI Python script, you might consider something like
WSGIScriptAliasMatch ^/mysql/([^/]+) /usr/local/www/username/get.py
That will execute Python file /usr/local/www/username/get.py and pass in parameters you can pull from the environ Python parameter to application():
def application(environ, start_response):
    # parse the url and get the parameters "/{database}/{filename}"
    ....
Or, in the httpd.conf
RewriteEngine On
RewriteRule   ^/search$ /search.wsgi [P,L]
RewriteRule   ^/mysql/([^/]+) /get.wsgi?db=mysql&file=$1 [P,L]
RewriteRule   ^/mongodb/([^/]+) /get.wsgi?db=mongodb&file=$1 [P,L]
Also, create the following URI to perform searches on the indicated database:
/search?db=mysql&q=terms
/search?db=mongodb&q=terms

Description

Along with the source code for your project, please turn in a simple description that indicates how long it took to insert all of the files into each database. You can use the "time" shell command like this:
$ time python dbload.py msysql filenames
$ time python dbload.py mongodb filenames
Also, look at the physical data files associated with each database and describe the amount of storage required. For example, if you put your mongodb database in /var/data/mongodb, then use du to get the number of megabytes:
$ du -m /var/data/mongodb
It will be interesting to see the difference in wall clock time to insert data and also the overhead for each database (ratio of real data size to physical space requirements for that database). Please describe what you find in English.