Your goal in this project is to learn how to store and retrieve data using 2 databases: MySQL, MongoDB. We will also compare their performance just for fun. I have posted two zip files from Digital Corpora, into the S3 storage system at Amazon. There is about 1.2G of uncompressed data. This isn't a lot, but gives us enough to play with without having to use expensive virtual machines at Amazon. There are many different data file types within the zip files, but you only need to worry about PDF, HTML, and text files. Our goal is to extract all of the text in a file and then insert that text as a single unit into the databases as a row in a table or documents in a collection. For n files, there will be n records in each database.
Once you have all of the data in the databases, we can search for keywords with the search page for each database. For example,
/search?db=mysql&q=us%20embassy
You will also provide a REST interface for each database to extract a single document. We will use the filenames as unique IDs since they are numbered. For example, one of the files might be called 001000.txt. The URI
/mysql/001000.txt
Obviously this functionality could be cobbled together without writing any software using existing packages. The purpose of this exercise is to learn how all of those work.
I have uploaded to zip files from the digital Digital Corpora library and put them into the S3 storage Amazon. From your box at Amazon, you can pull those files with the following commands:
# DO THESE ON AWS MACHINES curl -o 001.zip https://s3-us-west-1.amazonaws.com/parrt-test/corpus/001.zip curl -o 002.zip https://s3-us-west-1.amazonaws.com/parrt-test/corpus/002.zip
If you plan on testing and building your software on a local machine before deploying on Amazon's machine, PLEASE could not download from Amazon with those commands to your laptop or home machine because it costs money for bandwidth. Instead, you can pull directly from the corpora site:
# DO THESE ON YOUR LAPTOP OR HOME MACHINE (NON AWS MACHINES) curl -o 001.zip https://domex.nps.edu/corp/files/govdocs1/zipfiles/001.zip curl -o 002.zip https://domex.nps.edu/corp/files/govdocs1/zipfiles/002.zip
Transferring from S3 to your computer in the elastic cloud Amazon takes about 1:15s. You do not have infinite space on these boxes, only about 8G:
Filesystem 1M-blocks Used Available Use% Mounted on /dev/xvda1 8064 853 7129 11% /
001.zip 18-Mar-2009 18:43 279M 002.zip 03-May-2009 23:44 272M
Once you have unzipped those files, you might need to delete them to save room. Our databases will essentially be copies of the uncompressed files. A word of warning, the directory that you unzip is not writable, so use the following command to make it so:
$ chmod +w .
Once you jump into directory 001, for example, you will see that we have to deal with about 2 thirds of the files:
$ cd 001
$ ls *.pdf *.html *.txt|wc -l
664
$ ls |wc -l
977
$ du -m
623 .
... -r--r--r-- 1 ec2-user ec2-user 311808 Feb 9 2006 001962.doc -r--r--r-- 1 ec2-user ec2-user 7995 Apr 11 2008 001963.rtf ...
To extract text from PDF files, we can use pdftotext, which you must install via:
$ sudo yum install poppler-utils
Then you can use it to extract text for a particular file:
$ pdftotext 001664.pdf $ ls -l 001664* -r--r--r--@ 1 parrt parrt 30677 Feb 5 2009 001664.pdf -rw-r--r-- 1 root parrt 10927 Sep 25 17:18 001664.txt
To extract text from HTML, we can simply use lynx, which you can install with:
$ sudo yum install lynx
Then, do something like the following to extract text:
lynx --dump 001396.html > 001396.txt
Naturally, you will need to use a shell script or Python program to extract all of the text from the nontext files. Do not do this manually for all files!
At this point, you should have about 1300 text files in 2 different directories, ready for us to shove into the databases.
You can ignore any files that cannot be converted by pdftotext or lynx. Some files cause errors during conversion.
insert(filename, crawldate, data) get(filename) query(terms)
Once you have a persistent object, you can loop around insert() pushing in records from the corpus. Since MySQL is not really document based, you will have to decide whether or not to use the VARCHAR or BLOB column type for big strings pulled out of text files.
We want to be do case insensitive searching of the text, but we also want to be able to retrieve the original text, so don't tolower everything before you stored in the database. This complicates the queries significantly for MySQL, depending on how you implement the queries.
You can ignore files bigger than 10M or whatever the mongodb max file size is for mongodb Python.
select * from MYTABLE where text like '%us%embassy%'
For mongodb, you can use regular expressions to do searches, but you might also consider pulling in everything to Python and doing comparisons there. Again, experiment to see what you can learn about performance.
Yes, this will be very slow to search.
As output, just list the file names that have the keywords from the search query. I.e., send the list back to the browser.
$ python dbload.py (msysql|mongodb) filenames
/mysql/filename /mongodb/filename
WSGIScriptAliasMatch ^/mysql/([^/]+) /usr/local/www/username/get.py
def application(environ, start_response):
# parse the url and get the parameters "/{database}/{filename}"
....
RewriteEngine On RewriteRule ^/search$ /search.wsgi [P,L] RewriteRule ^/mysql/([^/]+) /get.wsgi?db=mysql&file=$1 [P,L] RewriteRule ^/mongodb/([^/]+) /get.wsgi?db=mongodb&file=$1 [P,L]
/search?db=mysql&q=terms /search?db=mongodb&q=terms
$ time python dbload.py msysql filenames $ time python dbload.py mongodb filenames
$ du -m /var/data/mongodb