Building a web proxy server

Goal

Your goal in this project is to build a web proxy server usable by common web browsers to access remote hosts anonymously. The proxy will restrict access to certain remote servers according to a blacklist. The specific requirements are described in the sections below.

Configuration

Your configuration file will follow the Python ConfigParser format:
[general]
port = 8080
black-list=
 http://aol\.com.*
 http://google\.com/mail.*
The URI patterns use Python regular expression syntax.
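To make the configuration format concrete, here is a minimal sketch of reading it and checking a URI against the blacklist. It assumes Python 3, where the module is named configparser, and the names load_config and is_blacklisted are hypothetical helpers, not part of the assignment. It also assumes patterns are anchored at the start of the URI (re.match), which is one reasonable reading of the examples above.

```python
import configparser
import re

# The sample configuration from above, as a raw string so that
# the backslashes in the patterns are kept verbatim.
SAMPLE = r"""
[general]
port = 8080
black-list=
 http://aol\.com.*
 http://google\.com/mail.*
"""

def load_config(text):
    """Return (port, compiled_patterns) parsed from configuration text."""
    # interpolation=None keeps '%' in a regex from being treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    port = parser.getint("general", "port")
    # The indented lines arrive as one multi-line value; split and drop blanks.
    raw = parser.get("general", "black-list", fallback="")
    patterns = [re.compile(line.strip())
                for line in raw.splitlines() if line.strip()]
    return port, patterns

def is_blacklisted(uri, patterns):
    """True if any pattern matches from the start of the URI (assumption)."""
    return any(p.match(uri) for p in patterns)
```

With a file on disk you would call parser.read(path) instead of read_string; the rest stays the same.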

Logging

Your log file /var/log/proxy/requests.log must follow a specific tab-delimited format:

timestamp \t HTTP status code \t method (GET or POST) \t requested remote URI

For example, if the browser requests

GET http://www.antlr.org HTTP/1.1
then you would append an entry, on a line by itself, to the log file like this:
2011-08-31 10:04:58.244145	200	GET	http://www.antlr.org HTTP/1.1
Please use the date format from datetime.now().__str__().
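A small sketch of producing one log line in this format follows. The function names format_log_entry and append_log_entry are hypothetical, and the optional now argument exists only so the output can be checked against the example entry above.

```python
from datetime import datetime

def format_log_entry(status, method, uri, now=None):
    """Build one tab-delimited log line ending in a newline."""
    if now is None:
        now = datetime.now()
    # str(now) gives the same text as now.__str__()
    return "\t".join([str(now), str(status), method, uri]) + "\n"

def append_log_entry(path, status, method, uri):
    """Append one entry, on a line by itself, to the log file at path."""
    with open(path, "a") as log_file:
        log_file.write(format_log_entry(status, method, uri))
```

For example, format_log_entry(200, "GET", "http://www.antlr.org HTTP/1.1", datetime(2011, 8, 31, 10, 4, 58, 244145)) reproduces the sample entry shown above.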

Resources

Remember your 3 friends: For example, to figure out how proxies work without screwing up your browser proxy settings, you can use curl:
curl --proxy localhost:8080 www.antlr.org
It sends the following data to my echo server:
GET http://www.antlr.org HTTP/1.1
User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
Host: www.antlr.org
Accept: */*
Proxy-Connection: Keep-Alive
<NEWLINE>
For more information on APIs, check out GitHub's API and Twitter's API. See urllib to learn how to pull data from a remote server.
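The curl request shown above is what your proxy has to pick apart: the request line carries the full remote URI rather than just a path. Here is one way that parsing might look, as a sketch only. The helper names parse_request_head and target_of are made up for illustration, the code assumes Python 3, and it assumes lines end in CRLF with a blank line closing the headers.

```python
from urllib.parse import urlsplit

def parse_request_head(raw):
    """Split raw request bytes into (method, uri, version, headers)."""
    # Everything before the first blank line is the request head.
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return method, uri, version, headers

def target_of(uri):
    """Return (host, port, path) for an absolute http URI."""
    parts = urlsplit(uri)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # Port 80 is assumed when the URI does not name one.
    return parts.hostname, parts.port or 80, path

# A shortened version of the request curl sent above.
SAMPLE = (
    b"GET http://www.antlr.org HTTP/1.1\r\n"
    b"Host: www.antlr.org\r\n"
    b"Accept: */*\r\n"
    b"Proxy-Connection: Keep-Alive\r\n"
    b"\r\n"
)
```

The host and port tell the proxy where to connect, and the path is what a proxy would typically put on the request line it sends to the remote server.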

Deliverables

  1. You will set up your proxy server running on your AWS machine on port 8080, which must be specified in your configuration file. I will test your proxy with curl as well as with browsers. Make sure that you can browse the web using your proxy, including sites that require cookies.

    If a client tries to access a blacklisted URI, your server should return a 403 Forbidden status code to the browser. The blacklist does not affect the /proxy URI interface deployed via Apache.

  2. You must have Apache/mod_wsgi running on your server and have it respond to a special URI with the contents of your log file:
    http://your-machine/requests.log
    
  3. It must also support the proxy URI for tunneling to remote hosts:
    http://your-machine/proxy?uri=a-remote-URI
    
  4. A printout of your Python source, delivered at the start of class on the day it's due. Your Python code for the proxy server should all fit in one file; please call that file proxy.py. The two URIs for the mod_wsgi server will be in separate files: tunnel.py and log.py. Make sure that your printout is easy to read: lines should not wrap all over the place, and tabs should be set properly.
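For the blacklist requirement in item 1, the proxy has to write a complete HTTP response back to the client itself. A minimal sketch of such a response is below; the function name forbidden_response and the wording of the body are assumptions, and only the 403 status is required by the assignment.

```python
def forbidden_response(uri):
    """Return the raw bytes of a small 403 response for a blocked URI."""
    body = ("403 Forbidden: %s is blocked by this proxy\n" % uri).encode("utf-8")
    head = (
        "HTTP/1.1 403 Forbidden\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        "Content-Length: %d\r\n"
        "Connection: close\r\n"
        "\r\n" % len(body)
    )
    # Status line and headers are ASCII; the body follows the blank line.
    return head.encode("ascii") + body
```

The proxy would send these bytes on the client socket and log the request with status 403.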
In summary, you have 2 servers to deliver. The proxy is written in Python and separate from Apache. To get those special URIs working, you must use Apache/mod_wsgi. You can test everything locally, but you must deploy to your AWS machine.
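To give a feel for the mod_wsgi side, here is a rough sketch of the two handlers. mod_wsgi looks for a callable named application in each script file, so in log.py and tunnel.py you would bind the result of the factory to that name. The factory functions make_log_app and make_tunnel_app, the helper fetch_remote, and the choice of 400, 404, and 502 for error cases are all assumptions for illustration. The code assumes Python 3.

```python
from urllib.parse import parse_qs
from urllib.request import urlopen

LOG_PATH = "/var/log/proxy/requests.log"

def _respond(start_response, status, body, ctype="text/plain"):
    """Send status and headers, then return the body as a WSGI iterable."""
    start_response(status, [("Content-Type", ctype),
                            ("Content-Length", str(len(body)))])
    return [body]

def make_log_app(path=LOG_PATH):
    """Build a WSGI app that returns the contents of the log file."""
    def application(environ, start_response):
        try:
            with open(path, "rb") as log_file:
                body = log_file.read()
        except OSError:
            return _respond(start_response, "404 Not Found",
                            b"log not available\n")
        return _respond(start_response, "200 OK", body)
    return application

def fetch_remote(uri):
    """Fetch a remote URI; return (body_bytes, content_type)."""
    with urlopen(uri) as resp:
        ctype = resp.headers.get("Content-Type", "application/octet-stream")
        return resp.read(), ctype

def make_tunnel_app(fetcher=fetch_remote):
    """Build a WSGI app that serves /proxy?uri=a-remote-URI."""
    def application(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        uri = params.get("uri", [""])[0]
        if not uri:
            return _respond(start_response, "400 Bad Request",
                            b"missing uri parameter\n")
        try:
            body, ctype = fetcher(uri)
        except (OSError, ValueError):
            # urlopen raises URLError (an OSError) on network failures
            # and ValueError on a malformed URI.
            return _respond(start_response, "502 Bad Gateway",
                            b"could not fetch remote URI\n")
        return _respond(start_response, "200 OK", body, ctype)
    return application

# In log.py:     application = make_log_app()
# In tunnel.py:  application = make_tunnel_app()
```

Passing the fetcher in as a parameter is just a convenience for testing locally without touching the network.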

Your project is due at the start of class time. Make sure that your server is running properly at that time.

Remember to keep your project clean. Your log file should not have random debugging data and you should not send debugging statements back to the browser for the system you deploy for us to grade.

WARNING: There is a huge number of examples out there that do similar things. You must not cut and paste entire methods or projects. You must do your own work on this in order to learn how proxies work. You are allowed to grab little snippets of code like I used when prototyping this project: datetime.now().__str__(). As always, this project should not be a group effort between multiple students.