Building a web proxy server
Goal
Your goal in this project is to build a web proxy server usable by common web browsers to access remote hosts anonymously. The proxy will restrict access to certain remote servers according to a blacklist. More specifically, your proxy server must:
- Listen on the port specified in the configuration file /etc/proxy.conf
- Respond to GET protocol commands that name the full URI of the target resource:
GET http://xyz.com/foo HTTP/1.0
- Respond to POST protocol commands, passing the request body on to the target resource:
POST http://xyz.com/foo HTTP/1.1
- Strip the User-Agent and Referer headers:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) ...
Referer: http://antlr.org/submit/challenge?type=grammar
- Send appropriate HTTP error codes back to the browser:
403 Forbidden (see the blacklist below)
404 Not found (Remote URI returns 404)
- Log all HTTP proxy requests in the file /var/log/proxy/requests.log
- Restrict access to remote servers according to a "blacklist" in the configuration file, /etc/proxy.conf, containing regular expressions. The server must disallow access to any URI matching one of these expressions and return a 403 error.
- Using the Apache/mod_wsgi server, create a proxy ("tunnel") for REST APIs such as Twitter and GitHub by sending the remote URI to the proxy as a parameter to a special URI: /proxy?uri=remote-REST-API-with-parameters. Naturally, you will have to URL-encode the REST URI in order to pass it to the /proxy URI as a parameter. For example, in order to obtain the following data:
http://github.com/api/v2/json/user/show/parrt/followers
we would request the following URI instead:
http://your-machine/proxy?uri=http%3A//github.com/api/v2/json/user/show/parrt/followers
Our /proxy "page" pulls data from that remote GitHub URI and returns it. This tunnel is necessary so that, later, we can ask JavaScript to access data on servers other than the one the JavaScript itself came from. As a protection, JavaScript cannot pull data from arbitrary servers around the net.
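To make the URL-encoding step concrete, here is a minimal sketch using the Python 3 standard library. The import path urllib.parse is an assumption about your Python version (Python 2 keeps quote in urllib itself). It builds the tunnel URI for the GitHub example and then decodes the parameter again, as the /proxy handler would have to do.

```python
from urllib.parse import quote, urlsplit, parse_qs

remote = "http://github.com/api/v2/json/user/show/parrt/followers"

# quote() leaves "/" unescaped by default but escapes ":", "?", "&" and "=",
# so a remote URI with its own parameters survives as a single uri= value.
tunnel_url = "http://your-machine/proxy?uri=" + quote(remote)
print(tunnel_url)

# On the server side, parse_qs() decodes the parameter back to the remote URI.
query = urlsplit(tunnel_url).query
decoded = parse_qs(query)["uri"][0]
print(decoded)
```

The encoded form matches the example URI shown earlier, and decoding it gives back the original remote URI unchanged.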
Note the following requirements:
- You must not try to buffer the entire payload in memory. In other words, once you establish a connection with the remote server, pull the data over in blocks that you can send back to the client browser.
- You must use the Python socket library to create your proxy server (not mod_wsgi). Do not use the "Requests" Python library, which might not work anyway for our task. We are trying to build a proxy that must look at the HTTP protocol incoming on the sockets. We cannot do this if you're using a standard web server that handles all of that stuff for us; those classes simply inform us that a GET or POST has occurred. We need something lower-level.
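One way to meet the no-buffering requirement is a loop that reads one block and forwards it immediately. The sketch below assumes both arguments are already-connected socket objects and that the sending side signals the end of the data by closing its connection. The name relay and the 4096-byte block size are illustrative choices, not part of the assignment.

```python
import socket

BLOCK_SIZE = 4096  # assumed block size; any modest value works


def relay(source, destination, block_size=BLOCK_SIZE):
    """Copy bytes from source to destination one block at a time.

    Returns the number of bytes copied. The loop ends when recv()
    returns b"", meaning the source side closed its connection.
    """
    total = 0
    while True:
        block = source.recv(block_size)
        if not block:
            break
        # sendall() keeps sending until the whole block has been written.
        destination.sendall(block)
        total += len(block)
    return total
```

At no point does this hold more than one block in memory. A connection that stays open (keep-alive) never produces the empty read, so a real proxy would also have to pay attention to headers such as Content-Length.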
Configuration
Your configuration file will follow the Python ConfigParser format:
[general]
port = 8080
black-list=
    http://aol\.com.*
    http://google\.com/mail.*
The URI patterns follow Python regular expression syntax.
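As a rough sketch of how such a file might be read, the code below uses configparser (the Python 3 name of the ConfigParser module). It assumes the blacklist patterns sit on indented continuation lines and that "matching" means matching from the start of the URI; use re.search instead if you read the requirement differently. The helper names load_config and is_blocked are made up for illustration.

```python
import configparser
import re

# Same content as the sample configuration, with the patterns indented
# so that the parser treats them as continuation lines of black-list.
SAMPLE = r"""
[general]
port = 8080
black-list=
    http://aol\.com.*
    http://google\.com/mail.*
"""


def load_config(text):
    """Return (port, compiled blacklist patterns) from configuration text."""
    # interpolation=None keeps "%" in a pattern from being treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    port = parser.getint("general", "port")
    raw = parser.get("general", "black-list")
    patterns = [re.compile(line.strip())
                for line in raw.splitlines() if line.strip()]
    return port, patterns


def is_blocked(uri, patterns):
    """True if the URI matches any blacklist pattern from its first character."""
    return any(pattern.match(uri) for pattern in patterns)


port, patterns = load_config(SAMPLE)
print(port, is_blocked("http://aol.com/news", patterns))
```

For the real server you would call parser.read("/etc/proxy.conf") instead of read_string.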
Logging
Your log file /var/log/proxy/requests.log must follow a specific tab-delimited format:
timestamp \t status code \t method-GET-or-POST \t requested remote URI
For example, if the browser requests
GET http://www.antlr.org HTTP/1.1
then you would append an entry, on a line by itself, to the log file like this:
2011-08-31 10:04:58.244145 200 GET http://www.antlr.org HTTP/1.1
Please use the date format from datetime.now().__str__().
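A minimal sketch of building one log line follows. It assumes the four fields listed in the format above, joined by tab characters; the function names format_log_entry and append_log_entry are hypothetical. Note that str(datetime.now()) gives the same text as datetime.now().__str__().

```python
from datetime import datetime


def format_log_entry(status, method, uri, now=None):
    """Return one tab-delimited log line, ending in a newline."""
    if now is None:
        now = datetime.now()
    return "\t".join([str(now), str(status), method, uri]) + "\n"


def append_log_entry(path, entry):
    """Append the entry to the log file, creating the file if needed."""
    with open(path, "a") as log_file:
        log_file.write(entry)


print(format_log_entry(200, "GET", "http://www.antlr.org"), end="")
```

Passing the time in as a parameter keeps the function easy to check by hand with a fixed timestamp.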
Resources
Remember your 3 friends:
- telnet: manually type text protocols to a server
- echo server: simple server listening at a port that prints out everything it hears from a connecting client
- curl: UNIX program to grab data from a variety of servers including HTTP servers.
For example, to figure out how proxies work without screwing up your browser proxy settings, you can use curl:
curl --proxy localhost:8080 www.antlr.org
It sends the following data to my echo server:
GET http://www.antlr.org HTTP/1.1
User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
Host: www.antlr.org
Accept: */*
Proxy-Connection: Keep-Alive
<NEWLINE>
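The request above can be taken apart with plain string operations. This is only a sketch: it assumes you have already read the bytes up to and including the blank line from the client socket, and the names parse_request_head and strip_headers are illustrative.

```python
STRIPPED_HEADERS = {"user-agent", "referer"}  # compared in lowercase


def parse_request_head(head):
    """Split the raw request head (bytes) into method, URI, version, headers."""
    lines = head.decode("iso-8859-1").split("\r\n")
    method, uri, version = lines[0].split(" ", 2)
    headers = []
    for line in lines[1:]:
        if not line:  # the blank line ends the header section
            break
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return method, uri, version, headers


def strip_headers(headers):
    """Drop the headers the proxy must not forward to the remote server."""
    return [(name, value) for name, value in headers
            if name.lower() not in STRIPPED_HEADERS]


raw = (b"GET http://www.antlr.org HTTP/1.1\r\n"
       b"User-Agent: curl/7.19.7\r\n"
       b"Host: www.antlr.org\r\n"
       b"Accept: */*\r\n"
       b"Proxy-Connection: Keep-Alive\r\n"
       b"\r\n")
method, uri, version, headers = parse_request_head(raw)
print(method, uri, strip_headers(headers))
```

Header names are case-insensitive in HTTP, which is why the comparison lowercases them first.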
For more information on APIs, check out GitHub's API and Twitter's API. See urllib to learn how to pull data from a remote server.
Deliverables
- You will set up your proxy server running on your AWS machine at port 8080, which must be specified in your configuration file. I will use curl to test your proxy as well as browsers. Make sure that you can browse the web using your proxy, including sites that require cookies.
If a client tries to access a blacklisted URI, your server should return a 403 Forbidden status code to the browser. The blacklist does not affect the /proxy URI interface deployed via Apache.
- You must have Apache/mod_wsgi running on your server and have it respond to a special URI with the contents of your log file:
http://your-machine/requests.log
- It must also support the proxy URI for tunneling to remote hosts:
http://your-machine/proxy?uri=a-remote-URI
- A printout, delivered at the start of class on the day it's due, of your Python source. Your Python code for the proxy server should all fit in one file. Please call that file proxy.py. The two URIs for the mod_wsgi server will be in separate files: tunnel.py and log.py. Make sure that your printout is easy to read. In other words, make sure your lines are not wrapping all over and the tabs are set properly, etc.
In summary, you have 2 servers to deliver. The proxy is written in Python and separate from Apache. To get those special URIs working, you must use Apache/mod_wsgi. You can test everything locally, but you must deploy to your AWS machine.
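On the Apache side, mod_wsgi runs a Python file that defines a callable named application. Below is one possible shape for log.py, offered as a sketch rather than the required design. The factory function make_log_app is a hypothetical helper that only exists to make the path easy to change, and the Apache configuration that maps /requests.log to this file is not shown.

```python
def make_log_app(log_path):
    """Return a WSGI callable that serves the file at log_path as plain text."""
    def application(environ, start_response):
        try:
            # Reads the whole log at once; a very large log could instead
            # be returned in blocks.
            with open(log_path, "rb") as log_file:
                body = log_file.read()
            status = "200 OK"
        except OSError:
            body = b"log file not available\n"
            status = "404 Not Found"
        headers = [("Content-Type", "text/plain"),
                   ("Content-Length", str(len(body)))]
        start_response(status, headers)
        return [body]
    return application


# mod_wsgi looks for a module-level callable named "application".
application = make_log_app("/var/log/proxy/requests.log")
```

tunnel.py would have the same outline, except that its body comes from the remote URI named in the uri parameter instead of from a local file.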
Your project is due at the start of class time. Make sure that your server is running properly at that time.
Remember to keep your project clean. Your log file should not have random debugging data and you should not send debugging statements back to the browser for the system you deploy for us to grade.
WARNING: There is a huge number of examples out there that do similar things. You must not cut and paste entire methods or projects. You must do your own work on this in order to learn how proxies work. You are allowed to grab little snippets of code like I used when prototyping this project: datetime.now().__str__(). As always, this project should not be a group effort between multiple students.