CS680: Web Systems and Algorithms
Fall 2011 (Last updated: August 5, 2011)
MWF 2:15pm - 3:20pm HR232 (big lecture hall on main floor)
Instructor: Terence Parr
Office hours: Any time HR531 door is open or by appointment
First day of class: Wednesday, August 24, 2011
Last day of class: Wednesday, December 7, 2011
Exam 1: October 19
Exam 2: December 7 (last day of class)
Abstract
This course will survey topics, systems and algorithms related to the
World Wide Web and data mining. We will read articles and academic
papers, learn lots of technology, and build a number of interesting
projects. In order to be a web system architect and programmer, you
must be familiar with managing servers and installing software. We
will make extensive use of the Amazon Web Services throughout the
course. You will learn how to present data dynamically in webpages via
JavaScript technology, how to collect and crawl for data, how to store
and retrieve vast amounts of data, and how to analyze that data for
trends, similarity, and clusters.
Be forewarned that you will need to learn a lot of skills and technology on your own to complete the projects.
Requirements
You should be comfortable with:
- Writing medium-sized programs in a high-level programming language
such as Python, Java, or C++.
- Learning new software packages and libraries with minimal supervision.
- Installing and configuring software on your own computer.
- The standard mathematical tools of computer science, including
graphs, probability, and combinatorics.
- Expressing technical ideas in written English.
CS662 or CS682 would provide extremely useful background. I will also
assume that you either know or can teach yourself Python. I will
assume you know Java. I will teach you JavaScript, or at least teach you to do cut-and-paste programming in JavaScript like everybody else does when they use it for jQuery and Ajax (the language itself is... unpleasant).
Topics
Lecture notes
Web infrastructure
Collecting, storing, and representing data
Data analysis
- TF-IDF
- Naive Bayesian classifiers
- PageRank
- Text data mining / filtering, language identification, speaker identification, tags/clustering
- Enron email data set; friends/networks, communication frequency, bigrams, frequency analysis.
- MapReduce, Hadoop (Java)
Lectures
- 8/24: introduction and administrative stuff
- 8/26: web architecture, Amazon Web Services
- 8/29:
- 8/31: DNS servers, building proxy servers
- 9/2: Services, JSON, installing and using associated Python libraries, web caching.
- 9/7: Web page scraping, simple data parsing
Labs
- jQuery/Ajax lab
Projects
- 5% Getting started with Amazon Web Services (Due Aug 29)
- 10% Proxy server (Due Fri Sept 9)
- 5% Building rich clients with jQuery (Due Mon Sept 26)
- 10% Twice-cooked Data
- 15% Search Engine Construction
- 20% Clustering and classifying
There are no late projects.
I will deduct 10% if your program cannot be executed exactly in the
fashion specified in the project description.
Instruction Format
Class periods of 1 hour 5 minutes each, 3 times per week for 15 weeks.
Instructor-student interaction during lecture is encouraged. "Pop
quizzes" may appear during any class.
Grading
Your grade will be computed according to the following relationship:
| 5% | Labs/Quizzes/Class participation |
| 65% | Projects |
| 15% | Exam 1 (October 19) |
| 15% | Exam 2 (December 7) |
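As a rough illustration of how the weights above combine, here is a minimal Python sketch. The category names, the `course_grade` helper, and the sample scores are all hypothetical; it assumes each category is scored as a percentage from 0 to 100 and that the projects score is already combined across all projects. It is not the official grading procedure.

```python
# Weights copied from the grading table; the dictionary keys are
# illustrative names, not official category labels.
WEIGHTS = {
    "participation": 0.05,  # labs, quizzes, class participation
    "projects": 0.65,
    "exam1": 0.15,
    "exam2": 0.15,
}


def course_grade(scores):
    """Return the weighted course percentage.

    `scores` maps each category name in WEIGHTS to a percentage (0-100).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up sample scores, purely for illustration.
    example = {"participation": 100, "projects": 80, "exam1": 70, "exam2": 90}
    print(round(course_grade(example), 2))
```

Because projects carry 65% of the weight, a 10-point change in the combined project score moves the course percentage by 6.5 points, while the same change on one exam moves it by 1.5 points.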
Please note that class participation is part of your grade. You must
learn to interact with other developers and come up with solutions.
In general, I will read all papers, projects, quizzes, etc.
two times: once to evaluate the average and a second time to
assign scores. In the first pass, I also come up with a scoring
strategy for each question.
I consider an "A" grade to be above and beyond what most
students have achieved. A "B" grade is an average grade or what you
could call "competence" in a business setting. A "C" grade means that
you either did not or could not put forth the effort to achieve
competence. An "F" grade implies you did very little work or had
great difficulty with the class compared to other students.
I will be very strict and set a high standard in my grading,
but I will work hard to help you if you are having trouble. Some
of you may not get the grade you were hoping for in this class, but I
will do everything I can to make sure you learn a lot and have a
satisfying educational experience!
Unless you are sick or have a family emergency, I will not change
deadlines for projects nor exam times. For example, I will not give
you a special final exam just because you want to fly home early.
Consult the university academic calendar before making travel plans.
Books and resources
Available free online via USF's subscription to Safari:
I will also present content from Mining the Web by Soumen
Chakrabarti and Introduction to Information Retrieval by Manning,
Raghavan, and Schütze.
No doubt that you'll find the following resource useful:
Compiling, Executing, and Jar'ing Java Code.
We have academic licenses (so far) for:
CS680 Mailing List
I will be sending important information to this mailing list. You are
required to sign up for this list. To sign up:
CS680 Google group.
To post, email cs680@cs.usfca.edu.
Miscellaneous
Tardiness. Please be on time for class. It is a big distraction if you come in late.
Academic honesty. You must abide by the copyright laws of the
United States and the academic honesty policies of USF. If I tell you that
you may for a particular project, you can use any code you find on the net as
long as doing so does not violate the software's license. You may not
borrow code from other current or previous students. All suspicious
activity will be investigated and, if warranted, passed to the Dean of
Sciences for action.
Official text from USF: As a Jesuit institution committed to cura personalis (the care and
education of the whole person), USF has an obligation to embody and
foster the values of honesty and integrity. USF upholds the standards
of honesty and integrity from all members of the academic
community. All students are expected to know and adhere to the
University’s Honor Code. You can find the full text of the code
online at honor code.
The golden rule: You must never represent another person's work as your own.