Instructor. Richard Waterman.
email: waterman@compstat.wharton.upenn.edu
class homepage: http://www-stat.wharton.upenn.edu/~waterman/Teaching/540s99
The course will be split into 5 modules. Each module will focus on constructing a program to perform a specific task. The task itself will be broken down into components, and a set of components will be addressed in each class.
i. Hello wide world. Perl, HTML, HTTP and CGI basics. Client/Server paradigm.
Take a protocol (HTTP), an interface (CGI), a markup language (HTML),
mix in a little Perl, and you can do anything!
ii. Data structures in Perl. Scalars, lists and associative arrays.
iii. Subroutines, randomization.
iv. Saving "state"; "cookies".
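As a rough illustration of how these pieces fit together, here is a minimal sketch of a CGI script in Perl. It is not taken from the course materials. The subroutine name hello_page and the query parameter "name" are hypothetical, and the query-string parsing is deliberately simplified (no URL decoding).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a complete CGI response: an HTTP header, a blank line, then HTML.
# hello_page is a hypothetical name chosen for this sketch.
sub hello_page {
    my ($name) = @_;
    # Escape the characters that are special in HTML before echoing input.
    $name =~ s/&/&amp;/g;
    $name =~ s/</&lt;/g;
    $name =~ s/>/&gt;/g;
    my $html = "<html><head><title>Hello</title></head>"
             . "<body><h1>Hello, $name!</h1></body></html>\n";
    return "Content-type: text/html\r\n\r\n" . $html;
}

# The web server passes the part of the URL after "?" in QUERY_STRING.
# Store the key=value pairs in an associative array (hash).
my %param;
for my $pair (split /&/, $ENV{QUERY_STRING} // '') {
    next unless length $pair;              # skip empty pairs such as "a=1&&b=2"
    my ($key, $value) = split /=/, $pair, 2;
    $param{$key} = defined $value ? $value : '';
}

# Fall back to a default greeting when no "name" parameter was sent.
print hello_page($param{name} // 'wide world');
```

Run from the command line with no QUERY_STRING set, the script prints the header followed by a page greeting the "wide world". Under a CGI-enabled web server, a request ending in ?name=Penn would greet "Penn" instead.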
We will learn the algorithms behind the classification techniques listed below and construct a web-based interface that allows a remote user to apply them.
i. Logistic regression.
ii. Neural networks.
iii. Boosting.
We will design recursive web agents to mine technology patent databases. We will construct representations of patent "family trees". We will represent these trees efficiently and discuss models for tree features.
i. Regular expressions and pattern matching. What does
m@(\w+)://([^/:]+)(:\d*)?([^#]*)@ do for you?
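One way to explore that question is to apply the pattern to a sample URL and inspect the four capture groups. The sketch below assumes a made-up URL, and the variable names $scheme, $host, $port and $path are labels chosen here, not names from the course.

```perl
use strict;
use warnings;

# Hypothetical URL, used only to exercise the pattern.
my $url = 'http://www.example.com:8080/docs/index.html#top';

my ($scheme, $host, $port, $path);
if ($url =~ m@(\w+)://([^/:]+)(:\d*)?([^#]*)@) {
    # $1: word characters before "://"      (the scheme)
    # $2: everything up to the next / or :  (the host)
    # $3: optional ":" plus digits          (the port, colon included)
    # $4: everything up to a "#"            (the path, fragment excluded)
    ($scheme, $host, $port, $path) = ($1, $2, $3, $4);
    $port = '' unless defined $port;       # group 3 may not take part in the match
    print "scheme: $scheme\n";
    print "host:   $host\n";
    print "port:   $port\n";
    print "path:   $path\n";
}
```

For this URL the groups come out as http, www.example.com, :8080 and /docs/index.html. Note that the third group keeps its leading colon, and that the pattern is not anchored, so it matches the first URL-like substring it finds.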
We will customize the Apache web server's log files to allow a click-by-click analysis of a user's progression and actions through a web site.
i. Massive data set issues.
ii. Graphs and their representations.
iii. Algorithms.
Is there any useful information out there? We will design periodic and adaptive agents to retrieve all posts, then extract and statistically analyze their content. We will obtain real-time stock quotes and correlate boards with markets.
i. Language feature extraction.
ii. Document classification - Bayesian models & Markov Chain Simulation.
Homework will be assigned weekly. Late homework will not be accepted. Each homework involves writing a program to carry out a specified set of tasks. If the tasks are performed correctly the homework earns 1 point; otherwise it earns 0. Students with 12 points at the end of the semester get an A, those with 9 to 11 get a B, and those with fewer than 9 get a C.
Required text. Programming Perl. Larry Wall, Tom Christiansen & Randal L. Schwartz. O'Reilly. Second Edition 1996.
Optional and suggested books.
* Thisted.
* Bayesian Data Analysis. Gelman, Carlin, Stern and Rubin. Chapman and Hall. 1995.
* Web Client Programming with Perl. Clinton Wong. O'Reilly. 1997.