Stat 540 Spring 1999. Statistical Computing.

Instructor. Richard Waterman.
email: waterman@compstat.wharton.upenn.edu
class homepage: http://www-stat.wharton.upenn.edu/~waterman/Teaching/540s99

Objectives.

Learn a programming language: Perl.
Learn web-based computing (CGI, periodic and recursive web agents).
Process text to extract information.
Classical linear algebra and optimization techniques. (Thisted)
Computation for Bayesian methods.
Overview of other computing tools: Mathematica, S-Plus.

Structure.

The course will be split into 5 modules. Each module will focus on constructing a program to perform a specific task. The task itself will be broken down into components, and a set of components will be addressed in each class.

Modules.

1. Web-based Mastermind. Perl and CGI.

i. Hello wide world. Perl, HTML, HTTP and CGI basics. Client/server paradigm. Take a protocol (HTTP), an interface (CGI), a markup language (HTML) and mix in a little Perl and you can do anything!
ii. Data structures in Perl. Scalars, lists and associative arrays.
iii. Subroutines, randomization.
iv. Saving "state"; "cookies".

2. Classification potpourri.

Set up a web-based interface that lets a user select a classification method for data analysis. AKA, how to make your methodology available to the world.

We will learn the algorithms behind the classification techniques and construct a web-based interface which allows a remote user to run them.

i. Logistic regression.
ii. Neural networks.
iii. Boosting.

3. Web based agents for mining online patent databases.

We will design recursive web agents to mine technology patent databases. We will construct representations of patent "family trees". We will represent these trees efficiently and discuss models for tree features.

i. Regular expressions and pattern matching. What does

m@(\w+)://([^/:]+)(:\d*)?([^#]*)@

do for you?
ii. Agent construction. The LWP module. OOP in Perl.
iii. Tree representation and analysis.
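As a hint for the question in item i, here is a minimal sketch of what that pattern captures. It is written in Python rather than the course's Perl, only so that it can be run and checked on its own; the regular expression itself is unchanged apart from dropping Perl's m@...@ delimiters. The function name split_url and the printed example are illustrative assumptions, not course material.

```python
import re

# The pattern from item i, without Perl's m@...@ delimiters.
# Group 1: scheme, group 2: host, group 3: optional ":port",
# group 4: everything after that up to (not including) a "#".
URL_PATTERN = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^#]*)")


def split_url(url):
    """Return (scheme, host, port, path), or None if the pattern does not match.

    The port keeps its leading colon and is None when the URL has no port.
    split_url is a hypothetical helper name used for illustration only.
    """
    match = URL_PATTERN.search(url)
    if match is None:
        return None
    return match.groups()


if __name__ == "__main__":
    # The class homepage given at the top of this syllabus.
    print(split_url("http://www-stat.wharton.upenn.edu/~waterman/Teaching/540s99"))
```

In short, the pattern pulls a URL apart into its scheme, host, optional port and path, which is the first step a web agent needs before it can decide where to fetch from. In Perl the same four pieces would land in $1 through $4.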

4. Click by click - collecting, representing and analyzing web click data.

We will customize the Apache web server's log files to allow a click by click analysis of a user's progression and actions through a web site.

i. Massive data set issues.
ii. Graphs and their representations.
iii. Algorithms.

5. The Tower of Babel - Yahoo's stock bulletin boards.

Is there any useful information out there? We will design periodic and adaptive agents to retrieve all posts, then extract and statistically analyze their content. We will obtain real-time stock quotes and correlate boards with markets.

i. Language feature extraction.
ii. Document classification - Bayesian models & Markov Chain Simulation.

Grading.

Weekly homework will be given. No late homework will be accepted. Each homework involves writing a program to carry out a specified set of tasks. If the tasks are correctly performed the homework gets a 1; otherwise it gets a 0. Students with 12 points at the end of the semester get an A, and those with 9 to 11 get a B. Fewer than 9 points gets a C.

Books.

Required text. Programming Perl. Larry Wall, Tom Christiansen & Randal L. Schwartz. O'Reilly. Second Edition 1996.

Optional and suggested books.

* Thisted.

* Bayesian Data Analysis. Gelman, Carlin, Stern and Rubin. Chapman and Hall. 1995.

* Web Client Programming with Perl. Clinton Wong. O'Reilly. 1997.

Richard Waterman
Fri Jan 15 00:34:47 EST 1999