Final Project, Stat 540 Spring 1999.

Deliverables (by 05/07/99):

A. A 2-3 page report describing your approach to the classification problem. It should include a brief statement of the problem, a description of the covariates used in your classifier, the chosen classifier itself, and a summary of how well your classifier performs on the training data set.
B. Just after the end of the reading week, I will provide a copy of the validation data set. You should then run your classifier on it and send me a list of predicted classifications by 05/07/99.

Here is a link to the training data set: bull.txt.

A. Recall that the objective of this homework is to take a corpus of bulletin board posts and classify each post into one of two categories, Bullish or Other.

A sample of raw text documents can be found in the directory

 /~waterman/public_html/Teaching/540s99/Posts

This directory also contains a file named idtc.txt, which holds an additional set of potential covariates describing the daily history of the stock price.

View each bulletin board post as an observation. From each post you must extract a set of covariates, of your own choosing and construction, that may be useful for the classification.
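
As an illustration of turning one post into covariates, here is a minimal Python sketch. The cue-word list BULLISH_WORDS, the function name extract_covariates, and the particular covariates are all assumptions made for this example; the assignment does not prescribe any of them, and you should choose your own.

```python
import re

# Hypothetical cue words, chosen only for illustration.
BULLISH_WORDS = {"buy", "long", "up", "rally", "strong"}

def extract_covariates(post: str) -> dict:
    """Turn one raw post into a dictionary of numeric covariates."""
    # Lowercase the text and split it into alphabetic tokens.
    words = re.findall(r"[a-z']+", post.lower())
    n_words = len(words)
    n_bullish = sum(1 for w in words if w in BULLISH_WORDS)
    return {
        "n_words": n_words,
        "n_bullish_words": n_bullish,
        # Guard against empty posts to avoid division by zero.
        "frac_bullish_words": n_bullish / n_words if n_words else 0.0,
        "n_exclamations": post.count("!"),
        "frac_upper": sum(c.isupper() for c in post) / len(post) if post else 0.0,
    }

example = extract_covariates("Strong buy! Going UP!")
```

Each post then contributes one row of covariates; covariates from idtc.txt could be attached to the same row once its layout is known.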

The first step is to construct a "spreadsheet" of classes and covariates, with one row per post, on which to train your classifier. Your classifier could be a logistic regression, a neural network, or a tree classifier. You may also wish to try boosting your learner.
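
To make the spreadsheet idea concrete, here is a minimal Python sketch that fits a logistic regression to a tiny table of classes and covariates. The toy rows, the two covariates, the learning rate, and the iteration count are all made up for illustration and are not taken from bull.txt. Fitting is by plain batch gradient ascent on the log-likelihood, which is only one of several ways to fit this model.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit intercept and weights by batch gradient ascent on the log-likelihood."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = yi - sigmoid(z)  # observed class minus fitted probability
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Return 1 (Bullish) if the fitted probability is at least 0.5, else 0 (Other)."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if sigmoid(z) >= 0.5 else 0

# Toy "spreadsheet": class (1 = Bullish, 0 = Other), then two made-up covariates.
rows = [
    (1, 3.0, 2.0),
    (1, 2.0, 2.0),
    (1, 4.0, 3.0),
    (0, 0.0, 0.0),
    (0, 1.0, 0.0),
    (0, 0.0, 1.0),
]
y = [r[0] for r in rows]
X = [list(r[1:]) for r in rows]

w = fit_logistic(X, y)
train_accuracy = sum(predict(w, x) == yi for x, yi in zip(X, y)) / len(y)
```

The same predict function can later be applied row by row to the validation spreadsheet to produce the list of predicted classifications. Keep in mind that accuracy on the training data tends to overstate how well the classifier will do on new posts.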

Last update 4/15/99.