STATISTICS 541, Fall Semester 2008, Course Web Page
Homework Assignments:
- Book recommendations:
- For regression:
Seber and Lee,
"Linear Regression Analysis" (Wiley Series in Probability and Statistics)
- For linear algebra:
Strang,
"Linear Algebra and its Applications" (Academic Press)
Strangely, the most fundamental material is tucked away in the Appendix:
"Linear Transformations, Matrices, and Change of Basis." If you are not fluent
in basis changes and associated coordinate transformations, study these 11 pages with great care!
- Homework 1:
R practice.
- Unless instructed otherwise, homeworks should be e-mailed as raw .txt
files in attachments to stat541.at.wharton[at-sign]gmail.com
Your checked and graded solutions are returned in e-mail attachments.
Search '#AB' to find comments. A score such as 8/10 at the end means
'8 out of 10 points'. A deduction of 2 points does not mean you got
two questions wrong; it is only a relative measure of how far your
solutions fall short of optimal.
Course Materials:
Datasets:
- Spelling dictionary
- Speech fragment (3.3 MB!)
- Boston housing data
with a description of the variables
Geographical names are in this file.
- Pima Indians diabetes data
with a description of the variables
- Laser data
with a description of the variables
- Titanic survival data
- Marketing data
- Tips data; analysis
- Detergent data
- Places Rated data with a description
In a form suitable for xgobi or ggobi:
data,
column labels,
row labels.
Say 'xgobi places_ggobi' or 'ggobi places_ggobi', depending on which you're using.
- Accounting-Market Rate data (Myers, p.16ff)
- Odd regression data
- Fabric failure data (Myers, p.329f)
- Vaso-restriction data (Myers, p.331f)
- FedEx data, reduced for elasticity demo
- Blocks data, reduced for fixed/variable cost demo
- Algal bloom data; some background is in the
README file
- Webserver data
- Wine displayspace data
- Car Models 2003-4
- Body Fat data, variable names,
README file
R Functions:
IMPORTANT: If a function in the class notes does not work or is not
found in your R session, check whether the function is in one
of the R code files below. If so, download and read the file into R
one more time, even if you thought you had done so earlier.
I allow myself to update the code all the time.
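
The re-reading step described above can be sketched in R as follows. This is only an illustration under stated assumptions: the file name course_functions.R and the function my_helper are hypothetical stand-ins (the actual course code files are not named here), and the file is created locally so that the sketch runs on its own.

```r
# Hypothetical stand-in for one of the course's R code files. The real
# file names are not given here, so a small file is created locally.
code_file <- file.path(tempdir(), "course_functions.R")
writeLines("my_helper <- function(x) x^2", code_file)

# Check whether the function is defined in the current session.
if (!exists("my_helper", mode = "function")) {
  # Not found: read the file into R. source() evaluates the file and
  # defines every function it contains.
  source(code_file)
}

# Reading the file again is harmless and picks up updated definitions,
# which matters when the code files change during the semester.
source(code_file)
my_helper(3)
```

In practice, code_file would be the path of a downloaded course file; calling source() on it again after each download replaces any older definitions in the session.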
Background Papers:
- The paper on tree-based regression and classification
is in this PDF file.
- The paper on additive principal components is in this
PDF file.
Syllabus
- Instructor: Andreas Buja, stat541.at.wharton[at-sign]gmail.com
Office hours: Monday, 2-4pm, or by appointment.
Office: JMHH 471
Class Room: JMHH F92
- The goal of the course is to prepare statistics Ph.D. students for
independent research. Accordingly, the demands on conceptual thinking
and quick uptake will be considerable as the course progresses.
Homework will approach the level of research problems.
- The grade will be computed from homeworks and class participation alone.
There will be no midterm or final exams.
- Topics:
- Exploratory data analysis, data visualization
- Statistical testing exemplified with permutation tests
- Confidence intervals with bootstrap
- Non-parametric curve fitting
- Cross-validation for estimating prediction errors and selecting bandwidths and models
- Bias-variance trade-offs
- Tree-based regression
- Linear models
- Additive models
- Exponential family models
- Classification
- Writing for research, including style and clarity, typesetting, web publishing
- As the tool of choice for the execution of data analyses and
simulations, we use the
R programming language.
R will be taught, but this is not an "R class."
- Prerequisites will include
- a course in linear algebra (linear
maps, inner product spaces, orthogonal projections,
eigendecompositions);
- statistical inference (including statistical
tests, confidence intervals, linear models and maximum likelihood
estimation);
- programming experience in some language (examples:
C, Fortran, Perl, Python, Visual Basic, R, Splus, Matlab,...).
- A word to undergraduate students contemplating this course: If
you do not have a solid background in statistics already, you should
not take this class or, at a minimum, not rely on credit from it for
graduation. As mentioned above, the goal is to prepare students for
statistics research, and there will be only one standard of
performance for all students.
- Writing is of utmost importance for research. To get used to the
standards of writing research papers in statistics, some homeworks must be
typeset in LaTeX and submitted by e-mail as a
PDF file.
- The only required texts are the following, one of which is not about statistics but about
writing:
- Required web documents:
- If you need more reading about R, look up the numerous
books about R
or the numerous free
web documents about R.
Yet another way to find R introductions is to do a search for
"Introduction to R". If you find something particularly useful, please, let me know.
- Recommended texts:
- Venables and Ripley,
"Modern Applied Statistics with S-Plus" (Springer)
a terse book on a broad array of applied statistics topics,
based on the S/R language.
- Becker, Chambers, Wilks,
"The New S Language"
the original S book, also called "the blue book";
even for R users still a good place to start.
- Chambers, Hastie (eds.)
"Statistical Models in S"
on the statistical modeling language in S/R, also called "the white book"
- R. L. Harris,
"Information Graphics"
an excellent overview of useful and common data visualizations.
As we go along, special topics books will be recommended.
- This semester there will be a TA for Stat 541: Michael Freiman, mfreiman@wharton.upenn.edu