STATISTICS 541, Fall Semester 2009, Course Web Page
Homework Assignments:
General honor code: You may discuss the problems with each
other in general terms, but you must write your own solution. All
sources, including friends and colleagues, must be cited. It is
important to get used to a stringent code of conduct in scientific
writing. On the other hand, use common sense and attribute where
honesty requires it. Two points worth special mention:
*** If you received an extension for a homework, do not consult
posted solutions.
*** Consulting solutions of homeworks from previous years is an offense
with severe consequences at any time.
- Homework 1 (Solutions):
R practice; see the "Introduction to R" below.
Students who received an extension must not consult the solutions (honor code).
Graded files have been returned by email in attachments. (Apologies for
the extension '.R'; please change to '.txt' for viewing in an editor.)
Search '##AB' to find comments. Apologies for the occasional nitpicking.
To truly learn to 'think in R' it is important to study the
solutions.
Some of you will find yourselves acknowledged for novel solutions
not previously known to this instructor.
- Homework 2 (Solutions):
linear algebra and LaTeX practice;
edit the LaTeX source and submit a PDF file by email.
If you have never used LaTeX, you can first install some free software:
In the manual, pay special attention to Section 1.3.2 (special
characters) and Chapter 3 for math typesetting (math symbols:
Section 3.10). To produce PDF from LaTex, the LEd environment
requires you to click the green and blue right arrows in the tool
bar. Feel free to check out other free software and other
documents. If you find something particularly useful, please
let me know.
- Homework 3 (Solutions):
R practice, string manipulations and analysis.
- Homework 4 (Solutions):
Linear algebra and linear models;
edit the LaTeX source and submit a PDF file by email.
- Homework 5:
Q-R decompositions and Least Squares computations in R.
- Homework 6: Writing exercise, Chapter 5 in "Style".
- Homework 7 (data, template):
real estate project.
- TA: Yuzhou Liu --- Office hour: Thursdays, 2-3pm, JMHH 440
Instructor's office hours: Monday, 4:30 after class
- Unless instructed otherwise, homeworks should be e-mailed
in attachments to stat541.at.wharton[at-sign]gmail.com.
The format should be .txt or .pdf or .doc depending on
the assignment.
Your checked and graded solutions are returned in e-mail attachments.
Search '#AB' to find comments. A score such as 8/10 at the end means
'8 out of 10 points'. A deduction of 2 points does not mean you got
two questions wrong; it is only a relative measure of how much below
optimal your solutions are.
Course Materials:
Datasets:
- A small book
- Spelling dictionary
- Speech fragment (3.3 MB!)
- Boston housing data
with a description of the variables
Geographical names are in this file.
- Pima Indians diabetes data
with a description of the variables
- Laser data
with a description of the variables
- Titanic survival data
- Marketing data
- Tips data; analysis
- Detergent data
- Places Rated data with a description
In a form suitable for xgobi or ggobi:
data,
column labels,
row labels.
Say 'xgobi places_ggobi' or 'ggobi places_ggobi', depending on which you're using.
- Accounting-Market Rate data (Myers, p.16ff)
- Odd regression data
- Fabric failure data (Myers, p.329f)
- Vaso-restriction data (Myers, p.331f)
- FedEx data, reduced for elasticity demo
- Blocks data, reduced for fixed/variable cost demo
- Algal bloom data; some background is in the
README file
- Webserver data
- Wine displayspace data
- Car Models 2003-4
- Body Fat data, variable names,
README file
- FTSE Data 1991-05-13 through 2006-05-11
R Functions:
IMPORTANT: If a function in the class notes does not work or is not
found in your R session, check whether the function is in one
of the R code files below. If so, download and read the file into R
one more time, even if you thought you had done so earlier.
I allow myself to update the code all the time.
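The check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `my.helper` and the temporary file standing in for a downloaded course code file are hypothetical and do not come from the course materials.

```r
# Stand-in for a downloaded course code file; the file and the function
# name 'my.helper' are hypothetical, for illustration only.
code.file <- tempfile(fileext = ".R")
writeLines("my.helper <- function(x) x^2", code.file)

# Is the function currently defined in this R session?
exists("my.helper", mode = "function")

# (Re-)read the file; this defines the function or overwrites a stale copy.
source(code.file)

# The function is now available and reflects the file's current contents.
exists("my.helper", mode = "function")
my.helper(3)
```

Calling source() on a file a second time is harmless: it simply replaces any older definitions with the ones currently in the file.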
Background Papers:
- The paper on tree-based regression and classification
is in this PDF file.
- The paper on additive principal components is in this
PDF file.
Book recommendations:
- For regression:
Seber and Lee,
"Linear Regression Analysis" (Wiley Series in Probability and Statistics)
- For linear algebra:
Strang,
"Linear Algebra and its Applications" (Academic Press)
Strangely, the most fundamental material is tucked away in the Appendix:
"Linear Transformations, Matrices, and Change of Basis." If you are not fluent
in basis changes and associated coordinate transformations, study these 11 pages with great care!
Syllabus STAT 541
- Instructor: Andreas Buja, stat541.at.wharton[at-sign]gmail.com
Office hours: by appointment.
Office: JMHH 471
Class Room: JMHH F36
- The goal of the course is to prepare statistics
Ph.D. students for independent research. Accordingly, the demands on
conceptual thinking and quick uptake will be considerable as the
course progresses.
- Homework will be extensive and approach the level of
research problems. The weekly homeworks will be the heart of what
you retain from this course.
- Grades will be computed from homeworks and class
participation alone. There will be no midterm or final exams.
- Topics:
- Linear Models and inference
- Exploratory data analysis, data visualization
- Statistical testing exemplified with permutation tests
- Confidence intervals with bootstrap
- Tree-based regression
- ACE and additive models
- Principal Components
- Non-parametric curve fitting
- Bias-variance trade-offs
- Cross-validation for estimating prediction errors and selecting bandwidths and models
- Writing for research, including style and clarity, typesetting, web publishing
- As the tool of choice for the execution of data analyses and
simulations, we use the
R programming language.
R will be taught, but this is not an "R class."
- Prerequisites will include
- a course in linear algebra (linear
maps, inner product spaces, orthogonal projections, eigen
decompositions);
- statistical inference at the level of Stat 431 (including statistical
tests, confidence intervals, linear models, estimation, sufficiency);
- programming experience in some language (examples:
R, Splus, Matlab, Perl, Python, Visual Basic; C, Fortran, ...).
- A word to undergraduate students contemplating this course: If
you do not have a solid background in statistics already, you should
not take this class or, at a minimum, not rely on credit from it for
graduation. As mentioned above, the goal is to prepare students for
statistics research, and there will be only one standard of
performance for all students.
- Writing is of utmost importance for research. To get used to the
standards of writing research papers in statistics, some homeworks will be
required to be typeset in LaTeX and submitted by e-mail as a
PDF file.
- The only required texts are the following, one of which is not about statistics but about
writing:
- Required web documents:
- If you need more reading about R, look up the numerous
books about R
or the numerous free
web documents about R.
Yet another way to find R introductions is to do a search for
"Introduction to R". If you find something particularly useful, please let me know.
- Recommended texts:
- Venables and Ripley,
"Modern Applied Statistics with S-Plus" (Springer)
a recommended, terse book on a broad array of applied statistics topics,
based on the S/R language.
- Becker, Chambers, Wilks,
"The New S Language"
the original S book, also called "the blue book";
even for R users still a good place to start.
- Chambers, Hastie (eds.)
"Statistical Models in S"
on the statistical modeling language in S/R, also called "the white book"
- R. L. Harris,
"Information Graphics",
an excellent overview of useful and common data visualizations.
As we go along, special topics books will be recommended.