STATISTICS 961, Fall Semester 2019, Course Web Page
- Syllabus
- Instructor's office hours:
- Goals and Non-Goals:
- There is just one goal -- prepare first-year PhD students in statistics for research.
- This is not an applied statistics course.
- This is not an R course. Fluency in R is assumed.
- If you are not a first-year PhD student in statistics, please note:
- Undergraduate and MBA students require an interview with the
instructor.
- The course grade is heavily based on class participation (35%).
- PhD students from programs other than statistics
should not count on completing this class and should therefore sign
up for sufficient credit from other courses.
- Course Selection Period ends on Tuesday, 2019/09/10.
- Drop Period ends on Monday, 2019/10/07.
Homework Assignments:
General honor code: You may discuss the problems with each
other in general terms, but you must write your own solution. All
sources, including friends and colleagues, must be cited. It is
important to get used to a stringent code of conduct in scientific
writing. On the other hand, use common sense and attribute where
honesty requires it. Two points worth special mention, followed by one exception:
*** If you received an extension for a homework, do not consult
posted solutions.
*** Consulting solutions of homeworks from previous years is an
offense.
*** An exception is with regard to LaTeX and English language help:
Avail yourself of as much as you need from whichever source.
- Homework 1: Linear algebra (1) and LaTeX practice.
Edit the LaTeX source of Homework 1.
Read the instructions carefully.
If you have never used LaTeX, you may want to consult with more
senior students about how to set yourself up and get a gentle introduction,
or you can install free software and documentation from
the following sources:
In the manual, pay special attention to Section 1.3.2 (special
characters) and Chapter 3 for math typesetting (math symbols:
Section 3.10). To produce PDF from LaTeX, the LEd environment
requires you to click the green and blue right arrows in the tool
bar. Feel free to check out other free software and other
documents. If you find something particularly useful, please let
the instructor know.
- Unless instructed otherwise, homeworks should be e-mailed in attachments to stat961.at.wharton[at-sign]gmail.com.
- The format should be .R or .pdf or .doc depending on the assignment.
- Your checked and graded solutions are returned in e-mail attachments.
Search '#AB' to find comments.
A score such as 8/10 at the end means '8 out of 10 points'.
A deduction of 2 points does not mean you got two questions wrong; it is
only a relative measure of how much below optimal your solutions are.
Course Materials:
Datasets:
- LA homeless data (courtesy of Richard Berk)
- FTSE Data 1991-05-13 through 2006-05-11
- A small book
- Spelling dictionary
- Speech fragment (3.3 MB!)
- Boston housing data
with a description of the variables
Geographical names are in this file.
- Pima Indians diabetes data
with a description of the variables
- Laser data
with a description of the variables
- Titanic survival data
- Marketing data
- Tips data; analysis
- Detergent data
- Places Rated data with a description
In a form suitable for xgobi or ggobi:
data,
column labels,
row labels.
Say 'xgobi places_ggobi' or 'ggobi places_ggobi', depending on which you're using.
- Accounting-Market Rate data (Myers, p.16ff)
- Odd regression data
- Fabric failure data (Myers, p.329f)
- Vaso-restriction data (Myers, p.331f)
- FedEx data, reduced for elasticity demo
- Blocks data, reduced for fixed/variable cost demo
- Algal bloom data; some background is in the
README file
- Webserver data
- Wine displayspace data
- Car Models 2003-4
- Body Fat data, variable names,
README file
R Functions:
IMPORTANT: If a function in the class notes does not work or is not
found in your R session, check whether the function is in one
of the R code files below. If so, download and read the file into R
one more time, even if you thought you had done so earlier.
I allow myself to update the code all the time.
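As a minimal sketch of the re-reading step described above, the following R code reads a code file into the session again and then checks that a function is defined. The temporary file and the function name `demo_fn` are hypothetical stand-ins for one of the course code files and one of its functions; substitute the path of the file you actually downloaded.

```r
# Hypothetical stand-in for a freshly downloaded course code file.
code_file <- tempfile(fileext = ".R")
writeLines("demo_fn <- function(x) x^2", code_file)

# Read the file into the current R session; repeat this after every download,
# since a later version of the file overwrites earlier definitions.
source(code_file)

# Check that the function is now defined in the session.
exists("demo_fn", mode = "function")
```

If `exists()` still reports that the function is missing after re-reading, the function is probably not in that file, so check the other code files.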
Background Papers:
- The paper on tree-based regression and classification
is in this PDF file.
- The paper on additive principal components is in this
PDF file.
Syllabus STAT 961
- Instructor: Andreas Buja
Email: stat961.at.wharton[at-sign]gmail.com
Office hours by appointment.
Office: JMHH 471
Class Time: Mon+Wed, 3:00-4:30pm
Class Room: 105 SH-DH
- Goal of this course: Prepare statistics Ph.D. students for independent research.
Accordingly, the demands on conceptual thinking and quick uptake will be considerable as the course progresses.
What this course is not:
- not an applied statistics course,
- not an R course, and
- not a service course to other departments.
- There will be homework at irregular times. It may be
laborious for some of you because it attempts to bring you up to
speed on background material that you may not be familiar with, in
particular in matters of linear algebra.
- In-class quizzes will be held, sometimes announced and
sometimes unannounced. Students who miss a class with a quiz will
make it up at a later date. All students are under honor code not to
exchange information about the quiz with a student who will make it up
later.
- Grades will be computed from homeworks, in-class quizzes
and class participation alone.
There will be no midterm and no final exam.
- Topics (not necessarily in this order):
- Linear Models and inference
- Diagnostics and general ideas of turning them into tests
- Multiplicity and replicability problems, frequentism
- Permutation tests
- Confidence intervals with bootstrap
- Non-parametric curve fitting
- Bias-variance trade-offs
- Cross-validation for estimating prediction errors and selecting bandwidths and models
- ACE and additive models
- Tree-based regression
- Exploratory data analysis, data visualization
- Principal Components
- Writing for research, including style and clarity, typesetting, web publishing
- Computing: As the tool of choice for the execution of data
analyses and simulations, we use the
R programming language.
Note: This is not an R class. R will not be taught at all,
in light of the computational literacy of this year's
statistics Ph.D. students.
- Prerequisites:
- A course in linear algebra:
basis changes and associated coordinate changes, linear maps,
inner product spaces, orthogonal projections,
eigendecompositions
- Probability at the level of Stat 430:
thinking in random variables, limit theorems
- Statistical inference at the level of Stat 431:
statistical tests, confidence intervals, linear models,
estimation, sufficiency
- Programming experience in R
- Undergraduate students contemplating this course: If you
do not have a solid background in statistics and linear algebra
already as well as R programming experience, you should not take
this class or, at a minimum, not rely on credit from it for
graduation. As mentioned above, the goal is to prepare students for
statistics research, and there will be only one standard of
performance for all students.
- Publication-quality writing and mathematical typesetting are of
utmost importance for statistics research. To get used to the
standards of writing research papers in statistics, some homeworks
will be required to be typeset in LaTeX and submitted by e-mail
as a PDF file. You will have to learn LaTeX on your own with the help
of other graduate students, but getting started with LaTeX
will be facilitated by templates provided by the instructor, so all
you need to do is cannibalize the templates by filling in your
solutions.
- Recommended web documents:
- If you need more reading about R, look up the numerous
books about R
or the numerous free
web documents about R.
Yet another way to find R introductions is to do a search for
"Introduction to R". If you find something particularly
useful, please let the instructor know.
- Other recommended texts:
- For regression: Seber and Lee, "Linear Regression Analysis"
(Wiley Series in Probability and Statistics)
- For linear algebra: Strang, "Linear Algebra and its Applications" (Academic Press)
Strangely, the most fundamental material is no longer in the recent edition:
"Linear Transformations, Matrices, and Change of Basis."
In older editions this used to be tucked away in the appendix.
For this reason, the material is now included in Homework 2:
You get to derive it yourself by following instructions.
- Venables and Ripley,
"Modern Applied Statistics with S-Plus" (Springer),
a recommended, terse book on a broad array of
applied statistics topics, based on the S/R language.
- Becker, Chambers, Wilks,
"The New S Language"
the original S book, also called "the blue book";
still a good source for R users as well. Many R help pages refer to it.
- Chambers, Hastie (eds.)
"Statistical Models in S"
on the statistical modeling language in S/R, also called "the white book"
- R. L. Harris,
"Information Graphics",
an excellent overview of useful and common data visualizations.
See also some popular TED talks
by Rosling,
and by McCandless.
As we go along, special topics books will be recommended.
- Supplemental material:
- A fabulous and fun article on R by Patrick Burns:
R Inferno.
Print it and keep it at your bedside!
- Here is a pointer to R blogs.