Stat 601, Fall 2001, Class 1

Introduction




The context

*
Premise: all business becomes information driven
*
Competitiveness: how you collect and exploit information to your advantage
*
The challenges
*
Most corporate data systems are not ready.
*
Can they share information?
*
What is the quality of the information going in?
*
Most data techniques come from the empirical sciences; the world is not a lab.
*
Cutting through vendor hype, info-topia.
*
Defining metrics and abandoning gut rules of thumb is not a safe path for the manager.
*
Communicating success, setting the right expectations.

Objectives of Stat 601

*
Recognize where and how information analysis can feed into your business
*
Strategy driven by hard information?
*
Change emphasis toward interpretation and practical application
*
The importance of graphics in informing analyses
*
Enjoy it

Quick review of the syllabus

*
Course material
*
Grading/assessment
*
TAs and office hours
*
Evaluations
*
Computing

Metaphor; the spoken language, not the grammar.

Course overview

*
Material
*
Classes 1-4. Understanding/measuring variability. Why it is important. Factor variability/uncertainty into the decision-making process
*
Who cares? The Basel Accord
*
Risky investments need higher reserves
*
Need to measure risk, e.g. J.P. Morgan
*
Risk == volatility of returns
*
Volatility == variability
*
The Four Book average - smoothing to reduce noise
*
Classes 5-10. Regression/statistical modeling/forecasting/explaining variability
*
Models
*
Stock market
*
Market share
*
Real estate prices
*
What's different? Our models explicitly incorporate variability; we don't just get to model the process, we get to say how good the model is. Meta-information: statements about the quality of information. Value added - the idea of precision.
*
What to get out of the course
*
Perform statistical analysis - hands on
*
No stats background required - not math based
*
PRACTICAL APPLIED MODERN STATISTICS
*
Success in the course
*
Learn the right questions to ask
*
Critical evaluation of another's analysis
*
Confidence to perform analysis/use tools
*
Presentation and communication of results
*
Guarantee: you will be faced with more data, not less. This course is about evaluating, summarizing and leveraging information

Today's material

Chapter II

Chapter III

Chapter II

Basic statistical graphics and summaries

Box plot               Identification of outliers
Histogram              Shape of data, skewness, outliers
Normal quantile plot   Diagnostic for normality

Summary measures

                        CENTER    SPREAD
Sensitive to outliers   Mean      Variance/SD
Robust                  Median    IQR

Definitions and notation

*
Mean = average. True $\mu$, estimated $\overline{x}$
*
Median = order the data and take the one in the middle. No standard notation.
*
Variance = average squared distance from the mean. True $\sigma^2$, estimated $s^2$
*
S.D. = $\sqrt{{\rm Variance}}$. True $\sigma$, estimated $s$
*
IQR = 75th percentile - 25th percentile. No standard notation.
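
A minimal sketch of these definitions, using Python's standard `statistics` module. The data values and the `summarize` helper are hypothetical, chosen only to illustrate the robust vs. sensitive distinction. Note two assumptions: `statistics.variance` divides by n - 1 rather than n, and the quartiles use the module's default convention (packages differ slightly on this).

```python
import statistics

def summarize(data):
    """Return the center and spread measures defined above."""
    # Quartile convention is an assumption: the module's default method.
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "variance": statistics.variance(data),  # s^2, divides by n - 1
        "sd": statistics.stdev(data),           # s = sqrt(s^2)
        "iqr": q3 - q1,                         # 75th minus 25th percentile
    }

# Hypothetical data: the second list replaces the largest value with an outlier.
clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean[:-1] + [180]

before = summarize(clean)
after = summarize(with_outlier)

for name in ("mean", "median", "sd", "iqr"):
    print(f"{name:>7}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this example the single outlier moves the mean and the S.D. a long way, while the median and the IQR do not change at all.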

Shapes of distributions/histograms

*
Symmetric bell shaped; mean $\sim$ median
*
Right skew; mean greater than median
*
Left skew; mean smaller than median

*
Symmetric bell shaped - good news.
*
Skewness - watch out!
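
The link between skewness and the gap between mean and median can be checked on small made-up data sets. The sketch below is illustrative only; the three lists and the `mean_minus_median` helper are hypothetical.

```python
import statistics

def mean_minus_median(data):
    """Positive suggests right skew, negative left skew, near zero symmetry."""
    return statistics.mean(data) - statistics.median(data)

# Hypothetical data sets.
right_skewed = [30, 35, 40, 45, 50, 60, 400]  # long right tail
left_skewed = [-x for x in right_skewed]      # mirror image, long left tail
symmetric = [30, 40, 50, 60, 70]

for label, data in [("right skew", right_skewed),
                    ("left skew", left_skewed),
                    ("symmetric", symmetric)]:
    print(f"{label:>10}: mean - median = {mean_minus_median(data):.2f}")
```

The long tail pulls the mean toward it, while the median stays with the bulk of the data.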

The empirical rule

If the data are bell shaped and symmetric, then we say they are approximately normal.

Key: the mean and standard deviation summarize the data efficiently in these circumstances.

The EMPIRICAL RULE applies when data is approximately normal.

Rule of thumb for normal data - it ties together the mean and standard deviation, ($\mu$ and $\sigma$) into a rule that establishes where most of the data should lie. If the data is outside this range then it's an ``atypical'' observation; in J.P. Morgan's terminology an adverse market move.


Special one: $\mu \pm 1.645 \times \sigma$ leaves a 10% chance of falling outside the range. That is 5% in each tail, so we see the lower event about one time in 20, or about 1 trading day a month.
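
The coverage figures behind this rule of thumb can be checked numerically. The sketch below assumes exactly normal data and uses the standard identity P(|X - mu| <= k sigma) = erf(k / sqrt(2)). The function names are hypothetical, and the figure of 21 trading days per month is an assumption, not something stated in these notes.

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

def prob_lower_tail(k):
    """P(X < mu - k * sigma): half of the probability outside the range."""
    return (1 - prob_within(k)) / 2

for k in (1, 2, 3, 1.645):
    print(f"within {k} SD: {prob_within(k):.4f}")

# Assumption: roughly 21 trading days in a month.
TRADING_DAYS_PER_MONTH = 21
expected_adverse_days = prob_lower_tail(1.645) * TRADING_DAYS_PER_MONTH
print(f"expected lower-tail days per month: {expected_adverse_days:.2f}")
```

Under these assumptions, about 68%, 95% and 99.7% of the data lie within 1, 2 and 3 standard deviations of the mean, and the 1.645 multiplier gives about 90%.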


Chapter III

Sources of variability ... the idea

*
Variables that explain structure in the data
*
Segmentation/Aggregation in marketing
*
The ``style'' of a portfolio manager
*
Trends over time
*
Looking for leads in the data to explain its structure
*
Characterize the good prospects

Terminology

*
Capable process: meets design specs - engineering view.
*
In control process: no trend in mean or variance - statistical view.
*
In control - more of a monitoring concept

Multiple boxplots

*
An excellent way to view data, broken out by a second variable, e.g. sales by region, or employee evaluation by age.
*
Look for: differences between the medians (center), and differences in the length of the boxes (spread).
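
The numbers that a set of side-by-side boxplots displays can also be computed directly. This is a rough sketch, not a plotting recipe: the (region, sales) records and the `boxplot_summary_by_group` helper are hypothetical, and the quartiles use the default convention of Python's `statistics.quantiles`.

```python
import statistics
from collections import defaultdict

# Hypothetical (region, sales) records.
records = [
    ("East", 10), ("East", 12), ("East", 14), ("East", 16),
    ("East", 18), ("East", 20), ("East", 22),
    ("West", 5), ("West", 15), ("West", 25), ("West", 35),
    ("West", 45), ("West", 55), ("West", 65),
]

def boxplot_summary_by_group(records):
    """Group values by key and return the median and IQR of each group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    summary = {}
    for key, values in groups.items():
        q1, median, q3 = statistics.quantiles(values, n=4)
        summary[key] = {"median": median, "iqr": q3 - q1}
    return summary

summary = boxplot_summary_by_group(records)
for region, stats in summary.items():
    print(f"{region}: median = {stats['median']}, IQR = {stats['iqr']}")
```

Comparing the medians across groups shows differences in center; comparing the IQRs shows differences in spread.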

Ten rules for data analysis

*
1. Always, always plot your data
*
2. If it's recorded against time, plot it against time

Review

*
Summary measures
*
Robust vs. Sensitive
*
Empirical rule for mound shaped and symmetric data
*
Ties together mean and s.d. to help define an ``unusual event''
*
Disparate data may be approximately normal, e.g. GMAT and GM
*
But not ALL data is normal, e.g. Eisner's compensation.





2001-09-06