----------------------------------------------------------------
RECAP LECTURE 1:
- Syllabus:
  * Problems with timing of Midterm 2 will be looked into
  * TA hours will be in Syllabus shortly.
  * webCafe: calendar, course materials, homeworks, discussion board
- Rules:
  * No cell phone use (including text messaging)
  * No laptop use, unless explicitly permitted
    (recommended note taking: print script before class,
     annotate in class, enter annotations in file after class)
  * No disrespect, no disruption of learning
  --- BE CONSIDERATE TO OTHERS ---
- Diagnostics Quiz:
  * Page 1: You didn't have to know these things now.
            You should know them at the end of the semester.
  * Page 2: As an informed citizen and graduate of a good college,
            you should know these things.
- What is statistics about? data, data, data, data, ...
- What is statistics good for?
  * It provides a job skill.
  * It also provides an outlook on the world:
    the role of uncertainty, randomness.
- Data, data, data, data, ... list of first examples
- Format of data for statistics:
  tables describing objects and their characteristics
  -> CASES and VARIABLES
----------------
ROADMAP LECTURE 2:
- Classification of variables:
    qualitative       --  quantitative
    nominal, ordinal  --  discrete, continuous (interval, ratio)
  also: time and space as variables
- Discovery techniques: plots (graphs, charts)
  * Types of graphs to be shown:
    Bar plots, mosaic plots, histograms, boxplots, scatterplots,
    comparison boxplots, time series plots, geographic maps with glyphs
  * Data examples to be used:
    Titanic data, CEO compensation data, car mileage data,
    presidential approval ratings, metro area climate data
----------------------------------------------------------------
RECAP LECTURE 2:
- Remember rules:
  * RESPECT, CONSIDERATION (no disrespect & no disruption)
  * no electronic devices, no communication with the outside
    (IM, email, text msg.)
- Statistics is all about data and uncertainty
- Classification of variables
- Exploration of data with plots
- Bar plots, mosaic plots for purely qualitative variables
[Note: The script of Module 1 has been somewhat updated.]
----------------
ROADMAP LECTURE 3:
- Further types of plots
  * histograms
  * boxplots
  * scatterplots
  * comparison boxplots
Goal: Be prepared for Homework 1, posted on webCafe, due in a week.
----------------------------------------------------------------
RECAP LECTURE 3:
- Types of variables: qual. -- quant.
- Data exploration: plots
  JMP: highly interactive plots...
- Variable type drives plot type
    1 qual                  -> bar plot
    1 quant                 -> histogram + boxplot
    2 qual                  -> mosaic plot
    2 quant                 -> scatterplot
    1 qual (X), 1 quant (Y) -> comparison boxplot
- JMP: 1 var  -> 'Analyze > Distribution'
       2 vars -> 'Analyze > Fit Y by X'
- Interactive manipulations on plots in JMP:
  * plot resizing (grab bottom right of plotting area)
  * axis rescaling (grab either end of axis area)
  * axis shifting (grab center of axis area)
  * histogram bin widths ('Grabber', drag in histogram area)
- Selecting cases and linking plots
  * identifying points: move cursor over point, label appears
  * selecting:
    . clicking bars
    . clicking points (one) or sweeping a rectangle (many)
    hold Shift while selecting to accumulate selections
  * to color or label selected points:
    Rows > Label/Unlabel, Markers, Colors
  * linking: selections show up in spreadsheet and all plots
----------------
ROADMAP LECTURE 4:
- Time and space: time series plots and maps with markers
- Module 2: Numeric summaries of variables
  * Numeric summaries, too, are driven by variable type.
    - Qualitative variables: counts & proportions
    - Quantitative variables: measures of location and dispersion
    - Converting these measures from one unit to another
----------------------------------------------------------------
ANNOUNCEMENTS:
- Instructor's office hours are moved to Fri, 1-3:30pm
- Homework 1 due today, no later than 5pm
- Quiz 1: Wed, Jan 31, first thing in class
  Rules: closed notes, closed book,
         [Memorization of conventions is part of the quiz.]
         You must take the quiz in your section!
  Form:  ~10 min
         5 multiple choice questions
         4 choices each
         JMP output may be shown and questions may be asked about it.
RECAP LECTURE 4:
- Time series plots: scatterplots with X = time, Y = quantitative variable,
  but connect dots (curve)
  example: presidential approval ratings over time
- [General warning against visual cheats:
   if Y is a ratio scale variable, show zero]
- Maps made of markers:
  if sufficiently many recognized locations are given,
  outlines may be unnecessary
  trick: use marker colors/shapes/sizes to code variables of the locations
  example: metropolitan areas, climate index
- Module 2: Numerical summaries
- Summaries of qualitative variables:
  counts/frequencies and proportions/percentages
- Summaries of quantitative variables:
  two types: location and dispersion measures (where? how wide?)
- Location measures: mean, median, quantiles
- Changing units: location measures change units like the raw values
  (small hiccup: when the units change signs, quantiles flip upper/lower)
----------------
ROADMAP LECTURE 5:
- Dispersion measures: range, interquartile range, standard deviation
- Changing units: intuitively, dispersion ~ width
  width does not change when shifting values,
  but width changes when blowing up or shrinking values
- Understanding standard deviation: some messiness
----------------------------------------------------------------
REMINDER: Quiz 1 on Wed, Jan 31
          Must be taken with your own section.
          (Sec 001: 10:30am; Sec 002: 1:30pm)
----------------
RECAP LECTURE 5:
- Numeric summaries of variables:
  * Location measures: mean, median, quantiles,...
  * Dispersion measures: range, IQR, SD (=s)
- Behavior of measures under change of units:
  * Location measures transform like raw values.
  * Dispersion measures scale but ignore shift.
    (Scale factor = absolute value of multiplier)
- Oddities of the SD:
  * squares and root
  * N-1 instead of N
- JMP: Distribution > ... reports mean, quantiles, SD
  Range and IQR are easily obtained from quantile list.
----------------
ROADMAP LECTURE 6:
- Module 3: Measuring association between two quantitative variables
- the trap of causal thinking
- observational data: we cannot infer what causes what
- association between X and Y = pattern of how X and Y values are matched
- what does it mean if there is no association?
  the notion of randomness and random association
- measuring the strength of linear association: CORRELATION
----------------------------------------------------------------
ANNOUNCEMENTS:
Midterm 1
- Wed, Feb 7, 6-8pm
- Rooms to be announced on Monday (webCafe, class)
- 30 questions, multiple choice, 4 choices
- Material: Modules 1 + 2 + 3, Homework 1 (no JMP instructions)
- closed books and closed notes
  but: permission to bring 1 letter size sheet of notes
       hand-written or computer printed on both sides
- pocket calculator permitted
Homework 2
- will be issued on Friday, Feb 2, on webCafe
- due on Friday, Feb 9, 5pm
----------------
RECAP LECTURE 6:
- Theme: Association between two quantitative variables
- Preliminaries 1: assumption of causality to be avoided
  When we see association, we usually don't know causation.
  Ex.: falling and breaking bones in elderly
- Preliminaries 2: causality can be derived from
  * controlled experiments
  * strong theory (some theories are REALLY strong)
- 'Observational' data
  * pervasive
  * cannot infer causality between variables
  Ex.: popping pills and health
- Definition of 'Association':
  the pattern in which X and Y values are matched
- Use of observational data: PREDICTION
  * If X and Y are associated => knowing X gives knowledge about Y
  * This holds even if X is not the cause of Y.
  * The knowledge that X gives about Y is usually only weak,
    in a gambling sense:
    If someone's verbal SAT is low, it is somewhat more likely
    that the math SAT is not high.
- Types of association:
  * pos - neg
  * linear - nonlinear (curved)
  * continuous - clustered (clumped)
  More than one of these types can be present.
- What is 'no association'?
  A: The matching pattern of X and Y values is random.
  computer experiment: randomly match X with Y values
----------------
- Measuring the strength of linear association: CORRELATION
- Def: cor(X,Y) = cov(X,Y) / ( s(X) * s(Y) )
- Properties:
  * cor(X,Y) is between -1 and +1
  * cor(X,Y) = +1 iff Y = aX + b, a>0, perfectly
  * cor(X,Y) = -1 iff Y = aX + b, a<0, perfectly
- Def: X and Y are said to be UNCORRELATED if cor(X,Y)=0
----------------------------------------------------------------
ANNOUNCEMENTS:
Midterm 1
- Wed, Feb 7, 6-8pm
- Rooms: Section 001: Nursing Education Bldg AUD
         Section 002: Logan Hall 17
- 30 questions, multiple choice, 4 choices each
- Material: Modules 1 + 2 + 3, Homework 1 (no JMP instructions)
- closed books and closed notes
  but: permission to bring 1 letter size sheet of notes
       hand-written or computer printed on both sides
- pocket calculators permitted (even with statistical functions)
  no text or image storing devices of any kind
  no communication devices of any kind
- those who wish to wear a hat with edge or shade must sit in the front row
- bring a #2 pencil for completing the bubble form
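Illustration for the Lecture 6 recap: a minimal Python sketch of the definition cor(X,Y) = cov(X,Y) / (s(X) s(Y)) and of the "randomly match X with Y values" computer experiment. The data are made up for illustration (not course data), the function name `cor` is hypothetical, and the course itself uses JMP, not Python.

```python
import random
from statistics import mean, stdev

def cor(x, y):
    """Sample correlation: cov(x, y) / (s(x) * s(y)), using N-1 throughout."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

# Made-up data: perfect linear association gives cor = +1 or -1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_up = [2 * a + 1 for a in x]        # Y = aX + b with a > 0
y_down = [-3 * a + 10 for a in x]    # Y = aX + b with a < 0
r_up = cor(x, y_up)
r_down = cor(x, y_down)

# "No association" = random matching: shuffle the y values against x.
rng = random.Random(0)               # fixed seed so the run is repeatable
big_x = [rng.gauss(0, 1) for _ in range(2000)]
big_y = [2 * a + rng.gauss(0, 1) for a in big_x]   # linearly associated with x
r_linked = cor(big_x, big_y)

shuffled_y = big_y[:]
rng.shuffle(shuffled_y)              # destroys the matching pattern
r_random = cor(big_x, shuffled_y)

print("perfect positive:", round(r_up, 6))
print("perfect negative:", round(r_down, 6))
print("associated:", round(r_linked, 3), " randomly matched:", round(r_random, 3))
```

After shuffling, each variable keeps its own distribution; only the matching changes, which is why the correlation drops to roughly zero.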
----------------
RECAP LECTURE 7:
- getting rid of the `causality trap':
  x-y association means x-y pattern, not x->y cause
- `no association' means `random association'
- correlation measures `degree of linear association'
  with values between -1 and +1
- correlation in practice gives non-zero values even if the
  association is not linear but positive or negative (curved, clustered)
- weirdness can happen to correlation in the presence of
  outliers or grouping
----------------
ROADMAP LECTURE 8:
- a new angle on correlation: diagonal dispersion (width)
    cor(x,y) = [ var(x+y) - var(y-x) ] / 4
  if x and y are z-scores
- fitting straight lines to linearly associated pairs of variables
  . regression coefficients: slopes and intercepts
  . the Least Squares method
  . residuals
  . formulas for Least Squares regression
----------------------------------------------------------------
COMMENTS: webCafe Discussion
- HW 2: Raleigh Temperatures -- mea culpa...
- Quiz 1: savings acct balances
- Please, report typos, mistakes in Modules, HWs,...
  Do not assume that everything is correct!
- You are welcome to ask questions that go beyond the material as presented
----------------
ANNOUNCEMENT: Quiz 2, Wed, Feb 14
- Material: Modules 3 and 4
----------------
ANNOUNCEMENT: Midterm 1
- Today, Wed, Feb 7, 6-8pm
- Rooms: Section 001: Nursing Education Bldg AUD
         Section 002: Logan Hall 17
- 30 questions, multiple choice, 4 choices each
- Material: Modules 1 + 2 + 3, Homework 1 (no JMP instructions)
- closed books and closed notes
  but: permission to bring 1 letter size sheet of notes
       hand-written or computer printed on both sides
- pocket calculators permitted (even with statistical functions)
  no text or image storing devices of any kind
  no communication devices of any kind
- those who wish to wear a hat with edge or shade must sit in the front row
- bring a #2 pencil for completing the bubble form
----------------
RECAP LECTURE 8:
- correlation as estimating the diagonal width, or
  difference of diagonal widths in standardized z-scores:
    cor(x,y) = [ var(y+x) - var(y-x) ] / 4
             = [ var(y-(-x)) - var(y-x) ] / 4
  if x and y are z-scores
  (no, this is not on the midterm...)
- exact definition of "linear association"
    mean(y|x) = b1*x + b0  <=>  y is linearly associated with x
  Note: definition is asymmetric
    NOT  y is linearly associated with x <=> x is linearly associated with y  NOT
  Illustration: Weight(y)-Height(x) data
- Fitted lines: yhat = b1*x + b0
- Terminology related to fitted lines:
  . predictor variable, predictor, x-variable
  . response variable, response, y-variable
  . slope
  . intercept
  . estimates or predictions from a fitted line: yhat for a given x
  . simple linear regression = fitting straight lines
                             = fitting linear equations
    ... to x-y data
----------------
ROADMAP LECTURE 9:
- The Least Squares or LS method for fitting straight lines
  . residuals
  . residual sum of squares: RSS
  . two applets for illustration
  . why squares? why vertical distance?
- Explicit formulas for LS estimates of slope and intercept
  .
    connection between slope and correlation
  . LS in standardized z-scores
- The asymmetry of the regression problem:
  predicting y from x requires a different equation from
  predicting x from y (you can't just solve y=b1*x+b0 for x)
----------------------------------------------------------------
ANNOUNCEMENT: Midterm 1
- TAs are entering scores on webCafe
- Stats and barplot are posted
----------------
RECAP LECTURE 9:
- Concepts: residuals = e_i = y_i - yhat_i
            RSS = sum of e_i^2
- Least Squares (LS) fitting of straight lines: minimize RSS(b0,b1)
- Questions:
  * Why squares?
  * Why vertical distances from the line?
- Simple Linear Regression in practice
  * Diamond data
  * interpretations of slope and intercept:
    slope = mean difference in y for a unit difference in x
    intercept = mean of y at x=0
    (avoid causal trap)
    (beware of extrapolation)
----------------
ROADMAP LECTURE 10:
- Use of simple linear regression for prediction:
  * JMP mechanics
  * interpretation: yhat(x) = b1*x + b0 = estimated mean of y at x
  * beware of extrapolation!
- More practice:
  * Cars data (city and highway driving)
- Explicit formulas for LS estimates of slope and intercept
  * connections with correlation
  * regression on z-scores
  * x->y and y->x are not the same: proof in formulas
- Changing units in a regression equation
----------------------------------------------------------------
QUIZ 2:
- fill in name, section (top of page 1)
- write answers in CAPITAL LETTERS in boxes on page 2
- 20 minutes
----------------
RECAP LECTURE 10:
- Regression in practice:
  . Diamond data: weight -> price
  . Cars data: MPG Highway -> MPG City
- Issues:
  . interpretation of slopes, intercepts, predictions
  . in all three cases: 'on average' in the response
  . danger of extrapolation (sometimes expected: diamond prices)
- Formulas:
  . slope = corr(x,y) * s(y)/s(x)   rescaled correlation, units(y/x)
  . intercept = ybar - b1*xbar
  . on z-scale: slope = corr, intercept = 0
  .
    asymmetry x->y vs y->x visible in formula for slope
    (can't solve a prediction equation to reverse the prediction)
- Unit conversions in regression equations
  . universal method: substitute old units with new units
    e.g., precip(mm) = 25.4 * precip(in)
    compare with target equation after some algebra,
    identify slope and intercept in new units
----------------
ROADMAP LECTURE 11:
- Measuring the quality of fit:
  * RSS
  * s_e = RMSE (residual standard deviation)
  * R Squared: "fraction of variance explained"
- Diagnostics: fitting curves to see whether fitting a line is reasonable
- Prediction quality and prediction intervals
----------------------------------------------------------------
ANNOUNCEMENTS:
- Homework 3 posted, due Fri, Feb 23, 5pm
- Quiz 2 solutions are posted, scores on webCafe shortly
- Factoid: income taxes and skew distributions
----------------
RECAP LECTURE 11:
- Measuring the quality of fit:
  * RSS
  * s_e = RMSE (residual standard deviation)
  * R Squared: "fraction of variance explained"
- Diagnostics: fitting curves to see whether fitting a line is reasonable
----------------
ROADMAP LECTURE 12:
- Residual plots: diagnosing problems with line fits
- Prediction quality:
  idea: small residuals promise good predictions
  => use the RMSE as a measure of prediction quality
- Prediction intervals:
  idea: use quantiles of the residual distribution
  => intervals around predictions
- down the line: logarithms to describe multiplicative changes
----------------------------------------------------------------
ANNOUNCEMENT:
- Homework 3 due this Friday, Feb 23, 5pm
- Next week: Quiz
----------------
RECAP LECTURE 12:
- Residual plot: for seeing problems in the residuals
  . plot of residuals e versus predictor x
  . amounts to pulling the line into horizontal position
  .
    there is a problem if x contains information about e
    example: small x  => e positive  \
             medium x => e negative   |=> convex curvature
             large x  => e positive  /
- Prediction quality:
  Idea: Small residuals promise good predictions.
  => Use the RMSE as a measure of prediction quality
  (in principle any dispersion measure could be used
   to measure prediction quality)
- Prediction intervals:
  Idea: Use quantiles of the residual distribution.
  If 19 out of 20 residuals are contained in the interval [lo,hi]
  (lo is typically negative), then
    [b0 + b1*x + lo, b0 + b1*x + hi]
  has a chance of capturing about 19 out of 20 future observations (x,y).
  Example: Diamond.JMP, interval ~ -260 + 3720*Weight +- 78
- General conclusion: Extracting residuals is extremely useful.
  1) residual plots for diagnostics
  2) RMSE to measure prediction quality
  3) residual distribution to construct prediction intervals
----------------
- Percentage and factor changes
- Logarithms to additively describe multiplicative changes
----------------
ROADMAP LECTURE 13:
- Properties of logarithms and the exponential function
- Application of logarithms to regression:
  Transform certain non-linear associations to linear associations.
- Business application 1: Exponential growth/decay, depreciation
- Business application 2: Logarithmically diminishing returns
- Business application 3: Constant elasticities
----------------------------------------------------------------
Announcements:
- Quiz 4 on Wed, Feb 28
- Quiz material: Modules 1-4, Module 5 p. 1-8
- next week: Spring Break !!!
- after this lecture we are past the half-way point of the semester
  (14 out of 27 lectures)
----------------
RECAP:
- logarithms to map some types of non-linear association
  to linear associations
- Type 1: exponential growth/decay or depreciation
  trick: ln(y)
- Type 2: logarithmically diminishing returns
  trick: ln(x)
- Type 3: constant elasticities, power laws
  trick: ln(x), ln(y)
----------------
ROADMAP LECTURE 14:
- data examples for each
- interpretations of equation
- meaning of residuals, R Square and RMSE in each case
- residual analysis
----------------
Trick questions: the ultimate diamond story
- Where is the largest diamond anywhere?
- How many large diamonds are there?
----------------------------------------------------------------
ANNOUNCEMENT:
- next week is Spring Break
----------------
RECAP:
- Data example of exponential growth:
  number of web servers Jan 1997 - Dec 2000
- JMP mechanics: two ways to fit an exponential
  . Fit Y by X to x-y > Fit Special > log(y)
    shows a fitted exponential in x-y plot
  .
    create a column 'ln(y)' and Fit Line to x-ln(y)
    shows a fitted line in x-ln(y) plot
  same equation 'log(y) = b0 + b1*x'
  and same R Square and RMSE (both referring to x-ln(y))
- Interpretation of equation:
    ln(y) = b0 + b1*x  =>  y = exp(b0) * exp(b1)^x
  => change factor is f = exp(b1)
     original amount at x=0 is Z = exp(b0)
  Ex.: b1 = 0.9  => f = exp(b1) = 2.46 => 146% growth/yr
       b0 = 13.4 => Z = exp(b0) = 660,000
- R Square very high and yet residuals not random
- interpretation of RMSE = 0.06:
    ln(y) - (b0+b1*x) ~ 0.06
    y / exp(b0+b1*x) ~ exp(0.06) ~ 1.06, i.e., off by about 6%
----------------
ROADMAP LECTURE 15:
- Logarithmically diminishing returns
  data example: liquor store chain
                display space and sales of new product
- Elasticity
  data example: FedEx parcel service
                unit price and sales
- Re-analysis of diamonds data with power law
----------------
GENERAL INTEREST:
- yesterday's drop of the Dow by 3.5%
- survey of young people: the narcissistic generation
- Foreign Affairs: mean/median duration of civil wars
- Antioxidants don't help you live longer: meta-analysis, contested
----------------------------------------------------------------
GENERAL INTEREST:
- Yes, we invaded the country of Liechtenstein...
- Insider trading cases
----------------
ANNOUNCEMENT:
- Homework 4 posted, due Monday, March 19, 5pm
- reminder for instructor: bug publisher about JMP CD
----------------
RECAP LECTURE 15:
- logarithms, exponentials,...
- idea, expressed in terms of quantitative variable types:
  . straight line fits work when x and y are interval scale
  . extend straight line fits to cases when x and/or y are ratio scale
  . logarithms transform ratio scales to interval scales
- business situations one can model with logarithms and straight lines:
  . exponential growth/decay: ln(y) ~ a + b*x
  . diminishing returns:      y ~ a + b*ln(x)  (b>0)
  .
    constant elasticities:    ln(y) ~ a + b*ln(x)  (b<0)
----------------
ROADMAP LECTURE 16:
- Homework 4 guidelines: Honda Accord data
  fit, judge, interpret all three logarithmic models
  * ln(Price) ~ a + b*Age
  * Price ~ a + b*ln(Age)
  * ln(Price) ~ a + b*ln(Age)
  Steps:
  * show a plot of response versus predictor, logged,
    with straight line fit (Fit Line) and its residual plot
  * show a plot of raw response versus raw predictor,
    with curve fits (Fit Special, select log(x) or log(y) or both)
    and its residual plot
  * judge the fit visually:
    . over/under-predictions?
    . non-constant variation in the residuals?
      (residuals from Fit Y by X matter, not from Fit Special)
    . how well do you expect extrapolation to work for high Ages?
  * extract the fitted equation and show it up front
  * unreasonable predictions:
    check whether/where predicted Price becomes negative
  * interpret the slope
  * interpret the intercept
  * note and comment on R2
    . is it high in absolute terms?
    . is it higher/lower than other models?
  * interpret RMSE:
    . if y is un-logged (raw), then
      observed y is off the prediction on the order of +-RMSE
    . if y is logged, then
      ln(y) is off the prediction from the formula on the order of +-RMSE,
      hence, y is off the prediction exp(formula) by a factor on the order of
      exp(+RMSE) on the high side and exp(-RMSE) on the low side;
      translate the factors to percentages:
        y is off its prediction by (exp(+RMSE) - 1)*100% on the high side
        y is off its prediction by (1 - exp(-RMSE))*100% on the low side
      If RMSE < 0.1, use small-change approximations
- re-analyze the Diamond data with ln(x) or ln(y) or both
  as a template for Homework 4
- crazy stories...
- probability...
  . idealization
  . P(A) = limit of relative frequency of A in many independent repetitions
----------------------------------------------------------------
RECAP LECTURE 16:
- Moral tale of insider trading: there is no joke there...
  It's wrong, and you're wrecking your life.
- Logarithmic models: practice in HW 4
  * HW 4, Problem 1: analyze with ln(x)-y, x-ln(y), and ln(x)-ln(y)
    according to the above protocol
  * IMPORTANT: Fit Line and Fit Special do the same thing.
    They only show the results in different ways.
    Example: What is a line in a plot of y vs ln(x)
             is a curve in a plot of y vs x!!!
- Module 6: Probability quantifies uncertainty.
- Weird stories: whale on the yacht, the ultimate "Homeward Bound",...
----------------
ROADMAP LECTURE 17:
- Probability:
  * probability = limit of relative frequency as N -> Infinity
    (fine print to follow today)
  * Example:
    #{daily S&P drops >3% in N days} / N -> P(daily S&P drops >3%)
    as N -> Infinity
- Axioms and properties of probability, in particular:
  * Bayes rule
  * marginalization
  * Application: A drug test is 99% accurate, so if someone tests
    positive, we're pretty sure he/she uses drugs, right?
- Weird stories, installment 2,...
----------------------------------------------------------------
ANNOUNCEMENTS:
- HW 4 is due today 5pm.
- Quiz 4 has been moved from Wed to Mon next week.
----------------
RECAP LECTURE 17:
- Probability
  * concepts: sample space = set of possible random outcomes
              random event = any subset of sample space
  * P(event) = lim #{A occurs in N observations}/N as N->infinity
  * axioms: something like weight with total=1
    . P(A) non-negative
    . P(Omega) is one
    . P(A or B) = P(A) + P(B) if A and B are disjoint
  * derived properties:
    . complement rule
    . monotonicity rule
    . general addition rule
    . summation rule for any number of disjoint events
- Conditional probability
  * definition: P(A|B) = P(A and B) / P(B)
  * trivial property: P(B|B) = 1
- Weird stories:
  * late mail
  * traffic accidents
  * rings lost/found
  * lightnings
----------------
ROADMAP LECTURE 18:
- Ramifications of conditional probability:
  * general product rule
  * marginalization rule
  * applications:
    . genes and disease
    .
      drug testing
- Notion of independent events
- Random variables: random numeric outcome
  (random event: random binary outcome, yes or no)
- Expected values of random variables:
  probability-weighted average of a random variable
- Weird stories:
  * more lightnings
  * long-lost uncle Bill
----------------------------------------------------------------
ANNOUNCEMENTS:
This coming Monday, March 26:
  - Quiz 4
  - HW 5 to be issued
- Solutions for Retake of Quiz 3 are posted
- so are histograms of all Quizzes, HWs, Midterm
- There are some nameless Quiz 3 Retakes. See the instructor
----------------
RECAP LECTURE 18:
- Conditional probability:
  * general product rule: P(A and B) = P(A|B)*P(B)
  * marginalization rule: P(A) = P(A|B)*P(B) + P(A|not B)*P(not B)
  * Bayes rule (a combination of the above)
  * applications:
    . rare genes and rare diseases
    . drug testing
  * Important concepts of "false positives" and "false negatives".
- Independent events:
  Knowing that B occurred does not change my opinion
  about the probability of A.
    P(A|B) = P(A)
  The product formula holds ONLY for independent events:
    P(A and B) = P(A)*P(B)
- Weirdness: lightnings
----------------
ROADMAP LECTURE 19:
- Random Variables: random number outcomes
  Initially: discrete random variables
- Expected values = population means
- Population variances and standard deviations
- Continuous random variables
- The normal distribution or the "bellcurve"
----------------------------------------------------------------
ANNOUNCEMENTS:
- Quiz 4 solutions will be posted tomorrow.
- HW 5 is on webCafe: Theme: the Central Limit Theorem
- There is one nameless Quiz 3 Retake. See the instructor.
----------------
RECAP LECTURE 19:
- Random Variables: random number outcomes
  Initially: discrete random variables
- Expected values = population means (Location)
  . Probability weighted means
  . Limit of the mean of a column as N -> Infinity
  . Essential properties:
    Linearity: E(aX+bY) = aE(X)+bE(Y) (think: portfolios)
  .
    Other properties not yet mentioned:
    Monotonicity: X >=0 => E(X) >=0 (higher triviality)
    Constants: E(c) = c (higher triviality)
  . Properties of a location parameter follow:
    E(X+c) = E(X)+c; E(cX) = cE(X)
- Population variances and standard deviations (Dispersion)
  . Pop. Variance = expected value of 'squared deviation from the mean'
  . Pop. SD = root of Pop. Variance
  . Notation: sigma^2 and sigma, resp.
  . Essential properties: dispersion
    sigma(X+c) = sigma(X)
    sigma(cX) = |c|sigma(X)
- Weird stories: 'Long lost Uncle Bill' (separation/reunion stories)
----------------
ROADMAP LECTURE 20:
- Population covariance and correlation
- Continuous random variables:
  Random number outcomes for dollars, percents, miles, years,...
- The bellcurve:
  'The interval [mean+-2SD] catches 19 out of 20 observations.'
- The root-N law: Module 7
  Why doubling the sample size does not cut the error by 50%.
----------------------------------------------------------------
ANNOUNCEMENTS:
- The first posting of the solutions of Quiz 4, Section 2 had a mistake.
  Correct: "relative frequency of survivors among females"
- HW 5: due Tue, April 3, 5pm (moved from the day before)
- Quiz 5: Wed, April 4
- The JMP saga...
Note to instructor: regrade of Section 2 Quiz 4 !!!!!!!!!!!!!!!!!!!!!!!!!
- Shameless advertisement: PRI's "Fair Game" with Faith Salie
  Intelligent, satirical hour about daily events and the arts.
  . Neil deGrasse Tyson on "Death by Black Hole"
  . "Born Ruffians" from Toronto
  . Wharton's Justin Wolfers on "prediction markets" (past Monday)
----------------
RECAP LECTURE 20:
- Population covariance and correlation
  Context: discussion of population quantities obtained from
  sample quantities for N -> Infinity
- Continuous random variables:
  . Described by density functions
  . Weirdness: any given value has probability zero,
    yet one value will be observed...
- The bellcurve: density ~ exp(-x^2/2)
  . "Empirical rule":
    The interval [mean+-2SD] catches 19 out of 20 observations.
    The interval [mean+- SD] catches 2 out of 3 observations.
  . Pervasive in:
    * nature, and in
    * statistical inference (coming up: the Central Limit Theorem)
- The root-N law:
  . Scenario: a mind game...
    Imagine that many datasets had been collected, and imagine
    the datasets all had the same variables and the same size N.
    Then the mean of a given variable takes on slightly different
    values across the datasets.
    Q: Can we measure the variability of the mean across datasets?
    A: Yes, it is V(mean) = V(variable) / N.
- Weird stories: 'freaky falls'
----------------
ROADMAP LECTURE 21:
- More about the root-N law
- The Central Limit Theorem:
  Not only will the mean have a shrinking variance as N -> Infinity,
  it will also be ever more normally distributed.
----------------------------------------------------------------
UPFRONT: Steven Pinker on violence:
  http://www.edge.org/3rd_culture/pinker07/pinker07_index.html
----------------
ANNOUNCEMENTS: reminders
- HW 5: Thursday, April 5, 5pm
- Quiz 5: this Wed, April 4
----------------
RECAP LECTURE 21:
- Means vary from dataset to dataset
- Root-N law:
  population SD(mean) = population SD(observations) / root-N
- Standard error = population SD(mean)
- Estimation: estimate the unknown sigma = population SD(observations)
  with s = sample SD(observations)
- Standard error ESTIMATE of the mean:
  sample SD(observations) / root-N
- Weirdness: We are learning from a single dataset something
  about the variability of means across datasets...
- Weird stories: Sheer coincidences
----------------
ROADMAP LECTURE 22:
- Central Limit Theorem:
  Variability of means across datasets is ever more normal
  the larger the dataset size N is.
- Apply the empirical rule to distribution of the means across datasets
- Confidence intervals: "mean +- 2 stderr"
- Confidence intervals are random but have a 95% chance
  to catch the population mean.
- Weird stories: ironic and tragic exits
----------------------------------------------------------------
REMINDERS:
- Midterm 2 on Mon, April 9, 6-8pm
  see webCafe for room assignments
- HW 5 due tomorrow Thu, 5pm
- Instructor's office hours tomorrow Thu noon-1:30pm
  instead of Friday (this week only).
  Alex Braunstein's office hours are tomorrow 1-3pm.
- Alternate date for Midterm 2: Tue, April 10, 3pm (Stat Dept)
  send email to the instructor to certify either religious observance
  or conflict with another important class (be specific).
----------------
RECAP LECTURE 22:
- general theme: variation of means across datasets
  standard error is a standard deviation in a special case:
  when the variables are means across datasets
    standard error = sigma(means) = sigma(observations)/root-N   (root-N law)
    standard error estimate = s(observations)/root-N
- Central Limit Theorem (CLT):
  the variation is approximately normal for N>=50
  never mind the distribution of the observations
  from which the means are formed
- The normal approximation gets better the larger N is.
- For illustration see HW 5:
  You examined the variation of 50 means from observations of
  . coin tosses (N=30 and N=3000)
  . a skewed distribution (N=100)
- Three logically equivalent statements derived from the CLT:
  . P( | mean - mu | < 2stderr ) ~ 0.95
  . P( mean - 2stderr < mu < mean + 2stderr ) ~ 0.95
  . P( mu - 2stderr < mean < mu + 2stderr ) ~ 0.95
  The middle statement describes the "coverage probability"
  of a confidence interval:
    CI := (mean - 2stderr, mean + 2stderr)
  The middle statement one more time:
  . P( mu is in the CI ) ~ 0.95
----------------
ROADMAP LECTURE 23:
- the trade-off between precision and uncertainty
- the special case of proportions/relative frequencies
  with an application to electioneering
- probability versus evidence:
  . population means mu assign probabilities to sample means
  .
    sample means assign evidence to population means
----------------------------------------------------------------
REMINDER: Midterm 2 tonight 6-8pm
ANNOUNCEMENT: last homework is replaced with another Quiz (6)
----------------
RECAP: second half of this lecture -- Q&A for Midterm 2
----------------
ROADMAP LECTURE 24:
- Statistical Testing
- Null hypotheses
- Test statistics
- Significance levels
- P-values
----------------------------------------------------------------
REMINDER: Quiz 6, Monday, April 16
          instead of a Homework 6
----------------
RECAP LECTURE 24: Statistical Testing
- statistical tests    <-> mu-centered
  confidence intervals <-> Xbar-centered
- null hypotheses
- test statistics
- null distribution
- significance levels
- rejection regions
----------------
ROADMAP LECTURE 25:
- have another go at Statistical Testing
- weird stories: pre-occupations of the mind,
  a scary story in Florida, and a kitty story in Saratoga
  Next week we'll discuss how to think about weirdness, rare events...
----------------------------------------------------------------
ANNOUNCEMENTS:
- Midterm 2 results will be posted starting Wednesday.
- Wednesday is the last class.
- Final Exam: Thu, May 3, 6-8pm
- Office hours will be scheduled the days before the final exam.
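Illustration for the "statistical tests <-> confidence intervals" correspondence: a minimal Python sketch, using the notes' rule of thumb of 2 standard errors. The data and the function names (`stderr_of_mean`, `conf_interval`, `rejects`) are made up for illustration; the course itself works in JMP. The point is that H0: mu = mu0 is rejected exactly when mu0 falls outside the Xbar-centered interval.

```python
from math import sqrt
from statistics import mean, stdev

def stderr_of_mean(data):
    """Standard error estimate of the mean: s(observations) / root-N."""
    return stdev(data) / sqrt(len(data))

def conf_interval(data):
    """Xbar-centered confidence interval: mean +- 2 stderr."""
    m, se = mean(data), stderr_of_mean(data)
    return (m - 2 * se, m + 2 * se)

def rejects(data, mu0):
    """Test of H0: mu = mu0. Reject when |t| > 2, with t in units of stderr."""
    t = (mean(data) - mu0) / stderr_of_mean(data)
    return abs(t) > 2

# Made-up sample (not course data).
data = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.9, 10.6, 10.3, 9.7]
lo, hi = conf_interval(data)

for mu0 in (9.0, 10.0, 10.5, 11.0):
    verdict = "rejected" if rejects(data, mu0) else "retained"
    print(f"mu0 = {mu0}: H0 {verdict}; mu0 inside CI: {lo < mu0 < hi}")
```

With only 10 observations the cutoff 2 is a rough stand-in for the exact t quantile; the sketch keeps 2 because that is the convention used in these notes.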
----------------
RECAP LECTURE 25:
- testing statistical hypotheses
- null hypotheses: mu = mu0
- test statistics: |t| = distance(Xbar, mu0) in units of stderr
- rejection: |t|>2
- significance level: P(rejection | null hypothesis) ~ 0.05
- p-value: achieved significance level
  a measure of evidence in favor of the null hypothesis
  (a small p-value is evidence against the null hypothesis)
----------------
ROADMAP LECTURE 26:
- testing mean differences between TWO GROUPS
- standard null hypothesis: mu1 = mu2
- vastly more important than testing specific values of mu
- recipe: find a standard error for Xbar1-Xbar2
- test statistic t = (Xbar1-Xbar2)/stderr
- examples from the Penn Student Survey: comparison of the sexes
----------------------------------------------------------------
RECAP LECTURE 26: see ROADMAP LECTURE 26
----------------
ROADMAP LECTURE 27: MODULE 10
- goal: testing slopes in simple linear regression
- definition of population parameters:
  . define the simple linear regression model (SLRM)
  . consider datasets as generated from the model
    => the model parameters (slope, i'cept, SD off the line)
       are the population parameters
- illustrate the SLRM with simulations: "forging diamond data"
- find a standard error and standard error estimate for the slope
- play statistical inference for the slope:
  . test H0: population slope = 0 (why this null hypothesis?)
  . form CIs for the population slope
- two examples analyzed in JMP
  . the Diamond data
  . the Penn Students survey: are older students taller?
- weirdness: a discussion
----------------------------------------------------------------
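Illustration for the Lecture 27 roadmap: a minimal Python sketch of inference for a regression slope. The notes do not spell out the standard error formula, so the sketch assumes the usual textbook one, stderr(b1) = RMSE / sqrt(sum of (x - xbar)^2) with RMSE = sqrt(RSS/(N-2)). The data and the function name `slope_inference` are made up for illustration; the course itself uses JMP.

```python
from math import sqrt
from statistics import mean

def slope_inference(x, y):
    """LS line, RMSE, and a 2-stderr test and CI for the slope."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    b1 = sxy / sxx                    # LS slope (equals cor * s(y)/s(x))
    b0 = ybar - b1 * xbar             # LS intercept
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    rss = sum(e * e for e in residuals)
    rmse = sqrt(rss / (n - 2))        # assumed convention: N-2 in the denominator
    se_b1 = rmse / sqrt(sxx)          # assumed textbook standard error of the slope
    return {
        "b0": b0, "b1": b1, "rss": rss, "rmse": rmse, "se_b1": se_b1,
        "t": b1 / se_b1,                           # test of H0: population slope = 0
        "ci": (b1 - 2 * se_b1, b1 + 2 * se_b1),    # "estimate +- 2 stderr"
    }

# Made-up data (not the Diamond data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
res = slope_inference(x, y)

print("slope:", round(res["b1"], 3), " intercept:", round(res["b0"], 3))
print("t for H0 slope=0:", round(res["t"], 1), " reject:", abs(res["t"]) > 2)
print("CI for slope:", tuple(round(v, 3) for v in res["ci"]))
```

As with the mean, H0: slope = 0 is rejected exactly when 0 lies outside the interval; with very small N the cutoff 2 is only a rough guide.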