#================================================================

LECTURE 15:

* ORG:

  - HW 6, writing exercise, due Wed in class:
    hand in printed version in class AND email attachment as usual.

  - HW 7 to be posted: Real estate project,
    an exercise in interpretation and communication,
    NOT model selection (the model will mostly be dictated).


* RECAP:

  - Purpose of logarithms in statistical modeling?
  - Characterization of ratio scales?
  - How are factor changes expressed in business (and elsewhere)?
  - What factor change does a yearly growth rate of 146% imply?
  - If a predictor variable is log-transformed,
    how do you interpret the slope?
  - If a response variable is log-transformed,
    how does it affect the interpretation of slopes?
  - What is elasticity?

  - Traps:
    . Causality: "When x1 is increased by one unit,..."
        ==> In observational data we can't "increase" x.
            Express in terms of cases that "differ by ... units in x".
    . Observations versus averages: "... y increases by..."
        ==> (1) Express in "DIFFERENCES", not "changes" and "increases".
            (2) Express association with y in "AVERAGE differences".
    . Estimates vs parameters: "the slope is ..."
        ==> "The ESTIMATED slope is ..."

  - Interpretations of regression estimates:
    . bj = Est. ave. difference in y for a unit difference in xj
           when all other predictors are the same.
         = Est. ave. difference in y for pairs of cases
           that ONLY differ by a unit in xj.
      [ "Unit change in xj" and "holding all other predictors fixed"
        are formulations often used but misleading because of
        their implication of active experimentation. ]

  - Other regression quantities:
    . Interpretation of b0?
        "est. ave. y at x=0"
    . Interpretation of stderr(b1)?
        "est. stdev of b1 varying from dataset to dataset"
        [ caveat: conditioning on x ]
    . Interpretation of t1?
        "t1 = b1/stderr(b1) = dist of b1 from 0 in multiples of stderrs"
    . Interpretation of p-value(b1)?
        "prob of observing a t1 more extreme than the observed value,
         dataset to dataset, under the assumption beta1=0"
    . Interpretation of s=RMSE?  (R's "Resid. stderr" is a bad term)
        "est. sdev of errors"
    . Interpretation of R Square?
        "fraction of the variation of y accounted for by the model"

  - Ratio scales versus interval scales:
    . Interval scales:
      + Values between -Inf and +Inf
      + Differences/changes expressed in terms of additive amounts
    . Ratio scales:
      + Values > 0
      + Differences/changes expressed in terms of multiplicative factors
        ACTUALLY: ... in terms of percents
    . Purpose of logs: map ratio scales to interval scales
      (multiplicative factors become additive amounts)

  - Use of logarithms in linear models:   ... simplified ...

    0)      y   =  b0 + b1 * x + eps          LINEAR ASSOCIATION
            constant difference in x        ~  constant difference in yhat

    1)  log(y)  =  b0 + b1 * x + eps          EXPONENTIAL GROWTH/DECAY/DEPRECIATION
            constant difference in x        ~  fractional (%) difference in yhat

    2)      y   =  b0 + b1 * log(x) + eps     DIMINISHING RETURNS (for b1>0)
            fractional (%) difference in x  ~  constant difference in yhat

    3)  log(y)  =  b0 + b1 * log(x) + eps     CONSTANT ELASTICITY
            fractional (%) difference in x  ~  fractional (%) difference in yhat

    Make choice of log based on ratio scale interpretation.


* ROADMAP:

  - A protocol for reporting regression results
  - More MBA Stats:
    . Total cost vs average cost, and models thereof
    . Fixed costs vs variable costs and their roles in models of the above


================================================================

* A PROTOCOL FOR REPORTING REGRESSION FINDINGS:

  - Target audience: non-statisticians (your manager...)
    . Principles: avoid technical terms,
                  report relevant subject matter,
                  keep stats and its qualifications at low volume,
                  except when bearing bad news ...

  - Example:
      my.cars <- cars[,c("Weight.lbs","Horsepower","MPG.Highway")]

  - Explain purpose and data:
      "The goal is to analyze gas mileage of current car models
       and describe to what extent it is associated with two
       major model characteristics: weight and horsepower."

  - Summarize the predictors and response:
      summary(my.cars)
      "Current car models average about 3,500 lb, 200 HP, and 26 MPG on the highway.
       Some models are as light as 1,850 lb and as heavy as 6,400 lb.
       The engines range from 67 to 660 HP, and gas mileage from 12 to 68 MPG."

  - Pick an imagined case with heavily rounded mean or median values as predictors.
    Example:
      apply(my.cars, 2, mean, na.rm=TRUE)
    ==> "For reference, consider hypothetical car models roughly in the
         middle of the pack, with weight 3,500 lb and 200 HP."

  - Fit a model:
      my.cars.lm <- lm(MPG.Highway ~ Weight.lbs + Horsepower, data=my.cars)
    and give a reference range (PI) of the response values for the reference models:
      # Assumes Weight.lbs is recorded in pounds, as its name and the summary
      # above suggest (use 3.5 instead if the data store weight in 1000 lb):
      my.cars.refy <- predict(my.cars.lm,
                              newdata=data.frame(Weight.lbs=3500, Horsepower=200))
      my.cars.refy
      # Closer to 27 than 26 because our reference values are low-balling the means.
      # Standard Gaussian PI:
      my.cars.refy + c(-2,2) * summary(my.cars.lm)$sigma
      # Quantile-based non-parametric PI:
      my.cars.refy + quantile(resid(my.cars.lm), probs=c(.025,.975))
      # (Better: use standardized residuals)
    ==> "Such reference car models are predicted to range in mileage
         from as low as 20 MPG to as high as 32 MPG.
         Thus there is considerable variation in highway mileage
         even among models with the same weight and same horsepower."

  - Continue with general comments on the quality of the data and results:
      summary(my.cars.lm)
    ==> "This is compatible with the fact that weight and horsepower
         account for about two thirds (64%) of the variation in MPG.
         (Weight and horsepower are highly statistically significant
         as predictors of MPG, but, as is often the case, this does not
         translate to an equal degree of practical significance.)"

  - Translate the slopes as meaningfully as possible:
      "If we compare models that differ by 500 lb in weight
       but have the same horsepower, the estimated average difference
       in mileage is -2 MPG, quantifying the obvious fact that
       heavier cars have worse mileage."
      "If, on the other hand, we compare models that differ by 10 HP
       but have the same weight, the estimated average difference
       in mileage is -0.3 MPG, again quantifying the obvious fact that
       stronger cars have worse mileage."

  - Mention parenthetically the uncertainty in the slopes:
      "(The 95% uncertainty about these numbers can be described
       by a range -2 +- 0.7 MPG for the 500 lb weight difference
       and -0.3 +- 0.06 MPG for the 10 HP difference.)"

  - Complications:
    + Predictors can be ratio scale, hence you may have used log(x),
      hence you need to consider a percent difference in x.
    + Any other kind of transformation also causes problems.
      E.g., physics says to model "I(1/MPG.Highway)", which would require
      additional non-trivial translations of statements about differences,
      possibly replacing means with medians, and translating endpoints of ranges.


================================================================

* FIXED COSTS/VARIABLE COSTS:

  - Example: A manufacturer of 'blocks' gets orders of certain quantities
    and records their production costs. It is clear that total production cost
    is mostly driven by quantity. But the manufacturer would like to know
    what other factors make some orders more and others less expensive.

    A simple approach would be

      lm(Tot.Cost ~ Units + other factors)

      ==>  Tot.Cost  ~=~  b0 + b1*Units + ... + error

    Interpretation:
      - b0: FIXED COST for any order (usually setup cost)
      - b1: VARIABLE COST, or average production cost per unit
      - The error variance is the same for all sizes of orders.  (Unrealistic?)

    One could argue, though, that whatever factors there are,
    they are more likely to act at the per-unit level:
    amount of material per block, difficulty of material,
    extra features ordered for the blocks, ...
    Their cost effects might multiply up with the number of units ordered.
    This suggests modeling the Average Cost,

      Ave.Cost = Tot.Cost / Units

    as opposed to the Total Cost:

      lm(Ave.Cost ~ other factors)

    Total cost would then be obtained as

      Tot.Cost = Ave.Cost * Units

    But there is still a problem: The model might not contain all the factors
    that make up unit cost. For one thing, we should have a way to have
    fixed cost in the model as well. This seems like a drawback of choosing
    Average Cost as response. The problem can be fixed, though:
    introduce an additional predictor X = 1/Units. See what happens:

      lm(Ave.Cost ~ I(1/Units) + other factors)

      ==>  Ave.Cost  ~=~  b0 + b1/Units + ... + error

    A prediction equation for Tot.Cost is obtained by multiplying up with Units:

      ==>  Tot.Cost  ~=~  b0*Units + b1 + ...*Units + error*Units

    Interpretation:
      - b0: VARIABLE COST
      - b1: FIXED COST
      - The error scales up with the size of an order.

# Hence both models have interpretations of b0 and b1
# as fixed and variable costs, in reverse order.
# But they differ in their error structure, and in how
# other factors affect the Total Cost of the order.
#
# A data example: Cost is given as Ave.Cost.
blocks <- read.table("Data/blocks_red.dat", header=TRUE)
colnames(blocks)
plot(Ave.Cost ~ Units, data=blocks, pch=16)
# Maybe there is a slightly negative slope: lower Ave.Cost
# for more Units?  Economies of scale?
# As an aside, here is Tot.Cost, in $1000:
plot(Ave.Cost*Units/1000 ~ Units, data=blocks, pch=16)
# Certainly doesn't look like a constant error variance!
#
# There is a problem, though: an extremely small order
# visible when plotting against 1/Units:
plot(x=1/blocks[,"Units"], y=blocks[,"Ave.Cost"], pch=16)
# Remove it (keeps only orders with 1/Units < .5, i.e., more than 2 units):
sel <- 1/blocks[,"Units"] < .5
plot(x=1/blocks[sel,"Units"], y=blocks[sel,"Ave.Cost"], pch=16)
#
# Naive model for Tot.Cost=Ave.Cost*Units:
summary(lm(Ave.Cost*Units ~ Units, data=blocks[sel,]))
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)   4146.470    865.000   4.794 3.22e-06 ***
# Units           22.149      1.658  13.356  < 2e-16 ***
# Model for Ave.Cost:
summary(lm(Ave.Cost ~ I(1/Units), data=blocks[sel,]))
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)     31.082      2.273  13.673  < 2e-16 ***
# I(1/Units)    1502.211    416.821   3.604 0.000397 ***
# Typically Tot.Cost models have much higher R^2
# than Ave.Cost models, but this is irrelevant:
# The high R^2 stems from the trivial proportionality with Units,
# which the Ave.Cost model factors out.
# Often, Ave.Cost models are more plausible and
# yield comparable predictions when multiplied up to Tot.Cost.
#
# Btw, the estimates of fixed and variable costs
# differ wildly between the two approaches
# (fixed: about 4,146 vs 1,502; variable: about 22 vs 31 per unit).
# Which one would you believe?


* Homework 7: Commercial real estate rents, with the variables listed below.

  Translation: Tot.Cost = 'RentTotal', Units = 'SqftLease'.

  Practice for you:
  - Which variables should be considered as fixed cost,
    and which as variable cost?
  - Write down for yourself a Tot Cost model and an Ave Cost model.
    In the analysis use an Ave Cost model.

  As a criterion for distinguishing fixed and variable cost predictors,
  keep the following in mind:
  - fixed cost predictors affect the rent in terms of a fixed $ amount,
    no matter what the size of the property is;
  - variable cost predictors scale up with the size of the property;
    they affect the rent in terms of percentages of the total rent.

  Another way to distinguish them:
  Does a change in the predictor affect the rent in terms
  - of the whole property or
  - of a square foot of the property?

  With these considerations in mind, how would you introduce
  the following variables in the two equations:

    RentTotal            =  b0 + b1*SqftLease   + b2*...
    RentTotal/SqftLease  =  a0 + a1*1/SqftLease + a2*...

    VARIABLE      DESCRIPTION
    RentTotal     Total annual rent of the lease
    SqftLease     Size of the lease in square feet
    FirmType      Majority type of firms in the building
                  (doctors, legal, business, government, other)
    Age           Age of the building in years
    Renovation    Number of years since last renovation
    Wiring        Yes, if building has new wiring
    Occupancy     Fraction of offices that are rented
    Leaselength   Length of the lease in years
    Renewable     Yes, if the lease is renewable
    Location      One of three locations: center of city, old suburb, new suburb
    DistCity      Distance to the city center in miles
    DistAirp      Distance to the airport in miles
    DriveAirp     Distance to the airport in driving time
    DistHosp      Distance to nearest hospital in miles
    FloorLease    The (lowest) floor where the lease is located
    FloorsBldg    Number of floors in the building
    SqftFloor     Size of a floor in square feet
    Elevator      Number of elevators
    Parking       Number of executive parking spaces included in rental
    Restaurant    Yes, if the building has a restaurant
    Exercise      Yes, if building has a health club

  What type seems to be the majority, fixed or variable cost factors?

  - Homework 7:
    . NOT an exercise in model building
    . Focus on explaining the model suggested in the problem statement.
    . Do not throw in all predictors, only those needed to answer
      the questions raised in the problem statement.
    . You can add some suggestions for further analysis in the
      technical section, but you do not need to follow up on them.
    . Again, this is largely a communications exercise:
      Are you able to convey the story contained in a complex
      regression equation to a non-statistical audience of some importance?
    . Use scenario and business language:
        "for ... we expect an average price per sqft of..."
        "for 95% of properties with these characteristics ..."
        "... would be overpriced"
        "... would be a good deal or indicate something is wrong"
        "for ... we expect an average premium/discount of ..."
    . Look out for a counter-intuitive outlier.
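
  A minimal sketch of how the numbers behind scenario statements such as
  "for 95% of properties with these characteristics ..." and
  "we expect an average premium/discount of ..." could be produced in R.
  The course data are not reproduced here, so the sketch ASSUMES R's built-in
  mtcars data as a stand-in (wt in 1000 lb, hp in HP, mpg in MPG).
  The reference case (3,000 lb, 150 HP) and the object names
  fit, ref, ref.pi, scale.by, slope.tab are illustrative choices,
  not part of the lecture or the homework.

```r
# Stand-in data: R's built-in mtcars (wt is in 1000 lb, hp in HP, mpg in MPG).
# This is NOT the course's 'cars' or real estate data; it only makes the sketch runnable.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical reference case with heavily rounded, middle-of-the-pack values.
ref <- data.frame(wt = 3, hp = 150)

# 95% prediction interval for a single case with these characteristics
# (Gaussian-theory interval; columns are fit, lwr, upr).
ref.pi <- predict(fit, newdata = ref, interval = "prediction", level = 0.95)

# Slopes rescaled to meaningful differences: 500 lb (= 0.5 wt units) and 10 HP,
# together with their 95% confidence limits rescaled the same way.
scale.by  <- c(wt = 0.5, hp = 10)
slope.tab <- cbind(estimate = coef(fit)[names(scale.by)] * scale.by,
                   confint(fit, level = 0.95)[names(scale.by), ] * scale.by)

print(round(ref.pi, 1))     # feeds "for 95% of cases like this, expect between ... and ..."
print(round(slope.tab, 2))  # feeds "est. ave. difference of ... (give or take ...)"
```

  The rescaling step is the design choice worth copying: report the slope for
  a difference the audience can picture (500 lb, 10 HP, 1,000 sqft),
  not for one raw unit of the predictor.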
================================================================
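
  A sketch relating to the fixed-cost/variable-cost discussion above:
  for the two-coefficient versions of the models, least squares on
  Ave.Cost ~ I(1/Units) minimizes the same criterion as least squares on
  Tot.Cost ~ Units with weights 1/Units^2, so the two approaches differ
  in their error structure, not in what b0 and b1 mean.
  The data below are SIMULATED (the blocks file is not reproduced here);
  the fixed cost of 2000, the variable cost of 30, the noise level, and all
  object names are made-up assumptions for illustration only.

```r
set.seed(1)
n     <- 200
Units <- round(runif(n, 50, 1000))

# Simulated assumption: fixed cost 2000, variable cost 30 per unit,
# and noise acting at the per-unit level (sd 5 on Ave.Cost).
Ave.Cost <- 30 + 2000 / Units + rnorm(n, sd = 5)
Tot.Cost <- Ave.Cost * Units

ave.fit <- lm(Ave.Cost ~ I(1/Units))                   # intercept = variable, slope = fixed
tot.ols <- lm(Tot.Cost ~ Units)                        # intercept = fixed, slope = variable
tot.wls <- lm(Tot.Cost ~ Units, weights = 1/Units^2)   # same criterion as ave.fit

# Put all estimates in the same (fixed, variable) order for comparison.
costs <- rbind(ave.model = rev(coef(ave.fit)),
               tot.ols   = coef(tot.ols),
               tot.wls   = coef(tot.wls))
colnames(costs) <- c("fixed", "variable")
print(round(costs, 2))
```

  The ave.model and tot.wls rows agree up to rounding, while the unweighted
  tot.ols row generally differs because it lets the large orders dominate the fit.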