17 Comparison

As in previous chapters in this part of the book, we again use R as a powerful calculator. You can take a different approach, however: R includes a function, t.test, that does most of the work and avoids the extra assumption of equal variances. The examples here illustrate that function and also show how to do the calculations from scratch, for practice. Only a few functions, or new uses of functions, are introduced in this chapter:

  • boxplot draws side-by-side comparisons of groups
  • t.test computes two-sample t-statistics, avoiding the assumption of equal variances

17.1 Analytics in R: A/B Testing

The underlying data are in a file; once read into R, they can be summarized in a contingency table.

Test <- read.csv("Data/17_4m_abtesting.csv")
dim(Test)
## [1] 2495    3

Each row summarizes the behavior of a visitor to the web site. The first column is the ID number of the visitor, and the other two columns give the page viewed and whether the visitor added the item to the shopping cart.

head(Test)
##   Visitor Page_Viewed Add_To_Cart
## 1  892360           B          No
## 2  154666           A          No
## 3  100035           A          No
## 4  904653           B          No
## 5  796311           B          No
## 6  864892           B          No

We can get the proportions from the contingency table. CrossTable (in the package gmodels introduced in Chapter 5) provides a good summary.

require(gmodels)
CrossTable(Test$Add_To_Cart,Test$Page_Viewed, 
           prop.t=FALSE, prop.r=FALSE, prop.chisq=FALSE)  # suppress extraneous terms
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2495 
## 
##  
##                  | Test$Page_Viewed 
## Test$Add_To_Cart |         A |         B | Row Total | 
## -----------------|-----------|-----------|-----------|
##               No |      1192 |      1198 |      2390 | 
##                  |     0.975 |     0.942 |           | 
## -----------------|-----------|-----------|-----------|
##              Yes |        31 |        74 |       105 | 
##                  |     0.025 |     0.058 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |      1223 |      1272 |      2495 | 
##                  |     0.490 |     0.510 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 
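
If gmodels is not installed, base R can reproduce the column proportions with table and prop.table; this is a minimal equivalent, not the full CrossTable summary.

tab <- table(Test$Add_To_Cart, Test$Page_Viewed)  # counts, as in CrossTable
prop.table(tab, margin=2)   # margin=2 gives proportions within each column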

Define the needed sample statistics so that we can use expressions like those in the textbook for the confidence interval.

n_a <- 1223
n_b <- 1272

p_a <- 31/n_a
p_b <- 74/n_b

zquan <- -qnorm(0.025)   # approximately 1.96

The expression for the standard error is on the long side, but it is shared by both endpoints of the confidence interval.

(p_b - p_a) + c(-zquan,zquan) * sqrt( p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b )
## [1] 0.01723788 0.04841931

The interval does not include 0 and so we reject \(H_0\) that the two pages are equally effective. Page B has the higher conversion rate.
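
R can also run this comparison directly with prop.test. As a check on the hand calculation, a sketch: prop.test applies a continuity correction by default, so setting correct=FALSE makes its interval match the one computed above.

# order the counts as B then A so the interval estimates p_b - p_a
prop.test(x=c(74, 31), n=c(1272, 1223), correct=FALSE)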

17.2 Analytics in R: Comparing Two Diets

Start by reading in the data.

Diet <- read.csv("Data/17_4m_diet.csv")
dim(Diet)
## [1] 63  6

The analysis concerns the amount of weight lost by the 63 participants after six months on the diet.

head(Diet)
##     Diet Initial_Weight Weight_6_Month Weight_12_Month Loss_6_Months
## 1 Atkins            310          292.7           286.1          17.3
## 2 Atkins            309          275.1           306.3          33.9
## 3 Atkins            257          217.7           263.3          39.3
## 4 Atkins            227          221.1           216.8           5.9
## 5 Atkins            231          204.5           211.8          26.5
## 6 Atkins            195          148.0           174.5          47.0
##   Loss_12_Months
## 1           23.9
## 2            2.7
## 3           -6.3
## 4           10.2
## 5           19.2
## 6           20.5
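
The first rows suggest that Loss_6_Months is simply Initial_Weight minus Weight_6_Month; that reading is an inference from the listing, and a one-line check makes the assumption explicit.

# sanity check of the presumed definition of the loss column
all.equal(Diet$Initial_Weight - Diet$Weight_6_Month, Diet$Loss_6_Months)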

The function boxplot can draw boxplots side-by-side to compare data in two or more groups. We can use the formula notation introduced in Chapter 6 to define the variables in the plot. The variable to the left of the ~ defines the y-axis, and the variable to the right of the ~ defines the x-axis.

boxplot(Loss_6_Months ~ Diet, data=Diet)

The boxplots appear reasonably symmetric, but these are small samples, so we check the kurtosis of the two groups using tapply (illustrated in Chapter 14). The sample estimates of \(K_4\) are small enough that we meet the sample size condition in both samples.

require(moments)
tapply(Diet$Loss_6_Months, Diet$Diet, kurtosis) - 3
##       Atkins Conventional 
##  -0.09135406  -0.66857151
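
If the sample size condition is that each group's size exceed 10 times the absolute excess kurtosis (that multiplier is an assumption here; check the textbook for the exact threshold), a direct check is short.

# assuming the condition is n > 10*|K4|
K4 <- tapply(Diet$Loss_6_Months, Diet$Diet, kurtosis) - 3
tapply(Diet$Loss_6_Months, Diet$Diet, length) > 10*abs(K4)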

Now compute the relevant summary statistics for each group using tapply again.

n    <- tapply(Diet$Loss_6_Months, Diet$Diet, length)
xbar <- tapply(Diet$Loss_6_Months, Diet$Diet, mean)
s    <- tapply(Diet$Loss_6_Months, Diet$Diet, sd)

For example, xbar is a two-element vector with the mean of each group.

xbar
##       Atkins Conventional 
##    15.424242     7.006667

These sample statistics determine the t-statistic. Don’t forget to subtract off the break-even value 5 from the difference between the means.

t_stat <- (xbar[1]-xbar[2] - 5)/sqrt(s[1]^2/n[1] + s[2]^2/n[2])
t_stat
##   Atkins 
## 1.014365

Assuming equal variances in the two populations (see the discussion in the textbook), we can use pt to find the p-value.

1 - pt(t_stat, df=sum(n)-2)
##    Atkins 
## 0.1572075

To get the more sophisticated version of the t-statistic that does not require that the data come from populations with equal variances, use the function t.test. You supply the data to this function as two separate vectors; here, the values in the column Diet identify observations in the “Atkins” and “Conventional” groups. The test statistic is the same, but the reported p-value is substantially larger because t.test is two-sided by default: it doubles the one-sided p-value computed above.

group_a <- Diet$Loss_6_Months[Diet$Diet=="Atkins"]
group_c <- Diet$Loss_6_Months[Diet$Diet=="Conventional"]

t.test(group_a, group_c, mu=5)
## 
##  Welch Two Sample t-test
## 
## data:  group_a and group_c
## t = 1.0144, df = 60.826, p-value = 0.3144
## alternative hypothesis: true difference in means is not equal to 5
## 95 percent confidence interval:
##   1.68010 15.15505
## sample estimates:
## mean of x mean of y 
## 15.424242  7.006667
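
t.test also accepts the formula notation used with boxplot, which avoids building the two vectors by hand. Setting alternative="greater" recovers the one-sided p-value computed with pt above (up to the small change in degrees of freedom); a sketch:

# groups follow the factor levels, so the difference is Atkins minus
# Conventional; alternative="greater" gives the one-sided p-value
t.test(Loss_6_Months ~ Diet, data=Diet, mu=5, alternative="greater")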

17.3 Analytics in R: Evaluating a Promotion

Start by reading the data file.

Promo <- read.csv("Data/17_4m_promo.csv")
dim(Promo)
## [1] 125   2

Each row gives the awareness and the number of mailings for one of the 125 offices that were visited.

head(Promo)
##   Awareness Mailings
## 1        NO       15
## 2        NO       49
## 3        NO       42
## 4        NO        0
## 5        NO       26
## 6        NO       35

The boxplots look similar and symmetric.

boxplot(Mailings ~ Awareness, data=Promo)

The estimated excess kurtosis in the two samples confirms that we have enough data to meet the sample size condition.

require(moments)
tapply(Promo$Mailings, Promo$Awareness, kurtosis) - 3
##         NO        YES 
## -0.7326079 -0.9457113

Use tapply as in the prior example to find the sample statistics for each group. Notice that the first element in each vector describes the “NO” group (those offices that were not aware of the promotion prior to the visit).

n    <- tapply(Promo$Mailings, Promo$Awareness, length)
xbar <- tapply(Promo$Mailings, Promo$Awareness, mean)
s    <- tapply(Promo$Mailings, Promo$Awareness, sd)

We can compute the test statistic by almost copying the same formula used in the prior example.

t_stat <- (xbar[1]-xbar[2])/sqrt(s[1]^2/n[1] + s[2]^2/n[2])
t_stat
##        NO 
## -2.886651
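
Although this example works with a confidence interval, pt gives the matching two-sided p-value; since t_stat is negative, double the lower-tail area.

2 * pt(t_stat, df=sum(n)-2)   # two-sided p-value, equal-variance df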

You can guess now that the confidence interval won’t include zero. Let’s check that. (Again, I am assuming the variances match in the two populations to simplify finding the degrees of freedom for the t-quantile. This interval and the Welch interval from t.test below are very similar, particularly after rounding.)

The confidence interval for \(\mu_{yes}-\mu_{no}\) is then

t_quant <- -qt(0.025, df=sum(n)-2)

ci <- (xbar[2]-xbar[1]) + c(-t_quant, t_quant) * sqrt(s[1]^2/n[1] + s[2]^2/n[2])
ci
## [1]  3.867722 20.745611
round(ci,1)
## [1]  3.9 20.7

To avoid the details, use the function t.test. You just have to collect the data for the two groups into separate vectors. In this example, the Welch results are very slightly less significant than those from the just-illustrated procedure that assumes equal variances; its interval is a bit wider. (Be careful typing the names of the groups; R is case sensitive.)

group_y <- Promo$Mailings[Promo$Awareness=="YES"]
group_n <- Promo$Mailings[Promo$Awareness=="NO"]

t.test(group_y, group_n)
## 
##  Welch Two Sample t-test
## 
## data:  group_y and group_n
## t = 2.8867, df = 85.166, p-value = 0.004933
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.830319 20.783014
## sample estimates:
## mean of x mean of y 
##  42.00000  29.69333
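
Setting var.equal=TRUE instead produces the classical pooled-variance test. It pools the two sample variances, whereas the interval computed by hand above kept the variances separate and used only the pooled degrees of freedom, so the two agree closely but not exactly.

t.test(group_y, group_n, var.equal=TRUE)   # pooled variances, df = sum(n)-2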

17.4 Analytics in R: Sales Force Comparison

Read the data into a data frame.

Sales <- read.csv("Data/17_4m_sales_force.csv")
dim(Sales)
## [1] 20  4

Each row describes the sales of the two groups within a district.

head(Sales)
##   District Group_A Group_B Difference
## 1        1     370     428        -58
## 2        2     396     430        -34
## 3        3     390     369         21
## 4        4     372     385        -13
## 5        5     210     239        -29
## 6        6     415     418         -3

Because of the pairing, the sales obtained by the two groups are highly dependent.

plot(Group_A ~ Group_B, data=Sales)
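
The correlation between the two columns quantifies this dependence.

cor(Sales$Group_A, Sales$Group_B)   # a value near 1 reflects the strong pairing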

We take advantage of this pairing by comparing the two sales groups within each sales district, working with the differences. The differences in the data frame are formed as the sales of group A minus the sales of group B.

hist(Sales$Difference)

This is a small sample of districts, but the kurtosis is small, so we have enough data to meet the sample size condition.

require(moments)
kurtosis(Sales$Difference)-3
## [1] -0.7059701

All that remains is to compute the needed statistics and form either a one-sample test or a confidence interval for the mean of the differences.

n <- length(Sales$Difference)

xbar <- mean(Sales$Difference)
s    <- sd(Sales$Difference)

t_quant <- -qt(0.025, df=n-1)   # approximately 2.09

The confidence interval for \(\mu_A - \mu_B\) is

ci <- xbar + c(-t_quant,t_quant) * s/sqrt(n)
ci
## [1] -26.0181479  -0.9818521

Since both endpoints are negative, we conclude that \(\mu_B\) is statistically significantly larger than \(\mu_A\).
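
As in the earlier examples, t.test reproduces this interval. Apply it to the differences, or equivalently use paired=TRUE with the two columns.

t.test(Sales$Difference)                           # one-sample test on differences
t.test(Sales$Group_A, Sales$Group_B, paired=TRUE)  # equivalent paired test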