================================================================

* SIMULTANEOUS INFERENCE IN LINEAR MODELS: MAX-t AND SCHEFFE

  (Background: Ioannidis, irreproducible research; Simmons, Nelson, Simonsohn;
   on our website:
     site <- "http://www-stat.wharton.upenn.edu/~buja/STAT-961"
     "Ioannidis-PLOS-2005-Why Most Published Research Findings Are False-annotated.pdf"
     "Simmons_Nelson_Simonsohn_FalsePositivePsychology-2011.pdf"
  )

- Problems:
  . The significance levels and confidence guarantees are one parameter at a time.
  . By construction of 5% significance tests, 5 out of 100 true nulls will be erroneously rejected.
  . By construction of 95% confidence intervals, 5 out of 100 will miss the true parameter.
  . These days tests and CIs are sometimes produced by the 100,000s.
    Ex.: Testing whether genetic copy number variations (CNV) or
         single nucleotide polymorphisms (SNP) are linked to specific
         diseases or disabilities (cancers, autism, ...).
  ==> Problems of simultaneous/multiple inference

- Common approaches to the problem are based on two fundamental concepts:
  . family-wise error rates (FWER)
  . false discovery rates (FDR)
  We will apply the notion of FWER to linear models.
    FWER = simultaneous Type I error probability
  FDR is more appropriate in situations of many independent tests;
  it is harder to apply in regression, where tests are dependent.
  (Exception: designed experiments with orthogonal predictors.)

- Meanings of FWER:
  . Testing:    The probability (rate) of making ANY false rejection
                among a number of tests is (no more than) alpha (= FWER).
  . Confidence: The probability (rate) of catching ALL true parameter values
                among a number of CIs is (at least) 1-alpha (= 1-FWER).

- We will consider FWER control based on the following approaches:
  . Bonferroni
  . exact control under independence
  . max-t / min-pval simultaneous inference
  . Scheffe simultaneous inference

- Bonferroni: a rough-and-ready rule

  . Example: Consider a regression with p predictors.
    Assume that we want our inferences to be simultaneously controlled
    at 5% for all p t-statistics.

  . In general: We would like a significance level alpha (= 0.05)
    SIMULTANEOUSLY for false rejection in any among q tests.

      H0.sim:   beta1 = beta1.0  &  ...  &  betaq = betaq.0
                ^^^^^^^^^^^^^^^             ^^^^^^^^^^^^^^^
                     H0.1          & ... &       H0.q          # H = null hypotheses, not hat matrices!

    Test statistics:  tj = (bj - betaj.0) / SE.hat(bj)   (as usual)

      FWER = P[ any false rejection j=1,...,q | H0.sim ]
           = P[ |t1| > c or ... or |tq| > c | H0.sim ]
          <= P[ |t1| > c | H0.sim ] + ... + P[ |tq| > c | H0.sim ]
          <= q * max_{j=1..q} P[ |tj| > c | H0.j ]        (******************)   see below
           = q * alpha.ind     (alpha.ind = marginal/one-at-a-time/individual Type I error)

    PS: The step to (******************) requires some thought --
        one of the subtleties in this area.
        Recall: For any betaj, the other betak are nuisance parameters.
        The step to (******************) should be considered an inequality, not an equality,
        and not just for the reason that there is a 'max_{j=1..q}'.
        There is also the difference between H0.sim and H0.j !
        Both are composite hypotheses, but H0.sim is a subset of H0.j,
        because H0.sim specifies the other parameters to have specific values,
        whereas H0.j does not.
        Therefore, one should read P[ |tj| > c | H0.j ] as follows:
          P[ |tj| > c | H0.j ]
            = max_{betak.0 arbitrary for all k != j}
                P[ |tj| > c | betaj=betaj.0, betak=betak.0 for k != j ]
        because the test of betaj should be at level alpha
        for any setting of betak (k != j).
        However, one also should read P[ |tj| > c | H0.sim ] as follows:
          P[ |tj| > c | H0.sim ]
            = P[ |tj| > c | betaj=betaj.0, betak=betak.0 for k != j ]
        Thus we have
          P[ |tj| > c | H0.sim ]  <=  P[ |tj| > c | H0.j ]        (******************)
        The pivotal nature of the t-tests relieves us of this concern:
        they are level alpha for all nuisance settings,
        hence we have equality in (******************).
        [Being careless about the nature of composite null hypotheses
         can lead to thinking errors and incorrect inference methods.]

    [Further clarifications:
     - A model is a set of dataset distributions.
     - A parametric model is a set of dataset distributions parametrized by a
       finite-dimensional vector such as (beta0,beta1,...,betap,sigma^2).
     - Example: normal linear model
         M = { P(dy1,dy2,...,dyN):  P = N(sum_{j=0...p} xj betaj, sigma^2 I_N) }
     - A null hypothesis is a subset of the model M,
       usually given by a restriction on the model parameters:
         H0.j   = { P: betaj=0 }
                = { P(dy1,dy2,...,dyN):  P = N(sum_{k=0...p, k!=j} xk betak, sigma^2 I_N) }
         H0.sim = { P: beta1=beta2=...=betap=0 }
                = { P(dy1,dy2,...,dyN):  P = N(x0 beta0, sigma^2 I_N) }     (x0 = e = (1,...,1))
                = H0.1 intersect H0.2 intersect ... intersect H0.p
     - Type I error control really means controlling
         max_{P in H0.j} P[ |tj| > c ]  <  alpha
     - The reason for not needing 'max_{P in H0.j}' in regression t-tests is that
       the t-statistic is pivotal: its null distribution does not depend on the
       nuisance parameters, or, equivalently, the rejection probabilities
       P[ |tj| > c ] are the same for all P in H0.j !!
    ]

    Consequence: If we determine the cut-off 'c' in individual tests such that

      ---------------------------------
      |                               |
      |     alpha.ind = alpha / q     |     Bonferroni rule:
      |                               |     universally conservative
      ---------------------------------

    then we control the FWER CONSERVATIVELY at level alpha.

  . Bonferroni rule: Test individually/marginally at level alpha/q,
    and you will test family-wise at level alpha.

  . Q: When is the Bonferroni rule too conservative?
    A: ...
       Ex.: Drastic, but makes the point...
            Assume the q tests are identical, i.e., they reject/retain together
            100% of the time.
            ==> They are really just one test masquerading as q tests.
            Hence:  P[ false rejection in any of the q tests ]
                  = P[ false rejection in one test ]  =  alpha
            No adjustment for simultaneity is needed.
       Generalizing: The more the tests are correlated,
                     the less adjustment for simultaneity is needed.

- Exact control under independence:

  . In what follows we consider a situation as in regression, but with sigma known,
    because we want to analyze the case of independent test statistics.
    Q: Why can't the t-statistics in a regression be independent?
    A: ...
    ==> Assuming sigma known means working with z- rather than t-statistics.
        This is a good approximation when N is VERY large compared to p.

  . Assume the z-statistics z1,...,zq are independent.
    In regression this is the case when the predictors X1,...,Xq are orthogonal,
    or if we use orthogonalized predictors:  X1, X2.1, X3.12, X4.123, ..., Xq.12...{q-1}

      1 - FWER = P[ no false rejection | H0(1) & ... & H0(q) ]
               = P[ |z1| <= c and ... and |zq| <= c | H0(1) & ... & H0(q) ]
               = Product_{j=1...q} P[ |zj| <= c ]
               = Product_{j=1...q} (1-alpha)     # if all tests are performed at level alpha
               = (1-alpha)^q

    Therefore, if we determine the cut-off 'c' in individual tests such that

      ----------------------------------------------
      |                                            |
      |   P[ |z| > c ]  =  1 - (1 - alpha)^{1/q}   |   valid under independence of tests
      |                                            |
      ----------------------------------------------

    then we control the FWER EXACTLY at level alpha.

  . Justification of Bonferroni as a good rule under independence:

      (1 - alpha)^{1/q}      <=  1 - alpha/q              (1st order Taylor at alpha=0, concavity of x^{1/q})
      1 - (1 - alpha)^{1/q}  >=  1 - (1 - alpha/q)  =  alpha/q

    ==> alpha/q is a conservative per-test level to achieve simultaneous level alpha.
    Note, though: Bonferroni is not exact in this case either; it is just a good approximation.

- Dependent tests: The max-|t| approach

  . Applies when the tests are correlated.
    Consider q 2-sided t-tests IN A MULTIPLE REGRESSION.
    The FWER is
      P[ any |tj| > c | H0 ]  =  P[ max_j |tj| > c | H0 ]
    Call 'max_j |tj|' the 'max-|t|' statistic.  It is a statistic in its own right.
    The problem is that we don't know its null distribution.
    However, we can easily approximate it by simulation:

  . Simulation recipe for FWER control:
    + Simulate 9,999 values of max-|t|.
      Make sure to use the given predictors X to capture the correlations among the tj's:
        tj = <xj.adj/|xj.adj|, eps> / s
    + Store these 9,999 values, sort them, and pick the 9,500'th value
      as an estimate of the cut-off 'c' for which
        P[ max |tj| > c ] ~=~ 0.05
    PS: The simulation cost depends on p only, not on N:
        one can pre-project all xj.adj/|xj.adj| onto the column space using an o.n. basis
        thereof, and thereafter sample rnorm(p+1) instead of rnorm(N)
        (a sketch of this trick follows the single-draw example below).
        Also, one draw of 's' can be simulated as sqrt(rchisq(1,df=N-p-1) / (N-p-1)).
        Due to pivotality wrt sigma you can choose sigma=1 wlog.

  . Executing max-|t| inference in R: 'Boston Housing data'

    # Repeat from Chapter 1: reading the data
    site <- "http://stat.wharton.upenn.edu/~buja/STAT-961"
    boston <- read.table(paste(site,"boston.dat",sep="/"), header=T)   # data

    # Size of data and variables:
    dim(boston)        # 506 census tracts, 14 variables
    colnames(boston)   # you can guess the meanings of some variables
    summary(boston)
    plot(boston, gap=0, pch=".")
    # The last variable, 'MEDV' (Median Housing Value in census tract), is the response.

    X <- scale(boston[,1:13])   # just the predictor variables, standardized,
                                # in order to remove the mean (intercept)
    class(X)                    # a matrix, no longer a dataframe
    N <- nrow(X)
    p <- ncol(X)

    # Matrix of adjusted predictors, scaled to unit length:
    norm <- function(x) x/sqrt(sum(x^2))   # Normalizing to unit vector
    X.adj <- sapply(1:ncol(X), function(j) norm(resid(lm(X[,j] ~ X[,-j]))) )    # (1)
    class(X.adj); dim(X.adj)               # Matrix of size 506 x 13
    apply(X.adj, 2, function(x) sum(x^2))  # Checking unit lengths...
    apply(X.adj, 2, function(x) sum(x))    # Checking zero means

    # This is actually an inefficient way of going about it.  Here is a simpler way:
    X0 <- cbind(1,X)
    X.adj <- apply(X0 %*% solve(t(X0) %*% X0)[,-1], 2, norm)                    # (2)
    # (Figure out why (1) and (2) do the same.)
    # ==> The columns of 'X.adj' contain the unit vectors xj.adj/|xj.adj|

    # All of this was preparation for the simulation.
    # Here is an example of simulating one set of t-statistics:
    t.stats <- crossprod(X.adj, rnorm(N)) / sqrt(rchisq(1,df=N-p-1)/(N-p-1))
    t.stats <- (t(X.adj) %*% rnorm(N)) / sqrt(rchisq(1,df=N-p-1)/(N-p-1))   # same
    t.stats            # One draw from the joint null distribution of t1,...,t13
    abs(t.stats)       # Throw away the signs.
    max(abs(t.stats))  # Choose the max.
    # This last value is one simulation draw from the null distribution of max(|t1|,...,|t13|).
    # It is the quantity of interest for FWER control.
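    # Aside, a minimal sketch of the pre-projection speed-up mentioned in the PS above
    # (an editorial sketch, not part of the recipe itself; it assumes X.adj has full
    #  column rank, so d = ncol(Q) is at most p):
    Q <- qr.Q(qr(X.adj))            # o.n. basis of span(X.adj), N x d
    A <- crossprod(Q, X.adj)        # d x p;  t(X.adj) %*% eps has the same distribution as t(A) %*% z
    one.draw <- function() {
      z <- rnorm(ncol(Q))                        # d-dimensional standard normal
      s <- sqrt(rchisq(1, df=N-p-1)/(N-p-1))     # independent sigma hat (sigma = 1 wlog)
      max(abs(crossprod(A, z))) / s              # one draw of max-|t|
    }
    quantile(replicate(9999, one.draw()), 0.95)  # should roughly match the cut-off found below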
    # So now we want not one draw but, say, 99,999 draws from this distribution!
    # [We'll say more later about why such odd numbers.]
    Nsim <- 99999

    # Here is the loopy way:
    t.max <- rep(NA, Nsim)   # Pre-allocate a long vector for the max-|t| values.
    for(isim in 1:Nsim) {
      t.max[isim] <- max(abs(crossprod(X.adj, rnorm(N)))) / sqrt(rchisq(1,df=N-p-1)/(N-p-1))
      if(isim%%10000==0) cat(isim," ")
    }
    # Takes a few seconds.

    # Silly: We called rchisq() 99,999 times but could call it just once!
    # Do it over:
    tmp <- rep(NA, Nsim)     # Pre-allocate a long vector for the max-|t| numerators.
    for(isim in seq(Nsim)) {
      tmp[isim] <- max(abs(crossprod(X.adj, rnorm(N))))
      if(isim%%10000==0) cat(isim," ")
    }
    # Somewhat faster.  Now for the denominators:
    t.max <- tmp / sqrt(rchisq(Nsim,df=N-p-1)/(N-p-1))

    # Still faster, but not by much (removing the cat() calls may help):
    t.max <- sapply(seq(Nsim), function(j) { max(abs(crossprod(X.adj, rnorm(N)))) } ) /
             sqrt(rchisq(Nsim,df=N-p-1)/(N-p-1))

    # So now we should know a lot about the distribution of max-|t| for this dataset:
    hist(t.max, col="gray", breaks=200)

    # More importantly, the upper 95% quantile yields
    # - the cut-off for 2-sided t-tests at FWER 5%, AND
    # - the multiplier of standard errors for CIs with 95% family-wise confidence.
    # Let's get them for family-wise alpha = 0.05 and 0.01:
    quantile(t.max, prob=c(0.95,0.99))
    ##      95%      99%
    ## 2.858514 3.357303
    abline(v=quantile(t.max, prob=c(0.95,0.99)), col="red")

    # So your multipliers should be about 2.86 and 3.36, resp.:
    # - declare significance at FWER=0.05 if |tj| > 2.86,
    #                        at FWER=0.01 if |tj| > 3.36
    # - construct family-wise 95% CIs as [bj +- 2.86*SE[bj]],
    #             family-wise 99% CIs as [bj +- 3.36*SE[bj]]

    # Reverse question: If you used the conventional cut-off,
    qt(0.975, df=N-p-1)
    ## [1] 1.964797
    # what would the FWER be?
    sum(t.max > qt(0.975, df=N-p-1))/(Nsim+1)
    ## [1] 0.44328
    # ==> For over 44% of datasets there would be a false rejection
    #     under the simultaneous null hypothesis.

    # Here are the results for the actual data:
    summary(lm(MEDV ~ ., data=boston))
    # ==> At FWER=0.05 there is no ambiguity.
    #     At FWER=0.01 several predictors are insignificant: CRIM, CHAS, TAX,
    #     whereas ZN and B are just barely significant.

- The min-pvalue approach to FWER control:

  . Example of a novel question that arises in practice:
    sequential testing when acquiring more and more data.
    What problem would we face if we acquired the Boston Housing data
    one census tract after another and tested each time?

  . Consider the following simplified setup:
    We have a single predictor, and inference is sought for its slope.
    The data are acquired sequentially in groups of cases,
    and a t-test is performed each time an additional group of cases is acquired.
    You can imagine the 'investigator' stops testing once the p-value falls below 0.05.

  . Technical difficulty:
    o In the above example we developed simultaneous inference for all 13 slopes
      in the Boston Housing data.  We did this with a max-|t| approach.
      Note all t-stats had the same dfs.
    o Max-|t| doesn't work in sequential testing because the degrees of freedom
      keep increasing as we get more data.
      This raises a general problem:
      Q: How can we make different types of test statistics comparable?
         For example, we may want simultaneous inference for
           { t with 10 dfs, and t with 20 dfs }   or   { chisq with 5 dfs, and F with 3,97 dfs } ?
      A: Translate the test statistics to their p-values.
         p-values have the virtue of being on a probability scale:
         'pval <= 0.05' means the same for t-stats, F-stats, chisq-stats, ...
         (See the little sanity check below.)
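      [A quick sanity check, as a small sketch with simulated null t-statistics:
       two-sided p-values from t-tests with very different dfs are all U(0,1) under H0,
       hence directly comparable.]

        p10 <- 2*pt(abs(rt(10000, df=10)), df=10, lower.tail=FALSE)   # 2-sided p-values, 10 dfs
        p20 <- 2*pt(abs(rt(10000, df=20)), df=20, lower.tail=FALSE)   # 2-sided p-values, 20 dfs
        hist(p10, breaks=20, col="gray")   # ~ flat on [0,1]
        hist(p20, breaks=20, col="gray")   # ~ flat on [0,1]: same scale despite different dfs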
  ==> Instead of max-|t|, use min-pvalue.

      ----------------------------------------------------------
      |  The minimum-p-value approach to simultaneous inference |
      ----------------------------------------------------------

## - Application to the sequential testing problem, just ONE predictor:

Nsim <- 9999
N <- 1000               # Maximum length of sequential 'data collection'
Ns <- seq(10,N,by=10)   # Sample sizes to be tried sequentially ==> try n = all multiples of 10
x <- rnorm(N)           # Fixed single predictor; sequentially test its slope for 1:10, 1:20, ...
pval.05.locs <- rep(NA, Nsim)   # one entry per simulated dataset
pval.05.vals <- rep(NA, Nsim)
pval.05.num  <- rep(NA, Nsim)
pval.min     <- rep(NA, Nsim)
for(isim in 1:Nsim) {    # THIS TAKES A COUPLE OF MINUTES!
  e <- rnorm(N)          # Sample a null response vector of max length
  successive.pvals <-    # Collect sequential p-values for this response vector
    sapply(Ns,           # Ns = sequential sample sizes to be tried
           function(n) {
             xn <- x[1:n]                         # regressor up to n
             en <- e[1:n]                         # null response up to n
             x.adj <- xn - mean(xn)               # adjusted regressor
             x.adj.1 <- x.adj/sqrt(sum(x.adj^2))  # x.adj normalized to unit length
             res <- en - sum(en*x.adj.1)*x.adj.1  # residual vector
             s <- sqrt(sum(res^2)/(n-2))          # sigma hat
             numerator <- abs(sum(x.adj.1*en))    # numerator of |t|
             pt(numerator/s, n-2, lower=F)*2      # p-value, 2-sided
           } )
  # Now we have a succession of p-values for one sequence of 'datasets'!
  # The following code is due to undergrad student Elliot Oblander, Fall 2017,
  # who corrected the instructor's thinking mistake: originally, minimal p-values
  # were collected in each simulation instead of the first p-values < 0.05.
  sel.05 <- (successive.pvals < .05)   # Note p-values dipping below 0.05, if any.
  loc.05 <- ifelse(any(sel.05),        # Any?
                   which(sel.05)[1],   # Yes, get first location (n/10) where p-val < 0.05
                   NA)                 # No, no dip below 0.05
  pval.05.locs[isim] <- loc.05                     # Store location
  pval.05.vals[isim] <- successive.pvals[loc.05]   # Store value at location
  pval.05.num[isim]  <- sum(sel.05)
  pval.min[isim]     <- min(successive.pvals)
  if(isim%%500==0) cat(isim," ")
}; cat("\n")
#
# Under the null hypothesis p-values have a uniform distribution on [0,1].
# What is the distribution of the p-values at the first dip below 0.05?
hist(pval.05.vals, breaks=100, col="gray")
# What fraction of p-value sequences dip below 0.05?
sum(!is.na(pval.05.locs))/(Nsim+1)   # More than 1/3 !!!!!!!!!
# [1] 0.3588
# Examine the distribution of locations (i.e., sample sizes) where dipping below 0.05 first occurs:
hist(pval.05.locs*10, breaks=100, col="gray", xlab="Sequential Sample Sizes n", main="")
# ==> The majority of dips below 0.05 occur for small samples.
#     This makes sense: small samples (n=10,20,...) have greater sampling variability
#     than large samples (n=1000, 990, 980, ...),
#     hence false rejections are more likely for small n.
#
# How many p-values below 0.05 per dataset?
tab <- table(pval.05.num); tab
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 6247  845  401  291  231  168  146  126   96   88   72   81   67   50   63   53
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
##   39   46   34   43   37   37   34   21   28   21   25   27   25   15   24   22
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
##   19   18   12   16   20   25   17   14   11   19   18    5    8   10   10   23
##   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
##   11   13   16    4    5   10    8    6    5    8    3    6    8    5    4    4
##   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
##   11    7   10    5    3    5    5    6    2    3    6    7    7    6    3    3
##   80   81   82   83   84   85   86   87   88   89   91   92   93   95   96   97
##    4    1    5    4    7    3    2    3    1    2    4    3    1    2    2    1
##   98
##    1
# ==> No dips below 0.05 in 6247 datasets out of 9999, 1 dip in 845 datasets,
#     but also as many as 98 dips in one dataset.
tab0 <- tab   # optionally remove the front of the table: tab[-1], tab[-(1:2)], tab[-(1:3)]
plot(x=as.numeric(names(tab0)), y=as.vector(tab0), pch=16, cex=.8,
     xlab="Num dips", ylab="Freq")
abline(h=seq(0,100,by=20), col="gray80")

# Examine the distribution of minimum p-values:
hist(pval.min, breaks=100, col="gray", xlab="Minimum p-value distribution", main="")

# To gain FWER protection at 0.05, we should use the following per-test alpha:
quantile(pval.min, probs=0.05)
##          5%
## 0.004394076
# ==> To achieve FWER = 0.05 across the sequential tests from n=10 to n=N=1000 in steps of 10,
#     each of the tests should be performed at a per-test significance level
#     of about 0.0044.
# Bonferroni would be too conservative due to the strong dependence between successive tests:
# it requires alpha = 0.05/length(Ns) = 0.05/100 = 0.0005.
# The effective Bonferroni divisor should be:
0.05 / quantile(pval.min, probs=0.05)
# or about 11, not 100.
#
# INTERESTING QUESTION: Can we think of 11 as the appropriate degrees of freedom for
# sequential data acquisition carried out with the above specifications?

  . Messages from the exercise of sequential testing:

    o The min-pvalue method of FWER adjustment always works,
      no matter how heterogeneous your test statistics are.
      The beauty of the p-value scale is that it is a uniform probability scale
      on which different tests can be compared.

    o Reminder, general fact:

        ---------------------------------------------------------
        |  If the random variable X has a continuous cdf F,     |
        |  then both F(X) and 1-F(X) have U[0,1] distributions. |
        ---------------------------------------------------------

      Therefore p-values are U[0,1] distributed under their H0s.
      This is what the p-value does: it passes the test statistic through the
      complement cdf 1-F of its null distribution.
      So if H0 is true, then the p-value is U[0,1] distributed.
      [There is a messy detail for 2-sided tests:
       for example, |t| has a folded t-distribution on [0,Inf).
       But the principle still applies: 1-F(t) = P[|T| > t], where F(t) = P[|T| <= t]
       is the cdf of |T|, hence
         p-value = 1 - F(|T|) ~ U[0,1].
      ]

    o Remember this:

        ------------------------------------------------------
        |  If you select the smallest among several p-values, |
        |  it is no longer a p-value.                          |
        ------------------------------------------------------

    o The sequential procedure that consists of collecting more and more data
      till one achieves significance is an example of what
      Simmons, Nelson and Simonsohn call

        ----------------------
        |  'p-value hacking'  |
        ----------------------

      (Simmons and Simonsohn are in our OID dept; Nelson is at UCB.)

    o SN&S write that FWER adjustment for p-value hacking is not feasible.
      They probably have Bonferroni in mind.
      In fact, it would be silly to adjust the above exercise with Bonferroni:
      0.05/100 = 0.0005.  Why?  Because successive tests are VERY HIGHLY CORRELATED.
      However, we have shown with the above simulation that indeed

        ------------------------------------------------------
        |  FWER ADJUSTMENT FOR P-VALUE HACKING IS POSSIBLE.   |
        ------------------------------------------------------

    o S&S developed a diagnostic for p-value hacking:
      p-value hackers are not really fishing for the smallest p-value;
      they just wait and juggle the data till they see p-values <= 0.05.
      This should create a preponderance of p-values just below 0.05,
      which is exactly what we saw in the histogram of dipping p-values:
        hist(pval.05.vals, breaks=100, col="gray")
      The S&S diagnostic essentially histograms the reported p-values <= 0.05,
      and if they cluster just below 0.05, S&S say there is evidence of p-value hacking.

- ADJUSTMENT OF OBSERVED P-VALUES FOR SIMULTANEITY:

  . We can ask how the raw p-values need to be adjusted to reflect the fact that
    they are one among several p-values, each of which could have been small due to chance,
    because the minimum of several p-values is <= each of them.

  . Q: Using the Bonferroni rule, how would you adjust p-values?
    A: Multiply the p-values by the number of tests, but cap at 1.

    Example: Bonferroni p-value adjustment for the 14 regression coefficients
             in the Boston Housing data

    pvals.raw.theory <- summary(lm(MEDV ~ ., data=boston))$coefficients[,"Pr(>|t|)"]
    pvals.adj.bonf <- pmin(pvals.raw.theory * length(pvals.raw.theory), 1)   # Capped at 1
    round(cbind(raw = pvals.raw.theory, bonf = pvals.adj.bonf), 5)
    ##                 raw    bonf
    ## (Intercept) 0.00000 0.00000
    ## CRIM        0.00109 0.01522
    ## ZN          0.00078 0.01089
    ## INDUS       0.73829 1.00000
    ## CHAS        0.00193 0.02695
    ## NOX         0.00000 0.00006
    ## RM          0.00000 0.00000
    ## AGE         0.95823 1.00000
    ## DIS         0.00000 0.00000
    ## RAD         0.00001 0.00007
    ## TAX         0.00111 0.01556
    ## PTRATIO     0.00000 0.00000
    ## B           0.00057 0.00802
    ## LSTAT       0.00000 0.00000

  . A sharper method for p-value adjustment: the min-pvalue method
    Adjust the observed p-values for simultaneity by comparing them to simulated
    minimum p-values ('pvals.min.theory' below).
    Thus we ask how many of the simulated minimum p-values fall below an observed p-value.

    Example: min-pvalue adjustment of the Boston Housing coefficients

    # We use lazy programming, based on lm(); takes a while...
    pvals.raw.theory <- summary(lm(MEDV ~ ., data=boston))$coefficients[,"Pr(>|t|)"]
    Nsim <- 9999
    pvals.min.theory <-
      sapply(seq(Nsim),
             function(j) {
               if(j%%500==0) cat(j," ")
               min(summary(lm(rnorm(nrow(boston)) ~ . - MEDV,   # Exclude the response
                              data=boston)
                           )$coefficients[,"Pr(>|t|)"] )
             } ); cat("\n")
    # [More concise code would use replicate() instead of sapply(),
    #  but it can't do the intermediate printing of the counter as easily.]
    #
    # Here the actual adjustments: count the fraction of simulated min-pvalues
    # below the observed p-values:
    pvals.adj.theory <-
      sapply(pvals.raw.theory,
             function(pval) (1+sum(pvals.min.theory <= pval)) / (1+Nsim) )

    # Results for comparison: raw and adjusted p-values:
    round(cbind(pvals.raw.theory, pvals.adj.theory), 5)
    ##             pvals.raw.theory pvals.adj.theory
    ## (Intercept)          0.00000           0.0001
    ## CRIM                 0.00109           0.0138
    ## ZN                   0.00078           0.0094
    ## INDUS                0.73829           1.0000
    ## CHAS                 0.00193           0.0229
    ## NOX                  0.00000           0.0002
    ## RM                   0.00000           0.0001
    ## AGE                  0.95823           1.0000
    ## DIS                  0.00000           0.0001
    ## RAD                  0.00001           0.0002
    ## TAX                  0.00111           0.0141
    ## PTRATIO              0.00000           0.0001
    ## B                    0.00057           0.0071
    ## LSTAT                0.00000           0.0001
    #
    # An issue is that simulation-adjusted p-values cannot produce p-values
    # lower than 1/(Nsim+1) (which they shouldn't), in this case 1/10,000 = 0.0001.
    # When the observed p-value is very small, << 1/(Nsim+1),
    # the adjusted p-value can therefore be larger than the Bonferroni adjustment.
    # It then makes sense to choose the smaller of the two adjustments:
    pvals.adj.theory <- pmin(pvals.adj.theory, pvals.raw.theory * length(pvals.raw.theory))
    #
    round(cbind("Raw"       = pvals.raw.theory,
                "Adj"       = pvals.adj.theory,
                "Adj/Raw"   = pvals.adj.theory / pvals.raw.theory,
                "Frac Bonf" = pvals.adj.theory / pvals.adj.bonf ), 5)
    ##                 Raw     Adj  Adj/Raw Frac Bonf
    ## (Intercept) 0.00000 0.00000 14.00000   1.00000
    ## CRIM        0.00109 0.01290 11.86960   0.84783
    ## ZN          0.00078 0.00850 10.92391   0.78028
    ## INDUS       0.73829 1.00000  1.35448   1.00000
    ## CHAS        0.00193 0.02340 12.15565   0.86826
    ## NOX         0.00000 0.00006 14.00000   1.00000
    ## RM          0.00000 0.00000 14.00000   1.00000
    ## AGE         0.95823 1.00000  1.04359   1.00000
    ## DIS         0.00000 0.00000 14.00000   1.00000
    ## RAD         0.00001 0.00007 14.00000   1.00000
    ## TAX         0.00111 0.01350 12.14426   0.86745
    ## PTRATIO     0.00000 0.00000 14.00000   1.00000
    ## B           0.00057 0.00620 10.82290   0.77306
    ## LSTAT       0.00000 0.00000 14.00000   1.00000
    #
    # ==> Ratios of 14 mean that the Bonferroni adjustment is in effect,
    #     i.e., the simulation could not improve on it.
    #     When the adjusted p-value is 1, the ratio is meaningless.
    #     For this dataset simulation-based adjustment produces
    #     relatively little benefit over Bonferroni adjustment:
    #     simulation adjustment gives a benefit in only 5 cases,
    #     and even for these the benefit is no better than 77% of Bonferroni.

  . Sometimes there is no theoretical p-value available.
    This can happen when the test statistics you want to use have no known null distribution.
    However, you are learning the null distribution approximately
    as you perform a null simulation.
    Shouldn't you be able to perform simultaneous inference based on the min-pvalue approach?
    In fact you can.  Here is how:
    Collect the values of all test statistics in a matrix!
    This can be memory-intensive, but this is about the principle.
    We do this exercise again with plain regression on the Boston Housing data,
    even though here we have theoretical p-values.
    The point is that we can compare simulated with theoretical raw p-values:

    #
    Nsim <- 9999
    X <- model.matrix(lm(MEDV ~ ., data=boston))
    y <- boston[,"MEDV"]
    N <- nrow(X)
    stats.obs <- abs(summary(lm(y ~ X-0))$coefficients[,"t value"])
    names(stats.obs) <- colnames(X)
    # Pre-allocate a matrix for all statistics of all variables:
    stats.null <- matrix(NA, nrow=Nsim, ncol=ncol(X))
    colnames(stats.null) <- colnames(X)
    for(isim in seq(nrow(stats.null))) {
      stats.null[isim,] <- abs(summary(lm(rnorm(N) ~ X-0))$coefficients[,"t value"])
      if(isim%%500==0) cat(isim," ")
    }
    # Takes a few seconds.
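    # Optional aside, a sketch (not required -- the loop above already did the job) of a
    # faster way to fill 'stats.null', avoiding the overhead of summary(lm()) inside the
    # loop; it assumes X has full column rank:
    XtX.inv  <- solve(crossprod(X))                   # (X^T X)^{-1}
    coef.map <- XtX.inv %*% t(X)                      # maps a response y to its coefficients b
    se.unit  <- sqrt(diag(XtX.inv))                   # SEs of b for sigma = 1
    df.res   <- N - ncol(X)                           # residual dfs (X includes the intercept)
    for(isim in seq(nrow(stats.null))) {
      y0 <- rnorm(N)                                  # null response
      b0 <- drop(coef.map %*% y0)                     # coefficients
      s0 <- sqrt(sum((y0 - X %*% b0)^2) / df.res)     # sigma hat
      stats.null[isim,] <- abs(b0 / (se.unit * s0))   # |t|-statistics
    }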
    stats.null[1:5,]

    # Here is how we estimate the raw p-values from the simulation, one test statistic
    # at a time, assuming we don't know the marginal distributions of the test statistics:
    pvals.raw.sim <-
      sapply(colnames(stats.null),
             function(v) (1 + sum(stats.null[,v] >= stats.obs[v])) / (1 + nrow(stats.null)) )
    # [A more efficient way of computing the same uses matrix operations,
    #  but who cares for this size:]
    pvals.raw.sim.2 <- (1 + rowSums(t(stats.null) >= stats.obs)) / (1 + nrow(stats.null))
    all(pvals.raw.sim == pvals.raw.sim.2)
    #
    # These p-values should be compared with 'pvals.raw.theory':
    round(cbind(pvals.raw.sim, pvals.raw.theory,
                simulation.error = 2*sqrt(pvals.raw.sim*(1-pvals.raw.sim)/(1+Nsim))   # 2 sd's of a proportion
                ), 4)
    ##             pvals.raw.sim pvals.raw.theory simulation.error
    ## (Intercept)        0.0001           0.0000           0.0002
    ## CRIM               0.0010           0.0011           0.0006
    ## ZN                 0.0006           0.0008           0.0005
    ## INDUS              0.7313           0.7383           0.0089
    ## CHAS               0.0024           0.0019           0.0010
    ## NOX                0.0001           0.0000           0.0002
    ## RM                 0.0001           0.0000           0.0002
    ## AGE                0.9602           0.9582           0.0039
    ## DIS                0.0001           0.0000           0.0002
    ## RAD                0.0001           0.0000           0.0002
    ## TAX                0.0007           0.0011           0.0005
    ## PTRATIO            0.0001           0.0000           0.0002
    ## B                  0.0006           0.0006           0.0005
    ## LSTAT              0.0001           0.0000           0.0002
    # The agreement between the two columns is within the expected range.

    Here is how you adjust these p-values for simultaneity:
    first make the statistics comparable by running them through their CDFs.
    These have to be estimated, which can be done with the rank transformation.
    We estimate 1-CDF basically by (N+1-rank)/N = 1-(rank-1)/N,
    which is pretty much the empirical CDF based on the simulation.
    By convention, we do this on the simulated and the observed statistics combined,
    the idea being that under the null the observed is just like one of the simulated:
    under the null, each rank is equally likely for the observed.
    [We will place the observed at the top of the matrix.]
    #
    # Combine obs and null stats and rank the columns:
    stats.ranks <- apply(rbind(stats.obs, stats.null), 2, rank)
    # Translate the ranks to the p-value scale using the formula  pval = (N+1-rank) / N.
    # This translates rank=1 to pval=1 and rank=N to pval=1/N:
    stats.pvals <- 1 - (stats.ranks-1)/nrow(stats.ranks)   # denominator = Nsim+1
    # These values will always be >= 1/(Nsim+1) = 1/nrow(stats.ranks).
    # Look at a few rows:
    stats.pvals[1:5,]
    # Now for the simultaneity adjustment:
    # you have to ask what the minimum p-value is in each row:
    stats.pvals.min <- apply(stats.pvals, 1, min)
    # Next you ask each observed p-value how extreme it is among the values
    # in 'stats.pvals.min':
    pvals.adj.sim <-
      sapply(colnames(stats.pvals),
             function(v) sum(stats.pvals.min <= stats.pvals[1,v]) / nrow(stats.pvals) )
             #               ^ sim min pvals ^    ^^ sim obs pval ^^
    # Compare:
    round(cbind(pvals.adj.sim, pvals.adj.theory), 4)
    ##             pvals.adj.sim pvals.adj.theory
    ## (Intercept)        0.0008           0.0001
    ## CRIM               0.0238           0.0161
    ## ZN                 0.0124           0.0116
    ## INDUS              1.0000           1.0000
    ## CHAS               0.0312           0.0260
    ## NOX                0.0008           0.0001
    ## RM                 0.0008           0.0001
    ## AGE                1.0000           1.0000
    ## DIS                0.0008           0.0001
    ## RAD                0.0008           0.0001
    ## TAX                0.0149           0.0169
    ## PTRATIO            0.0008           0.0001
    ## B                  0.0113           0.0091
    ## LSTAT              0.0008           0.0001
    # ==> Considerable agreement between the two adjusted p-values.
    #     The adjustment by simulation has a higher bottom because of ties at
    #     the minimum p-value (~ maximal rank).
    #     Intuitively: the maximal rank is attained in 8 rows,
    #     so the minimum adjusted p-value is 8/(Nsim+1).
    # In this example there was of course no need to estimate the CDF
    # with the rank transformation.
    # However, because a theoretical CDF is available, we can compare with the
    # theoretical adjustment (which also relies on simulation, but not for CDF estimation).

- PS: Bikram's remark: There exists theory and methodology for sequential testing.
  . Sequential Neyman-Pearson lemma (simple vs simple hypotheses)
  . A. Wald's sequential testing based on likelihood ratio tests

- PS: Why the silly Nsim values such as 9999?
  Reason: The observed is treated as one of the null values under the null hypothesis.
  ==> We want Nsim+1 to be round, not Nsim.

  THINK ABOUT THE FOLLOWING IDEA!

    -----------------------------------------------------------------
    |  How would you modify the min-pvalue approach                 |
    |  if you DON'T want to control the FWER, i.e.,                 |
    |      P[ number of false rejections > 0 ] <= alpha             |
    |  but the following criterion instead:                         |
    |      P[ number of false rejections > 1 ] <= alpha             |
    |  I.e., you're ok with one false rejection but not two.        |
    -----------------------------------------------------------------

  Hint: Modify the min-pvalue statistic.
  Check papers by Romano (@ Stanford), or Lehmann & Romano (book).

#====================================================

- Scheffe simultaneous inference:

  . Sometimes we are interested not just in inference for the betaj's,
    but also for linear combinations thereof.
    Consider a categorical variable with 3 groups: base group A and other groups B and C.
    One may be interested in any kind of difference among the groups,
    individually and after arbitrary pooling.  Examples:
      B - A             =  betaB
      C - A             =  betaC
      C - B             =  betaC - betaB
      (B+C)/2 - A       =  1/2*betaB + 1/2*betaC
      (A+C)/2 - B       =  1/2*betaC - betaB
      B*2/3 + A*1/3 - C =  2/3*betaB - betaC
      ...
    So every linear combination  c1*betaB + c2*betaC  represents a generalized
    difference among the groups A, B, C:
      (-c1-c2)*A + c1*B + c2*C  =  c1*betaB + c2*betaC              (*)
    [Coefficient vectors such as (-c1-c2,c1,c2) that sum to zero are called
     "contrast vectors", and the linear forms (*) they generate are called "contrasts".
     If A is the reference group, then the coefficients reduce to arbitrary
     vectors (c1,c2) forming a linear space.]
    One would like to search through arbitrary meaningful contrasts and assess
    their statistical significance with RIs and CIs.
    Note that the multiple regression may include other predictors than this one
    categorical variable (i.e., the two dummy predictors the three groups require),
    hence we are interested in linear forms of the betas with coefficient vectors
    that form a subspace of R^{p+1}.

- Summary statement of the purpose of Scheffe simultaneous inference:

  . Consider linear forms of regression coefficients:
      c0*beta0 + c1*beta1 + ... + cp*betap                          (**)
    for (c0,c1,...,cp) in a linear subspace of R^{p+1}.
    Scheffe provides simultaneous RIs and CIs for all such derived parameters.

  . Scheffe provides simultaneously valid inference by forming t-statistics
    for all derived parameters (**) and performing the max-|t| approach
    for these uncountably many t-statistics.

  . This sounds scarier than it is: the max-|t| distribution turns out to be a
    root-F distribution, and the F derives from the F-test of the hypothesis
    that all (**) are zero, as we shall see now.

- Technical execution of Scheffe simultaneous inference:

  . Let V be a linear subspace of R^{p+1}.  Denote its elements by c = (c0,c1,...,cp)^T.
    In the above example, if there are predictors other than the categorical variable,
    then the vectors 'c' would be zero except in the slots of betaB and betaC:
      V = { (0,0,...,0,cB,cC,0,...,0)^T | cB,cC in R }
  . Translate the estimate 'c^T b' of the parameter 'c^T beta' into a linear form
    of y rather than of b.
    Because each bj is a linear form of y through the adjusted-predictors formula,
    the linear form
      c^T b = c0*b0 + c1*b1 + ... + cp*bp
    will also be a linear form of y.
    This time it is easier to go through the triple-X formula, though:
      b     = (X^T X)^{-1} X^T y
      c^T b = c^T (X^T X)^{-1} X^T y
      ==>  C := X (X^T X)^{-1} c ,   c^T b = C^T y
    C contains the coefficients of the linear form of y that produces
    the estimate c^T b of c^T beta.
      W := { C = X (X^T X)^{-1} c | c in V }
    is also a linear space, but it is a subspace of R^N, in fact a subspace of cspan(X).

  . What is the t-statistic for testing H0: c^T beta = 0 ?
    Answer:   t_C = <y, C/|C|> / s

  . How can we achieve simultaneity of inferences for all C in W?
    Answer: max-|t| over all C in W:
      max_{C in W\0} |t_C| = ???
    This problem can be solved explicitly through geometry.
    Note that we are looking for the unit vector C/|C| in W with the largest cosine,
    i.e., smallest angle, to y.
      ==> C ~ orthogonal projection of y onto W
    Let H_W be the orthogonal projection onto W.
      max_{C in W\0} |t_C| = max_{C in W\0} |<y, C/|C|>| / s
                           = |<y, H_W y / |H_W y|>| / s
                           = <H_W y, H_W y> / |H_W y| / s
                           = |H_W y| / s
    NOTE: The null distributions are
      (|H_W y|^2 / dim(W)) / s^2   ~   F(dim(W), N-p-1)
       |H_W y|^2 / s^2             ~   F(dim(W), N-p-1) * dim(W)
      ==>  |H_W y| / s             ~   sqrt( F(dim(W), N-p-1) * dim(W) )
      ==>  Use 1-alpha quantiles as multipliers of the standard error of c^T b.
    Keep in mind we are working under a null assumption:  y = eps ~ N(0, sig^2 I).

    Standard error of c^T b:
      V[c^T b] = c^T V[b] c = c^T (X^T X)^{-1} c * sigma^2

    CI for c^T beta:

      ------------------------------------------------------------
      |  [ c^T b  +-  q * sqrt(c^T (X^T X)^{-1} c) * s ]          |
      |                                                           |
      |  where  q = sqrt( qf(1-alpha, dim(W), N-p-1) * dim(W) )   |
      ------------------------------------------------------------

    Recall: dim(W) = dim(V) = dimension of the space of contrasts.
    Reason: We assume X has full rank, else (X^T X)^{-1} would not exist.
    (A small worked example in R follows below.)

    Gambling guarantee: We never know whether 'c^T beta in [ ... ]' is true
    for our dataset, but it is the case for a fraction 1-alpha of datasets.

      P[ c^T beta in [ ... ]  forall c in V  |  beta is true ]  =  1-alpha
         ^                    ^^^^^^^^^^^^^
         Fraction of datasets.  Simultaneity over contrasts with coefficient vectors c in V.
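    [A small worked sketch of the boxed formula, using a hypothetical toy dataset
     (groups A, B, C and one extra numeric predictor 'z', all simulated here):
     a simultaneous 95% Scheffe CI for the pooled contrast
     (B+C)/2 - A = 0.5*betaB + 0.5*betaC, with dim(W) = dim(V) = 2.]

      set.seed(1)
      grp <- factor(rep(c("A","B","C"), each=30))   # three groups, A = reference
      z   <- rnorm(90)                              # an additional numeric predictor
      y   <- 1 + 0.5*(grp=="B") + 1.0*(grp=="C") + 0.3*z + rnorm(90)
      fit <- lm(y ~ grp + z)                        # coefficients: (Intercept), grpB, grpC, z
      XX  <- model.matrix(fit)
      cc  <- c(0, 0.5, 0.5, 0)                      # coefficient vector c of the contrast
      est <- sum(cc * coef(fit))                    # c^T b
      s   <- summary(fit)$sigma                     # sigma hat
      se  <- drop(sqrt(t(cc) %*% solve(crossprod(XX)) %*% cc)) * s   # sqrt(c^T (X^T X)^{-1} c) * s
      q   <- sqrt(2 * qf(0.95, 2, df.residual(fit)))                 # dim(W) = 2 (all contrasts in betaB, betaC)
      est + c(-1,1) * q * se                        # simultaneous 95% Scheffe CI for (B+C)/2 - A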
- Comparison of interval widths:

  . Graphical illustration:

    # Residual dfs: arbitrary choices, here those of the Boston Housing data
    df.res <- 506-13-1
    # Number of tests/CIs, or dimension for Scheffe:
    df.num <- 1:25   # Boston: 13 if all real regressors (no intercept) are included
    # Simultaneous significance level, 1-alpha = simultaneous coverage:
    alpha <- 0.05
    # Scheffe multipliers:
    qu <- sqrt(df.num*qf(1-alpha, df.num, df.res))
    plot(df.num, qu, pch=16, ylim=c(0,max(qu)), ylab="Multiplier",
         xlab="Dim of contrast space or Number of tests")
    # No multiplicity adjustment: same multiplier regardless
    qu <- rep(qt(alpha/2, df.res, lower=F), length(df.num))
    points(df.num, qu, col="gray", pch=16)
    # Exact multiplier for independent tests (~ incorrect in linear models due to the shared s):
    qu <- qt((1-(1-alpha)^(1/df.num))/2, df.res, lower=F)
    points(df.num, qu, col="blue", pch=16)
    # Bonferroni:
    qu <- qt(alpha/df.num/2, df.res, lower=F)
    points(df.num+.1, qu, col="green", pch=16)
    # Max-|t| for the Boston Housing data, for the first 1, 2, 3, ..., 13 predictors in order:
    Nsim <- 99999;  N <- nrow(X.adj)
    t.cmax <- matrix(NA, nrow=Nsim, ncol=ncol(X.adj))   # Pre-allocate a matrix for the cumulative max-|t| values.
    pp <- 1:ncol(X.adj)
    for(isim in seq(Nsim)) {
      t.cmax[isim,] <- cummax(abs(crossprod(X.adj, rnorm(N)))) /
                       sqrt(rchisq(1,df=N-p-1)/(N-p-1))   # one shared s for all 13 t-statistics
      if(isim%%10000==0) cat(isim," ")
    }
    # Takes quite a few seconds.
    qu <- apply(t.cmax, 2, quantile, 0.95)
    points(pp, qu, col="red", pch=16)
    # Legend:
    legend("topleft",
           rev(c("Unadjusted","Max-|t| (Boston)","Independence","Bonferroni","Scheffe")),
           text.col=rev(c("gray","red","blue","green","black")) )

    ==> Adjustment with Bonferroni and adjustment for exact independence are very close.
        Scheffe intervals are the widest because they provide the strongest guarantees.

  . Asymptotic comparison of interval widths:

    + Independent max-|z|: the multiplier of standard error grows with
      the number 'ntest' of comparisons like
        ntest <- 1:25
        sqrt(2*log(ntest))
      This is an asymptotic upper bound on independent max-|z| situations.
      (This type of large-'ntest' asymptotics follows from extreme value theory
       of the normal distribution.)

    + Scheffe: the multiplier of standard error grows with the dimension dim(V)
      of the contrast space like
        sqrt(dim(V))
      Thus Scheffe costs a lot more than finite max-|t| for large 'ntest'.
      (Think of the width of CIs and RIs as 'cost'.)

  . Another geometric intuition:
    + Finite max-|t| describes boxes in q-space.
    + Scheffe describes balls in q-space.
    As q grows, the radius of the balls needs to be a lot bigger than the
    half-width of the boxes to capture the same coverage probability.

- A particular application of Scheffe intervals: inference for conditional means

  . Recall the conditional mean at location xx in predictor space:
      mu(xx) = E[y|xx] = xx^T beta
  . Estimate:
      yhat(xx) = xx^T b
  . Squared standard error:
      V[yhat(xx)] = xx^T (X^T X)^{-1} xx * sigma^2
  . CI for mu(xx):
      CI(xx) := [ yhat(xx) +- q*sqrt(xx^T (X^T X)^{-1} xx)*s ]
  . Guarantee at a fixed xx:
      P[ mu(xx) in CI(xx) ] = 1-alpha    where  q = qt(alpha/2, N-p-1, lower=F)      (t-distribution!!!)
  . Simultaneous guarantee for ALL xx: this actually requires simultaneity over ALL xx!
      P[ mu(xx) in CI(xx) for all xx ] = 1-alpha    for what q ???
    Scheffe:  q = sqrt( p * qf(alpha, p, N-p-1, lower=F) )                           (root-F distribution!!!)
    Reasoning: We want simultaneous inference for all mu(xx) = xx^T beta,
    hence the coefficient vectors are c = xx.
    Minor kink: the vectors xx = (1,x1,...,xp)^T do not form a linear space.
    ==> Remove the intercept by centering the columns of X and y.
        Write the new location coordinates again as xx, but xx = (x1,...,xp)^T.
        Now the xx form a linear space of dimension p.
        Apply Scheffe with dimension dim(W) = p.
    One more hitch: We also need to assume that the predictors are interval scale
    (making the whole real line accessible to them).
    For ratio scale or probability scale or dummy or counting scales,
    we do not obtain a full space of 'coefficients' formed by the location vectors xx.

    More details:
      xx^T betahat = xx^T (X^T X)^{-1} X^T y
      C(xx) = X (X^T X)^{-1} xx
    We really only need its normalized version to form
      t(xx) = < C(xx)/|C(xx)|, eps > / sigmahat
    Forming the max-|t| we have
      max_xx |t(xx)| = max_xx |< C(xx)/|C(xx)|, eps >| / sigmahat
    What kind of set is formed by
      S := { C(xx)/|C(xx)| : xx = (x1,x2,...,xp) } ?
    It is a subset of the unit sphere in R^N, but its intrinsic dimensionality
    is no more than p.
    (After empirically centering the columns of the design matrix,
     the origin of xx-space corresponds to the p-tuple of empirical means.)
    Fact: If the potential values of xx-space contain an open neighborhood of the origin,
          then the set S is a (p-1)-sphere in a p-dim subspace of N-space.
    ==> All continuous scales (interval, ratio, probability/frequency) result in
        full Scheffe inference if simultaneity is desired for ALL POSSIBLE LEGITIMATE
        locations in predictor space.
    ==> Discrete scales (dummy, counting) result in a set S that consists of finitely
        or countably many points that are not dense in the (p-1)-sphere in a p-dim
        subspace of N-space.  This case does not require full Scheffe inference.
        Ex: 5 dummy variables form a set of 32 possible combinations of 0 and 1 values,
            hence their set S consists of just 32 points.
    (A sketch of Scheffe bands for the Boston fit follows below.)
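    [A minimal sketch of Scheffe bands for the conditional mean of the Boston fit,
     evaluated at the observed locations and using the root-F multiplier derived above;
     the pointwise t-intervals are shown for comparison.]

      fit <- lm(MEDV ~ ., data=boston)
      p.fit  <- length(coef(fit)) - 1                        # 13 predictors, excluding the intercept
      df.fit <- df.residual(fit)                             # N - p - 1 = 492
      pred <- predict(fit, se.fit=TRUE)                      # yhat(xx) and SE[yhat(xx)] at the observed xx
      q.point   <- qt(0.975, df.fit)                         # pointwise 95% multiplier
      q.scheffe <- sqrt(p.fit * qf(0.95, p.fit, df.fit))     # Scheffe multiplier, dim(W) = p
      cbind(point.lo   = pred$fit - q.point   * pred$se.fit,
            point.hi   = pred$fit + q.point   * pred$se.fit,
            scheffe.lo = pred$fit - q.scheffe * pred$se.fit,
            scheffe.hi = pred$fit + q.scheffe * pred$se.fit)[1:5,]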
#====================================================

- Implication of FWER control: "STRONG ERROR CONTROL"  (often, not always)

  Assume you performed a large number p of t-tests:  t1, t2, ..., tp.
  Assume you used a threshold K for simultaneous inference based on max-|t|:

    P[ max |tj| > K | beta1 = beta2 = ... = betap = 0 ]  =  alpha       (*)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      Global null hypothesis (excludes testing the intercept beta0)

  Problem:

    -----------------------------------------------------
    |  In actual data analysis, some of the tj's will   |
    |  have resulted in rejections, others not.          |
    |                                                    |
    |  What can we say about the rejections?             |
    |  Will they be true alts with high probability?     |
    |                                                    |
    |  Conversely, what can we say about retentions?     |
    |  Will they be true nulls with high probability?    |
    |                                                    |
    |  If we use conservative statistical inference      |
    |  calibrated under a global null hypothesis,        |
    |  what does this imply for a mix of nulls and alts? |
    -----------------------------------------------------

  FRAMEWORK: Illustrated with regression, but works for general multiple testing situations.

  INDEX SETS for the parameters/regression coefficients being tested:

    H00 := { 1, 2, ..., p }                    indices of the GLOBAL null   (UNREALISTIC)
    H0  := { j : betaj == 0, 1 <= j <= p }     indices of TRUE NULLS        (UNKNOWN !!!!!!!!!!!!)
    H1  := { j : betaj != 0, 1 <= j <= p }     indices of TRUE ALTS         (UNKNOWN !!!!!!!!!!!!)

  FWER control:

    ---------------------------------------------------------------------
    |  sup_{P in H00.dist} P[ ( max_{j in H00} |tj| ) > K ]  <=  alpha  |
    ---------------------------------------------------------------------

  Note:  H00 = H0 U H1   (union of index sets)

  The following we do know from the data -- they are the results of our testing:

    H0.hat := { j : |tj| <= K }     Retentions
    H1.hat := { j : |tj| >  K }     Rejections

  Both are RANDOM INDEX SETS !!!!!!!!!!  They differ from dataset to dataset.

  MIND GAME: We do not know the true state of nature, P(dy) = N(X beta, sig^2 I),
  nor the true H0 and H1 obtained from the true beta = (beta0,beta1,...,betap)^T.
  But we hope that H0.hat and H1.hat can inform us about H0 and H1
  with high probability (i.e., for most datasets).

    ------------- THEOREM: STRONG ERROR CONTROL --------------
    |                                                        |
    |  . All rejections are true alts with high probability: |
    |                                                        |
    |       P[ H1.hat subset of H1 ]  >=  1-alpha      (1)   |
    |                                                        |
    |  . All true nulls are retained with high probability:  |
    |                                                        |
    |       P[ H0 subset of H0.hat ]  >=  1-alpha      (2)   |
    |                                                        |
    ----------------------------------------------------------

  FACT:  (1) <==> (2)
  The equivalence follows from simple set theory:
    A subset of B  <==>  B^C subset of A^C   (complements)

  NOTE: These results are all we can hope for in this generality.

  - There is no way to find a guarantee of the following kind:

      P[ H1 subset of H1.hat ]  >=  1-alpha      (1')     WRONG, WRONG, WRONG !!!!

    That is, a guarantee that our rejections contain ALL true alternatives.
    Q: Why is this impossible?
    A: Alternatives can be very weak, e.g., betaj close to 0 but still != 0.
       Realistic sample sizes might be insufficient to reach rejection for such betaj.
       [Recall that standard errors converge to zero, but slowly, ~ 1/sqrt(N).]

  - Equivalently, there is no way to find a guarantee such as

      P[ H0.hat subset of H0 ]  >=  1-alpha      (2')     WRONG, WRONG, WRONG !!!!

    That is, a guarantee that ALL our retentions are true nulls.
    Q: Why is this impossible?
    A: Some retentions may arise from weak alternatives, betaj ~ 0 but != 0,
       whose rejection is unattainable with realistic sample sizes.

  - In practice we can do power calculations for |betaj| > eps > 0.
    I.e., we decide what kind of 'effect size' |betaj| > eps is worth looking for.
    We can then calculate a minimal sample size to have a prescribed power
    (= prob of rejection under a worthwhile alternative, |betaj| > eps > 0).

  FOR THE PROOF OF STRONG ERROR CONTROL, DEFINE SETS OF DISTRIBUTIONS:

    Full Model = { P(dy) = N(X beta, sig^2 I):  beta in Reals^{p+1}, sig > 0 }
    H0.dist    = { P(dy) = N(X beta, sig^2 I):  betaj == 0 for j in H0,  sig > 0 }
    H00.dist   = { P(dy) = N(X beta, sig^2 I):  betaj == 0 for j in H00, sig > 0 }

                  -----------------------------------------------------
    IMPORTANT:    |  H0 subset H00, but H0.dist superset of H00.dist  |    !!!!!!!!!
                  -----------------------------------------------------

    H0 constrains fewer parameters than H00,
    hence H0.dist has more distributions than H00.dist.  In fact:
      H00.dist = { N(x0*beta0, sig^2 I):  beta0 in Reals, sig > 0 }

    NOTE: H0.dist contains distributions with betaj == 0 for j in H1.  That's ok.
          We actually don't need H0 to be the exact set of true nulls;
          H0 can be any subset of the true nulls.

  PROOF:

    inf_{P in H0.dist}  P[ H0 subset of H0.hat ]
       =  inf_{P in H0.dist}  P[ ( max_{j in H0} |tj| )  <= K ]
      ?=  inf_{P in H00.dist} P[ ( max_{j in H0} |tj| )  <= K ]       <--  ????  see SUBTLETY below
      >=  inf_{P in H00.dist} P[ ( max_{j in H00} |tj| ) <= K ]       from FWER control over H00
      >=  1 - alpha

  SUBTLETY:
    . To make the proof work, we need an argument for turning '?=' into '>='.
    . However, at that stage we only have the opposite inequality: '?=' is '<='.
      Reason: Because H0.dist is a superset of H00.dist,
              the inf over H0.dist is <= the inf over H00.dist,
              which is the wrong direction from what we need.
    . Yet if suitable PIVOTALITY holds, we get '==' !!!  Here is how:
      The vector of null t-statistics, i.e., { tj: j in H0 },
      has a joint distribution that does not depend on the nuisance values betaj for j in H1.
      Reason: { xj.adj: j in H0 }  _|_  { xk: k in H1 }
      ==> The numerator distributions of the tj do not depend on  sum_{k in H1} xk betak.
          This nuisance part of the model is orthogonalized out by xj.adj for j in H0
          in the numerators of the tj.
      ==> For j in H1, we can set betaj == 0, and the null distribution of
          { tj: j in H0 } is the same.
      ==> '?=' above is not only '<=' but '=='.
    . While there is no deep math here, there is subtlety in the proof.

    . Counter-example where pivotality is violated: correlations!
        T12 = corr(X1,X2);  T13 = corr(X1,X3);  T23 = corr(X2,X3)
      These are sample correlations used as test statistics.
        Null hypothesis:  C(X1,X2) = 0  and  C(X1,X3) = 0
        Global null:      C(X1,X2) = 0  and  C(X1,X3) = 0  and  C(X2,X3) = 0
      [C(.,.) is the true population correlation.]
      Why is the joint distribution of T12 and T13 affected by
      whether C(X2,X3) = 0 is true or not?
      Think of X1 as the response.  Then T12 and T13 are actually equivalent to
      t-statistics for simple regressions of X1 on X2 and of X1 on X3, respectively.
      Now imagine that X2 and X3 are highly collinear,
      i.e., their joint distribution is VERY far from the null hypothesis C(X2,X3) = 0.
      T12 and T13 will then be VERY highly correlated with each other.
      (Of course they are correlated even when C(X2,X3) = 0, but less so.)
      At any rate, the joint distribution of T12 and T13 is affected by
      whether the larger null hypothesis is true or not.

- Challenge question: Can one generalize the strong error control result
  from max-|t| to max2-|t|, where max2-|t| is the second largest |tj|?
  The desired result would be that, whatever the set of true nulls is,
  at most one true null fails to be retained (i.e., is falsely rejected):

    P[ | H0 \ H0.hat | <= 1 ]   >=  1-alpha     (2 - Miss-1 Strong Error Control)
    P[ H0 subset of H0.hat ]    >=  1-alpha     (2 - Original Strong Error Control)

  Conversely:

    P[ | H1.hat \ H1 | <= 1 ]   >=  1-alpha     (1 - Miss-1 Strong Error Control)
    P[ H1.hat subset of H1 ]    >=  1-alpha     (1 - Original Strong Error Control)

- Next would be: FDR -- False Discovery Rate
  ==> An extremely important concept that provides narrower intervals
      with a different guarantee.
      But it works best under the assumption of independent tests.

================================================================