PSGLM The routines included here implement the projected score method for o generalized linear models with o canonical links for data sets with o independent strata. The objective is to do conditional inference, with the conditioning event being the strata totals. Examples are conditional logistic regression, Rasch models, adjustment for ties in Cox proportional hazard model, exponential regression ... Essentially the strata model implemented here allows for a strata specific parameter allowing each subject/strata to have an individual intercept -- this is one way of accounting for observed heterogeneity. Contents: README this file. Code directory. psglm.q: the main routine auxpsglm.q: support routines -- the numerical differentiation and cumulant calculations. famtoo.q: implements cumulants and familytoo classes plus print methods for all these functions. summary for psglm. Help directory -- help files for the following functions. psglm cumulants familytoo Examples directory. coxreg.q conditional logistic regression for a log odds ratio raschwei.q conditional estimates for a Rasch model. paranoia.q just me making sure that regular glm's are a special case. Useful for bug checking. References are in the psglm help file in the Help directory. Some reasons for using this fitting method. o Because you like conditional inference. o Subject specific parameter interpretations. o Accounts for heterogeneity. o Better small sample properties than full maximum likelihood. o Makes no assumption about the strata parameters. Mixture models put iid assumptions at some stage on these strata parameters. If you believe that there is some sort of inclusion bias then conditional inference may be a good idea. The projected-score method involves higher order asymptotics so I made a cumulant class that returns the first four cumulants. Also I extended the family class to a "familytoo" class that simply adds the first four cumulant functions to the original family. There are print methods for these classes. I think of the data as a matrix and therefore the covariates as a three dimensional array. The psglm function requires the data to be arranged as a matrix -- quite convenient for longitudinal type data. The design array can either be entered as a matrix or an array. A matrix is appropriate when the within strata covariates are identical for all strata. Convergence issues. Some reasons for problems. o Uninformative strata in the conditional sense. Fix: remove them. o Very slow convergence. Fixes: Make the strata covariate matrices orthogonal to the 1 vector. (Can square root the number of steps to convergence.) Toggle between the two types of information matrices. Using the correct information matrix may not always be advantageous if the starting values are not good. Other comments. Why not write this as a "method = psglm.fit" for the original glm function? I am solving a modified equation rather than maximizing a likelihood. Deviance is not well defined. Very different in this respect, many of the glm components would be NULL. I want y to be matrix and z a three way array. Why do all of this in S. It's so slow. Perversely, that was my goal, to see if I could implement this entirely in S. (I've got separate stand alone Fortran routines that require the BLAS) I've avoided loops for the most part, vectorizing whenever possible. For example with 5000 strata and 40 columns I get (on my shared HP 735) unix.time(summary(psglm(y=y,x=x,family=poisson))) [1] 364.5 1.97 392.0 0.00 0.00 A little over five minutes. How fast would an EM algorithm go for a mixed model? That's estimating 5000+ parameters even though I'm only interested in a small subset of them. I think that it is at the limit though, almost any extension whatsoever is going to require calls to some compiled routines, particularly for the numerical differentiations. To do: na.action Filter missing data and non-informative strata. include non-canonical links allow for more general conditioning events, ie conditioning on interesting contrasts, margins of multiway contingency tables for example. Gaussian stays boring but Poisson families are more interesting though. provide confidence intervals for univariate parameters by inverting score tests -- doing the Wald thing is bad. Any suggestions? email: waterman@compstat.wharton.upenn.edu