############### HOMEWORK 1, STAT 541, DUE THU, SEPT 18, 2009, 12 NOON ###############

# YOUR NAME: ....

# RULES:
#
# 1) You can discuss the homework with each other in general terms,
#    but you must write your own solutions and not copy from anyone.
#    You must not consult previous years' solutions that might be similar.
#    You are under honor code with respect to these two stipulations!
#
# 2) Your R code is not permitted to use loops or the 'apply' function.
#
# 3) Work your way through the R intro up to the point before 'LISTS'
#    but including 'ARRAYS'.
#
# 4) Edit your answers into this file following each "ANSWER:".
#
# 5) Send questions and your solutions (in attachments) to
#       stat541.at.wharton[at-sign]gmail.com
#    The solutions file must have extension '.txt', NOT '.doc'.
#    The filename should have this format:  LastName-FirstName-hw01.txt
#    An example would be:  Buja-Andreas-hw01.txt
#
# 6) Essential criteria for grading, in this order of importance:
#    - Generalizability of the solution: Generalizability refers to how
#      difficult it would be to modify the code to work for larger or
#      slightly different problems.
#    - Full use of R's expressive power, or "Thinking in the Language":
#      If among two solutions one looks more like C or Perl code than R code,
#      it will be the less preferable one.
#    - Conciseness: Among two solutions, usually the shorter is preferable.


# PROBLEM 0:
# Set yourself up with R on your computer.
# Follow the instructions in Stat-541-R-intro.R from the class website.


# PROBLEM 1:
# What happens in this expression?   1:10 - 0:1
# ANSWER:
## The 2-vector 0:1 gets recycled to length 10, then subtracted from 1:10.


# PROBLEM 2:
# Create a vector consisting of the numbers 1 to N, where 1 appears
# once, 2 appears twice, 3 appears 3 times,...
# Show results for N=5.
# ANSWER:
N <- 5
rep(1:N, 1:N)


# PROBLEM 3:
# Evaluate the expressions:   -3:5;  (-3):5;  -(3:5)
# What does this tell you about operator precedence of '-' and ':'?
# By comparison, evaluate this:   1-3:5;  1-(-3):5;  1-(3:5)
# Comment? If you had been the creator of R,
# would you have designed operator precedence this way?
# ANSWER:
## Unary '-' binds more tightly than ':', but binary '-' does not.
## Some may object, but here is what speaks in favor of this design:
## - In most programming languages unary operations have precedence over binary ones.
## - For most people precedence of ':' over binary '-' "feels right"
##   because then an expression such as 2:4-3:5 means (2:4)-(3:5),
##   whereas with precedence of binary '-' over ':' the expression
##   would have to be illegal:  2:(4-3):5 == ???
## For those who insist on uniform precedence of unary and binary '-'
## in relation to ':' something has to give:
## - Give up on precedence of unary over binary in this case, so that
##   -3:5 is -(3:5), or
## - give binary '-' precedence over ':' and hence make 2:4-3:5 illegal.
##
## As it is, we are forced to strictly distinguish between unary and binary '-'.
## Still, for human readability, it is better not to rely on rules of precedence
## too much, especially when they are open to debate.  In this case it would
## be friendly to readers to write unary -3:5 as (-3):5 even though they are the same.
## AB: Justin Rising and William Stacey contributed to clarifying some of these issues.


# PROBLEM 4:
# Create a vector of integers between 1 and N
# that are neither divisible by 3 nor by 5.
# Show results for N=100.
# ANSWER:
N <- 100
a <- 1:N
a[a%%3!=0 & a%%5!=0]
## This is probably the most readable solution.
## Jun Chen abbreviates it by using implicit coercion to replace '!=0':
a[a%%3 & a%%5]
## Two more solutions, based on exclusion:
a[-c(seq(3,N,by=3),seq(5,N,by=5))]
setdiff(a, c(seq(3,N,by=3),seq(5,N,by=5)))
## Sometimes one finds different ways of thinking, such as in this solution,
## which recycles a logical pattern of length 15 = 3*5:
a[c(T,T,F,T,F,F,T,T,F,F,T,F,T,T,F)]
## This generalizes to any N, but not easily to other divisors than 3 and 5.
## All solutions produce:
 [1]  1  2  4  7  8 11 13 14 16 17 19 22 23 26 28 29 31 32 34 37 38 41 43 44 46
[26] 47 49 52 53 56 58 59 61 62 64 67 68 71 73 74 76 77 79 82 83 86 88 89 91 92
[51] 94 97 98


# PROBLEM 5:
# Create a vector of length N that contains 1 +
# the cumulative sums of reciprocals of cumulative products
# of the integers 1,2,...,N.  Show results for N=10.
# What does the vector converge to?
# (Find functions for cumulative products and sums.)
# ANSWER:
N <- 10
1 + cumsum(1/cumprod(1:N))
 [1] 2.000000 2.500000 2.666667 2.708333 2.716667 2.718056 2.718254 2.718279
 [9] 2.718282 2.718282
## Converges to the number 'e', Euler's number (the base of the natural logarithm).
## It is incorrect to say that the sequence converges to 2.718282. (Why?)


# PROBLEM 6:
# Create a vector of the vowels and a vector of the consonants
# from the standard dataset 'letters' or 'LETTERS'.
# Then create code that generates random 3-letter 'words'
# consisting of a random vowel, a random consonant, and
# another random vowel, in this order.  Example: 'aba'.
# Show the results for N=10 'words'.
# ANSWER:
sel <- c(1,5,9,15,21)
vow <- letters[sel]
cons <- letters[-sel]
## or:
vow <- c('a','e','i','o','u')
cons <- setdiff(letters, vow)
## or: Thanks to Fan Li
vow <- grep("[aeiou]",letters,value=T)
cons <- grep("[^aeiou]",letters,value=T)
## Here are ten 'words' consisting of random vowel-consonant-vowel:
N <- 10
paste(sample(vow,N,repl=T), sample(cons,N,repl=T), sample(vow,N,repl=T), sep="")


# PROBLEM 7:
# Create a vector containing all possible pairs of
# consonants and vowels (in this order).  Example: 'xa'.
# ANSWER:
## One approach is to repeat each vowel as many times
## as there are consonants, and match each with a consonant:
paste(cons, rep(vow,rep(length(cons),length(vow))), sep="")
## Note that the first argument gets recycled to the length of the second.
## If you know the 'outer' function, here is an elegant solution:
c(outer(cons, vow, paste, sep=""))
## 'c()' strips the matrix attribute.
## Here is a solution due to Jose Zubizarreta that relies on a function
## used for factorial experimental designs to list all combinations of
## levels of two factors:
x <- expand.grid(cons, vow);  paste(x[,1],x[,2],sep="")
## All of the above create the following:
  [1] "ba" "ca" "da" "fa" "ga" "ha" "ja" "ka" "la" "ma" "na" "pa" "qa" "ra" "sa"
 [16] "ta" "va" "wa" "xa" "ya" "za" "be" "ce" "de" "fe" "ge" "he" "je" "ke" "le"
 [31] "me" "ne" "pe" "qe" "re" "se" "te" "ve" "we" "xe" "ye" "ze" "bi" "ci" "di"
 [46] "fi" "gi" "hi" "ji" "ki" "li" "mi" "ni" "pi" "qi" "ri" "si" "ti" "vi" "wi"
 [61] "xi" "yi" "zi" "bo" "co" "do" "fo" "go" "ho" "jo" "ko" "lo" "mo" "no" "po"
 [76] "qo" "ro" "so" "to" "vo" "wo" "xo" "yo" "zo" "bu" "cu" "du" "fu" "gu" "hu"
 [91] "ju" "ku" "lu" "mu" "nu" "pu" "qu" "ru" "su" "tu" "vu" "wu" "xu" "yu" "zu"
## WARNING: The following creates all the correct pairs (in a different order)
## but is NOT a solution:
paste(rep(cons,length(vow)), vow, sep="")
## This generates all the pairs only because 21=length(cons) and 5=length(vow)
## are relatively prime.  If one includes 'y' among the vowels, the scheme goes bad:
sel <- c(1,5,9,15,21,25)
vow <- letters[sel]
cons <- letters[-sel]
unique(paste(rep(cons,length(vow)), vow, sep=""))
## Why are we getting only half the combinations (60 of 120)?  Because GCD(6,20)=2.
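
## A minimal sketch (not part of the original solutions) that checks the GCD
## explanation in the Problem 7 warning numerically.  The helper names 'gcd'
## and 'n.pairs' are hypothetical, introduced only for this illustration.
## Assumption being checked: pairing two recycled vectors of lengths m and n
## visits only m*n/gcd(m,n) distinct index pairs, so all m*n combinations
## appear only when m and n are relatively prime.
gcd <- function(m, n) if (n == 0) m else gcd(n, m %% n)   # Euclid's algorithm
n.pairs <- function(m, n) {
  k <- 0:(m*n - 1)                          # positions in the recycled vectors
  length(unique(paste(k %% m, k %% n)))     # number of distinct index pairs
}
n.pairs(21, 5) == 21*5 / gcd(21, 5)   # 21 consonants, 5 vowels: all 105 pairs
n.pairs(20, 6) == 20*6 / gcd(20, 6)   # 20 consonants, 6 vowels: only 60 of 120
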


# PROBLEM 8:
# Create the following matrix elegantly:
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    1    0    1    0    1    0    1    0     1
 [2,]    1    0    1    0    1    0    1    0    1     0
 [3,]    0    1    0    1    0    1    0    1    0     1
 [4,]    1    0    1    0    1    0    1    0    1     0
 [5,]    0    1    0    1    0    1    0    1    0     1
 [6,]    1    0    1    0    1    0    1    0    1     0
 [7,]    0    1    0    1    0    1    0    1    0     1
 [8,]    1    0    1    0    1    0    1    0    1     0
 [9,]    0    1    0    1    0    1    0    1    0     1
[10,]    1    0    1    0    1    0    1    0    1     0
# ANSWER:
## The following all work:
N <- 10
matrix(1:0,N+1,N)[-1,]
(matrix(1:N, nrow=N,ncol=N, byrow=T)+1:N) %% 2     # Thanks to Ceyhun Eksin
x <- 1:N %% 2;  matrix(c(1-x,x), N, N)
x <- 1:N %% 2;  cbind(1-x,x)[,rep(c(1,2),N/2)]
x <- 1:N %% 2;  matrix(c(1-x,x), nrow=N, ncol=N)   # Thanks to Adam Kapelner
x <- rep(0:1, 5);  abs(outer(x, x, FUN="-"))       # Thanks to Jordan Rodu
x <- matrix(ncol=N,nrow=N);  (row(x)-col(x))%%2
x <- matrix(ncol=N,nrow=N);  (row(x)+col(x))%%2
x <- cbind(c(0,1),c(1,0));  i <- rep(1:2,N/2);  x[i,i]
## The last three solutions are probably the most generalizable,
## in different ways.  The first two generalize to other
## diagonal repeat patterns by working off (row(x)-col(x))
## or (row(x)+col(x)), such as
x <- matrix(ncol=N,nrow=N)   # restore the N x N template overwritten above
abs(row(x)-col(x))%%3
(row(x)+col(x)-2)%%3
## whereas the third generalizes to other translational
## repeat patterns such as
x <- cbind(c(0,1),c(2,3));  i <- rep(1:2,5);  x[i,i]
## Finally, the funniest solution of them all, due to Emil Pitkin:
outer(rep(7:8,5),rep(c(7,2),5),FUN="%%")


# PROBLEM 9:
# Create a 6x3 matrix named 'dat' with
# row names "Adam", "Anna", "Bill", "Berta", "Chris", "Cindy"
# and column names "Age", "Gender", "Height".
# Fill the columns with realistic-looking random numbers
# so someone could actually believe they are real data.
# Use numeric codes 0 and 1 for gender.
# Assume the ages to be uniformly distributed over 18:24.
# Assume men's mean height is 68in and women's 4in less,
# but both heights have the same sdev.
# Finally assume normal distributions for heights,
# except the numbers should be whole inches to look realistic.
# ANSWER:
## Six rows, one per name (not nrow=N: N is still 10 from Problem 8):
dat <- matrix(NA, nrow=6, ncol=3)
rownames(dat) <- c("Adam", "Anna", "Bill", "Berta", "Chris", "Cindy")
colnames(dat) <- c("Age", "Gender", "Height")
## Several solutions for Age:
dat[,"Age"] <- sample(18:24, size=nrow(dat), replace=T)
dat[,"Age"] <- round(runif(nrow(dat), min=17.5, max=24.5))
dat[,"Age"] <- trunc(runif(nrow(dat), min=18, max=25))
dat[,"Age"] <- ceiling(runif(nrow(dat), min=17, max=24))
## (What is the remote possibility in the last three solutions?)
## Gender is implied by the names, hence not random:
dat[,"Gender"] <- c(0,1,0,1,0,1)
## Alternatively, because the genders alternate in the above list:
dat[,"Gender"] <- c(0,1)
## In general, assumptions such as alternation should not be made, though.
## Two solutions for 'Height':
dat[,"Height"] <- round(rnorm(nrow(dat), m=68,s=5)) - dat[,"Gender"]*4
dat[,"Height"] <- round(rnorm(nrow(dat), m=68-dat[,"Gender"]*4, s=5))
## These are ways to express '4 inches less for females'.
## There is no prescribed sdev, so you pick one, e.g., s=5.
## You must round the numbers, though, because nobody reports
## heights to many digits.


# PROBLEM 10:
# Create an artificial data matrix with 100000 rows and
# columns "Gender", "Age", "Graduated".
# Assume P(female)=0.57 (code 1) and P(male)=0.43 (code 0).
# Assume Age is uniformly distributed over 18:30 (integers).
# Assume Graduated is a binary variable with codes 0/1 (no/yes).
# Assume further the probabilities of having graduated are
# a function of Age as follows:  0,.01,.1,.2,.7,.95,.99,.99,...
# for ages 18 and up.
# (Hint: To fill 'Graduated', create first a column 'prob' that for
#  each individual contains the probability of having graduated
#  dependent on 'Age'.  This is the tricky part.
#  Then set 'Graduated' to runif(100000)