############### HOMEWORK 3, STAT 541, DUE MON, SEPT 17, 2007 ###############

# This homework is an exercise in manipulating character data.
# The dataset is the dictionary of the English language borrowed
# from Microsoft's spelling correction database.

# Download the dataset 'dict.dat' from the class webpage
# and read it into a vector 'dict':

dict <- scan("dict.dat", what="", quo="")

# (The named argument 'quo=...', short for 'quote=...', makes sure that
# quotes in the dictionary do not signal the start of a string.)

# Hint: You will find the following functions useful:
#   grep  nchar  paste  sample  sort  strsplit  substring  table  unlist


#PROBLEM 0: sanity check
# How many 'words' are in the dictionary?
#ANSWER:


#PROBLEM 1: data cleaning
# A dictionary is supposed to contain each word only once.
# We should never take obvious assumptions for granted but actually
# check them.  Tasks:
# a) Check whether there are words that appear more than once.
# b) If so, remove the extra copies from 'dict' and check the length again.
#ANSWER:


#PROBLEM 2:
# What are the 'words' that contain a single quote?
#ANSWER:


#PROBLEM 3: What is the longest word length in the dictionary?
#ANSWER:


#PROBLEM 4: Which are the words of maximal length and how many are there?
# Do you recognize them?  Why would Microsoft include such words?
#ANSWER:


#PROBLEM 5: What does the following do?

sum(nchar(dict) == 25)

#ANSWER:


#PROBLEM 6: Generate a frequency table that tells how many words
# there are of each length.  How many words are there of length 29?
#ANSWER:


#PROBLEM 7: Write a for-loop that prints, for each length from 1 to 20,
# one randomly picked word of that length.
# Don't use 'print()'; use 'cat()' instead.
#ANSWER:


#PROBLEM 8:
# We are now going to find all single-word anagrams of the English
# language.  An anagram is a pair of words or phrases that use the
# same set of characters (with repetitions); in other words, they
# are character permutations of each other.
# For example, "emanates" and "manatees" form an anagram because
# they have the same set of letters: "aaeemnst".
# In this problem we take the first step towards the goal of finding
# all single-word anagrams, by forming a new vector 'dict.srt' of the
# same length as 'dict' that contains for each word in 'dict' the
# sorted list of its letters.  That is, the entry 'manatees' in 'dict'
# would have the corresponding entry 'aaeemnst' in 'dict.srt'.  The
# point of the vector is that if two words in 'dict' have the same
# entries in 'dict.srt', they are anagrams of each other.
# A good approach is to pick any single entry in 'dict' and develop
# the necessary code for sorting the letters in it.  Once it works,
# simply loop over all entries of 'dict' and store the processed
# result in the corresponding entry of 'dict.srt'.  Also, print a
# message every 10000 words so you know the loop hasn't stalled.
# Note: R doesn't have a function that sorts the characters within a
# string, but it has the 'sort()' function, which can sort not only
# number vectors numerically but string vectors lexicographically as well.
#ANSWER:


#PROBLEM 9:
# Find the ten largest anagram sets, that is, the character sets from
# which the most anagrams can be formed.
# To go about the problem, create a table that tells for each
# character set how many words in 'dict' are formed from it.
# For example, the character set "cdeir" has 4 words from which to
# form anagrams (they happen to be "cider" "cried" "dicer" "riced");
# the table should have a count of 4 associated with "cdeir".
# Sort this table in decreasing order and call the result 'dict.srt.tab'.
# Peeling off the first 10 elements gives you the 10 character sets from
# which the most anagrams can be formed.  Find a way to loop over these
# 10 character sets and list the anagram words for each.
#ANSWER:


#PROBLEM 10: Continuation of PROBLEMs 8 and 9.
# What does the following code generate?

dict.srt.siz <- dict.srt.tab[dict.srt]

#ANSWER:


#PROBLEM 11:
# How many words are there that have no anagram, 1 anagram,
# 2 anagrams, 3 anagrams,...?
#ANSWER:


#PROBLEM 12:
# What does the output of the following code tell us?

table(dict.srt.tab)

# How does this output relate to the output of Problem 11?
#ANSWER:


#PROBLEM 13: This is unconnected to the anagram exercise above.
# It is a simple exercise in building a dataframe that contains
# various characteristics of each word in 'dict'.
# Form a dataframe 'dict.dfr' with as many rows as there are elements
# in 'dict' and columns that contain the following:
#   col 1: the length of the word (numeric)
#   col 2: whether the word starts with a capital letter (logical)
#   col 3: whether the word starts with a lower case vowel
#          ("a","e","i","y","o","u") (logical)
#   col 4: whether the word ends with the letter "s" (logical)
# Hint: Check out the binary operation '%in%'.
# Give the columns the names 'nchar', 'cap', 'l.vow', 'end.s'.
# Turn 'dict.dfr' into an associative data structure by assigning 'dict'
# as rownames, so that the following retrieves the above variables
# for each of the listed words:

dict.dfr[c("and","Andy","sets","dichlorodiphenyltrichloroethane"),]

#ANSWER:
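

# EXAMPLE (a sketch on made-up toy data, not the answer to Problem 13):
# The hint about '%in%' and the idea of using rownames as lookup keys
# might be illustrated as follows.  The names 'toy', 'toy.first',
# 'toy.dfr', 'len' and 'in.ab' are hypothetical and are not part of
# this homework; the column definitions are chosen only for illustration.

toy <- c("apple", "Bob", "kiwis")            # three made-up 'words'
toy.first <- substring(toy, 1, 1)            # first character of each word
toy.dfr <- data.frame(
    len   = nchar(toy),                      # number of characters per word
    in.ab = toy.first %in% c("a","b"))       # is the first character "a" or "b"?
rownames(toy.dfr) <- toy                     # the words become the row labels
toy.dfr[c("kiwis","apple"),]                 # retrieve rows by word, in this order

# Note that '%in%' compares each element on its left with the whole set
# on its right and returns one logical value per element, which is why
# it can be used directly as a column of the dataframe.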