--- title: "Text as Data Case Study: Presidential Inauguration Addresses" output: html_notebook author: Robert Stine date: July 2017 --- Possible questions of interest: * Trends over time: word length, sentences, paragraphs, or special words like "America". * Vocabulary size, average sentence length * Topics over time, such as religion or patriotism # R setup ```{r} require(tm) require(topicmodels) require(stringr) require(tidyverse) require(lubridate) # date handling source("text_utils.R") ``` # US Presidential Inauguration Addresses There have been 58 presidential inauguration addresses. These are available on the web site in the tar file ${\tt inauguration.tgz}$. Quite often in text analytics, each document will be kept in a separate file. You will need to join these files together to build an analysis. A quick peek at one of these text files shows they are in the following simple format: * First line: last name of president * Second line: city where address was given * Third line: date, as in April 30, 1789 Following these three lines, each of the following lines indicates a paragraph, which may be composed of one or more sentences. Blank lines separate the paragraphs. ```{r} path <- "~/data/text/inauguration" # change to suit your computer; download from web site files <- list.files(path) files ``` ```{r} first_address <- read_lines(file.path(path,"01_wash.txt")) # file.path joins path elements ``` ```{r} length(first_address) # lines ``` ```{r} first_address[1:3] ``` ```{r} first_address[4:6] ``` # Building a corpus (with metadata) Remove the blank lines. A small, utility function will be handy. ```{r} remove_blank_lines <- function(text) { n <- unname(sapply(text,str_count)) return(text[0'), nrow)) ``` ```{r} Addresses$text[1] ``` ```{r eval=FALSE} View(Addresses) ``` Plots are prettier now -- and perhaps more informative. ```{r} Addresses %>% mutate(`characters per paragraph`=n.char/n.paragraph) %>% # `` allows improper R name ggplot(aes(x=date, y=`characters per paragraph`)) + geom_line() + geom_point() ``` # Word frequencies and sentiments Build a document term matrix from the text of the addresses. The steps are the same as those used in the example of wine tasting notes. *1* Convert the text into a `tm` corpus *2* Tokenize the text by combining `tm_map` with `content_transformers` *3* Build the document-term matrix using `tm`'s function Several of these content transformers rely on regular expressions. * Replace dollar amounts by the symbol "". This will track mentions of money, without distinguising the amounts as separate types. * Convert years (a four digit number starting with 17, 18, 19, or 20) into one symbol. This retains the notion of a date, but without introducing distinct types. * Replace other numbers by the symbol "" (rather than removing them all). The order of these transformations matters. You would not find a dollar amount if you got rid of the numbers first or if you removed punctuation first (which would knock out the "$" sign). These transformations use regular expressions, as illustrated below. 
These transformations use regular expressions, as illustrated below. (The "$" must be escaped in the pattern because an unescaped "$" matches the end of a line.)

```{r}
str_replace("will cost $41,000,000 that ", "\\$[1-9][,0-9]*", " dollaramount ")   # "$" otherwise is EOL
str_replace("the year 1789 in which we ", "[1-9][0-9]{3}", " yearnumber ")
str_replace_all("one 1 two 33 555 digit", "[1-9][,0-9]*", " othernumber ")
```

```{r}
AddressCorpus <- Corpus(VectorSource(Addresses$text))

replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))

AddressCorpus <- tm_map(AddressCorpus, toLower)
AddressCorpus <- tm_map(AddressCorpus, toSpace, '-|/')
AddressCorpus <- tm_map(AddressCorpus, replace, "\\$[1-9][,0-9]*", " dollaramount ")
AddressCorpus <- tm_map(AddressCorpus, replace, "[1-9][0-9]{3}", " yearnumber ")
AddressCorpus <- tm_map(AddressCorpus, replace, "[1-9][,0-9]*", " othernumber ")
AddressCorpus <- tm_map(AddressCorpus, removePunctuation)
AddressCorpus <- tm_map(AddressCorpus, removeWords, stopwords("english"))
AddressCorpus <- tm_map(AddressCorpus, stripWhitespace)
```

```{r}
inspect(AddressCorpus[[1]])
```

The document-term matrix is not nearly so sparse as the one derived from the wine tasting notes: fully 8% of the counts are *not* zero. The vocabulary is considerably larger, however. (The wine data had about 5,600 types.)

```{r}
dtm <- DocumentTermMatrix(AddressCorpus)
dtm
```

As in other examples, it is useful to have the counts of tokens in each document and the counts of each type over the collection.

```{r}
ni <- rowSums(as.matrix(dtm))   # tokens per document
mj <- colSums(as.matrix(dtm))   # count of each type
```

Out of curiosity, the longest term is (now that I fixed an error in the source text!)

```{r}
j <- which.max(str_length(names(mj)))
j
names(mj)[j]
```

# Exploring the vocabulary

This plot shows the most common word types.

```{r}
Freq <- tibble(type = names(mj), count = mj)
Freq %>%
  top_n(25, count) %>%
  mutate(type = reorder(type, count)) %>%   # rather than alphabetical order
  ggplot(aes(type, count)) + geom_col() + coord_flip()
```

Does this distribution change over time? For example, here are the distributions from around the start of the 21st century compared to those from a similar window 100 years earlier. To get these, notice that we have a DTM for the whole collection; we just need to pull out the counts for the appropriate periods using the dates held in the metadata of the `Addresses` data frame. There are fancier ways to do this, but this approach is easy to check.

```{r}
i.21 <- which(mdy("1/1/1990") < Addresses$date)   # lubridate
i.21
mj.21 <- colSums(as.matrix(dtm[i.21,]))

i.20 <- which((mdy("1/1/1890") < Addresses$date) & (Addresses$date < mdy("1/1/1920")))
i.20
mj.20 <- colSums(as.matrix(dtm[i.20,]))
```

Make a utility function for things you do frequently. (The number of types shown is an argument, with 25 as the default.)

```{r}
show_top_types <- function(named.counts, n = 25) {
  tibble(type = names(named.counts), count = named.counts) %>%
  top_n(n, count) %>%
  mutate(type = reorder(type, count)) %>%   # rather than alphabetical order
  ggplot(aes(type, count)) + geom_col() + coord_flip()
}
```

```{r}
show_top_types(mj.20)
```

```{r}
show_top_types(mj.21)
```

Plots of the two sets of frequencies are interesting.

```{r}
tibble(c21 = mj.21, c20 = mj.20, types = names(mj.20)) %>%
  ggplot(aes(c20, c21)) +
  geom_point(alpha = 0.2) +
  geom_abline(color = "gray40", lty = 2) +
  geom_text(aes(label = types), check_overlap = TRUE, vjust = 1.5)
```
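Word frequencies like these are heavily skewed, so the comparison may be easier to read on log scales (a sketch; adding 1 to each count keeps types that are absent from one of the two periods on the plot).

```{r}
tibble(c21 = mj.21, c20 = mj.20, types = names(mj.20)) %>%
  ggplot(aes(c20 + 1, c21 + 1)) +
  geom_point(alpha = 0.2) +
  geom_abline(color = "gray40", lty = 2) +
  scale_x_log10() + scale_y_log10() +          # log-log puts the rare types in view
  geom_text(aes(label = types), check_overlap = TRUE, vjust = 1.5)
```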
How often does "America" (or "American") show up over time? Is there a trend?

```{r}
american.words <- c('american', 'americans', 'america')
n.amer <- rowSums(as.matrix(dtm[, american.words]))
plot(Addresses$date, n.amer)
```

# LSA and topic models

That's a lot of types and not many documents; this is quite a small data set for these methods.

```{r}
dim(dtm)
```

Remove the words that are rare (rather than replacing them with an out-of-vocabulary symbol, as in other applications). I will limit the analysis to words that appear in more than 5 addresses.

```{r}
n.addr <- colSums(0 < as.matrix(dtm))   # number of addresses that use each type
```

```{r}
dtm.rs <- as.matrix(dtm[, 5 < n.addr])
dim(dtm.rs)
```

Add the scaling: divide each count by the square roots of its row and column totals.

```{r}
mj.rs <- colSums(dtm.rs)
ni.rs <- rowSums(dtm.rs)

dtm.lsa <- dtm.rs / sqrt(ni.rs)
dtm.lsa <- t( t(dtm.lsa) / sqrt(mj.rs) )

udv <- svd(dtm.lsa)
```

Again, there is not a clear cutoff in the spectrum.

```{r}
plot(udv$d, log = 'x')
```

```{r}
row.names(udv$v) <- names(mj.rs)
plot_loadings(udv$v, 2, 3, threshold = 0.05, cex = 0.7)
```

How about topic models? This looks promising.

```{r}
n.topics <- 5
lda <- LDA(dtm.rs, n.topics, control = list(seed = 1234))
attributes(lda)$alpha
```

As in the wine example, common words that appear in most addresses show up in every topic.

```{r}
terms(lda, 15)
```

And the composition over time is interesting. The summary provided by `topics` is a bit too vague, just giving the rank order of the topics within each address: topics 3 and 5 were "popular" in the early addresses; now it's topic 4.

```{r}
t(topics(lda, 5))
```

```{r}
theta <- attributes(lda)$gamma
dim(theta)
```

The numerical proportions are more interesting.

```{r}
round(theta, 2)
```

```{r}
tibble(one = theta[,1], two = theta[,2], three = theta[,3],
       four = theta[,4], five = theta[,5], date = Addresses$date) %>%
  gather(one, two, three, four, five, key = "Topic", value = "Proportion") %>%
  ggplot(aes(date, Proportion)) +
  geom_point(aes(color = Topic, alpha = Proportion))
```
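The trends are easier to interpret if each topic is labeled by its most probable terms rather than by a number. (A sketch reusing `lda`, `theta`, and `Addresses` from above; the smoother is just a visual guide.)

```{r}
# Sketch: name each topic by its three most probable terms.
topic.names <- apply(terms(lda, 3), 2, str_c, collapse = "/")
colnames(theta) <- topic.names

as_tibble(theta) %>%
  mutate(date = Addresses$date) %>%
  gather(key = "Topic", value = "Proportion", -date) %>%
  ggplot(aes(date, Proportion, color = Topic)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE, span = 1)   # loess trend for each labeled topic
```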