--- title: "Text as Data: Fundamentals" output: html_notebook author: Robert Stine date: July 2017 --- This notebook describes some basics of text modeling in R. In particular, the example highlights building a document-term matrix using the package `tm`. The example concludes by finding words associated with low and high values of a numerical variable, in this case, prices of wine. # Setup R I use these packages so frequently I will load them into R up front. Others that are occasionally useful will be loaded as needed. (Unlike `library`, `require` loads a package only if if it was not already present in the working environment.) ```{r} require(tm) require(tidyverse) require(stringr) ``` # Building a document corpus Start by reading the wine data. It is in a CSV file. The function `read_csv` creates a "tidy" data frame that, for example, does not convert text to factors by default. It comes from the `readr` package, part of `tidyverse`. (You will need to download these data to your computer and use the appropriate path.) ```{r} Wine <- read_csv("../data/Wine.csv") # I capitalize the names of data frames dim(Wine) ``` The function `read_csv` gets upset because it expects the alcohol variable to be integer valued, but then it discovers some decimal points. We can get `read_csv` to ignore this problem by telling it that values of the alcohol variable are doubles (or perhaps better fix those data values to remove the decimals). ```{r} Wine <- read_csv("../data/Wine.csv", col_types=cols(alcohol='d')) dim(Wine) ``` ```{r} summary(Wine) ``` Now focus on the column `description` that holds the tasting notes. ```{r} Wine$description[1:4] ``` ```{r} Wine$description[1:5] ``` The prices are a bit skewed. ```{r} Wine %>% ggplot(aes(x=price)) + geom_histogram() + scale_x_log10() ``` And differ slightly between red versus white wines. The comparison is easier to visualize with frequency polygons that are not filled in. ```{r} Wine %>% filter(!is.na(color)) %>% ggplot(aes(x=price, ..density.., color=color)) + geom_freqpoly() + scale_x_log10() ``` As usual with R, there are lots of alternative displays, such as side-by-side boxplots that tell a similar story. (`ggplot2` generates "pretty" graphs, but you might find the syntax overwhelming at first.) ```{r} boxplot(price ~ color, data=Wine, log='y') ``` The next task seems obvious, but has a lasting impact: What's a word? Is a number a word? What about punctuation? Are "subject" and "subjects" different words? `tm` has a collection of tools for taking care of these tasks, but you need to decide which to use. To get started, put the text into a `corpus` object. A corpus is usually created when using `tm` when a collection of documents is read into R. `tm` nicely handles many different types of documents, including PDF and Word files. ```{r} WineCorpus <- Corpus(VectorSource(Wine$description)) WineCorpus ``` A corpus object in `tm` is a decorated list, a list that has been adorned with special attributes when created. That means we can peek into the corpus as if it were a list by referring to the number elements of the list (which are the documents). ```{r} is.list(WineCorpus) ``` ```{r} WineCorpus[[1]] ``` To see the text itself, use the `inspect` command. ```{r} inspect(WineCorpus[[1]]) ``` Now comes the fun: tokenizing the text of the corpus. How should this text be represented as words? There are many choices. (BTW, `removeWords` only removes selected words, not them all!) 
The transformations built into `tm` are listed by `getTransformations`. (By the way, `removeWords` removes only the words you specify, not all of them!)

```{r}
getTransformations()    # defined in tm
```

If something you want to do is not among these -- or if you want finer control -- define your own content transformation by mimicking the following style.

```{r}
# to correct a misspelled word
toCorrect <- content_transformer(function(text, from, to) str_replace_all(text, from, to))

# to convert some pattern of text to a space
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))

# to convert text to lower case
toLower <- content_transformer(function(text) tolower(text))
```

Removing common stop words (such as "the", "a", "an", ...) is often done as well, using the function `removeWords`. Here's a sample of the stopwords that are defined in `tm`. (`tm` has collections of stopwords for other languages, so you have to pick "english" in this case.)

```{r}
length(stopwords('english'))
stopwords('english')[1:10]    # show just the first 10
```

Stemming removes variations produced by plurals or by adding tense to a base verb (it trims off trailing characters such as 's' or 'es'), resulting in a smaller vocabulary.

Typically many of these transformations are applied, often with certain words in mind.

```{r}
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, toCorrect, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/')        # otherwise words run together
WineCorpus <- tm_map(WineCorpus, removePunctuation)     # might not be right (!)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
# WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english"))   # leave for now
# WineCorpus <- tm_map(WineCorpus, removeNumbers)                       # not many around
# WineCorpus <- tm_map(WineCorpus, removeWords, c('yuck'))              # specific word(s)
```

```{r}
inspect(WineCorpus[[1]])
```

# Document term matrix

The key object for our analysis is known as a document-term matrix (or, when transposed, a term-document matrix). It contains the counts of every word type in each document. Each of the 20,508 rows represents a document, and each of the 6,385 columns identifies a word type. Because most of the matrix entries are zeros, it is held in "sparse" format. (Notice that you *cannot* recover the source corpus from the document-term matrix. This matrix represents each document as a "bag of words".)

```{r}
dtm <- DocumentTermMatrix(WineCorpus)
dim(dtm)
```

We can start to do statistics now -- counting. For example, all but 545,707 elements of the $20,508 \times 5,641 = 115,685,628 \approx 116$ million counts in the document-term matrix are zero. (This is the count if you have not removed the stopwords; the number of types is smaller with the stopwords removed.)

```{r}
dtm
```

It is now simple to use matrix functions from R to find the number of words in each document and the number of times each type appears (albeit at the cost of converting the sparse matrix into a dense matrix in order to use `rowSums` and `colSums`).

```{r}
ni <- rowSums(as.matrix(dtm))    # tokens in each document
mj <- colSums(as.matrix(dtm))    # columns are named by the word types; frequency of each
```

Check a few of the terms to make sure that the data appear okay. If you don't spend time getting the data ready, you will find lots of issues.

```{r}
j <- which.max(str_length(names(mj)))
j
names(mj)[j]
```
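Before digging into that longest type, it can also help to scan several of the longest types at once (my addition, not in the original notebook): run-together words tend to be long.

```{r}
# Show a few of the longest word types; very long "types" often signal words
# that were accidentally joined together.
names(mj)[order(str_length(names(mj)), decreasing=TRUE)[1:5]]
```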
To see whether this longest type is real text, you have to find the one document that contains it. The count for this type shows that it appears in just one document.

```{r}
mj[j]
```

```{r}
which(0 != as.vector(dtm[,4149]))    # find the document containing this type
```

Here's the relevant portion of the original source: "Creme brulee, blackberry,peppercorn, and mocha aromas. A soft, silky..." There's no space after the comma between "blackberry" and "peppercorn", so `tm` has collapsed the two words together. We could fix this by adding the comma to the list of punctuation that gets turned into spaces rather than simply removed.

It is hard to imagine a distribution of counts that is more skewed than the counts of the word types (left).

```{r}
par(mfrow=c(1,2))
hist(mj, breaks=50, main="Counts of Word Types")
hist(ni, breaks=50, main="Words per Document")
```

Even after taking logs, the counts remain skewed! This is common in text: "Tokens are common, but types are rare."

```{r}
hist(log(mj), breaks=50, main="Counts of Word Types")
```

The frequency counts in `mj` are named and in alphabetical order. We can use these names to produce a bar graph *of the most common words* with `ggplot`. Stopwords such as "and" and "this" are common.

```{r}
Freq <- data_frame(type = names(mj), count = mj)    # ggplot and dplyr want data frames
Freq %>%
    top_n(25, count) %>%
    mutate(type=reorder(type, count)) %>%           # rather than alphabetical order
    ggplot(aes(type, count)) + geom_col() + coord_flip()
```

Let's see what happens without the stop words. (The following code *should* run, but dies a horrible death! I suspect R ran out of memory along the way. Instead, to remove the stopwords, use `tm_map` as shown above.)

```{r}
Freq %>%
    filter(!type %in% stopwords('english')) %>%     # be careful about the syntax here or it will crash!
    top_n(25, count) %>%
    mutate(type=reorder(type, count)) %>%           # rather than alphabetical order
    ggplot(aes(type, count)) + geom_col() + coord_flip()
```

This is a good chance to check whether the frequencies of the word types match a Zipf distribution, which is commonly associated with text. A Zipf distribution is characterized by a power law: the frequency of word types is inversely proportional to rank, $f_k \propto 1/k$. Said differently, the frequency of the second most common word is half that of the most common, the frequency of the third is one-third that of the most common, and so on. A little algebra shows that for this to occur, $\log f_k \approx b_0 - \log k$. That is, a plot of the log of the frequencies should be linear in the log of the rank $k$, with slope near $-1$.

```{r}
Freq %>%
    arrange(desc(count)) %>%        # decreasing by count
    mutate(rank=row_number()) %>%   # add row number
    ggplot(aes(x=log(rank), y=log(count))) + geom_point() +
    # geom_smooth(method='lm', se=FALSE) +
    geom_abline(slope=-1, intercept=11, color='red')
```

The least squares slope (commented out above; it plots in blue if you uncomment it) is steeper, being dominated by the many less common words. You can mitigate that effect by weighting the regression by the counts.

```{r}
Temp <- Freq %>%
    arrange(desc(count)) %>%    # rank by frequency, not alphabetical order
    mutate(rank=row_number())
lm(log(count) ~ log(rank), data=Temp, weights=sqrt(count))
```

# Words and prices

Before concluding this short introduction, let's relate the word types to prices. Which word types are associated with pricey wines, and which with cheaper wines? To find out, combine the information in the document-term matrix with prices from the `Wine` data frame. First fix that weirdo price found using JMP earlier.

```{r}
max(Wine$price, na.rm=TRUE)    # missing values are contagious in R
i <- which.max(Wine$price)     # handles the NA by default
i
Wine$price[i] <- NA
max(Wine$price, na.rm=TRUE)
```

To keep things manageable, consider words that appear, say, at least 10 times in the corpus.
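A quick count (my addition, assuming the threshold is applied to the type frequencies `mj` computed above) shows how many word types clear that bar.

```{r}
# How many word types appear at least 10 times?
sum(9 < mj)
```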
That's still 1,776 types. The matrix `counts` has the counts for each of these word types. We can find the words with the highest average price by turning these counts into 0/1 indicators and using a matrix product.

```{r}
counts <- as.matrix(dtm[, 9 < mj])    # keep the types that appear at least 10 times
```
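Here is a minimal sketch of that computation (my own code, assuming the `counts` matrix defined above and skipping wines with missing prices): the matrix product of the transposed 0/1 indicators with the price vector gives, for each word type, the total price of the wines whose descriptions mention it; dividing by the number of such wines gives the average.

```{r}
# Sketch: average price of the wines whose tasting notes contain each word type.
indicator <- 1 * (0 < counts)                   # 0/1: does the type appear in the document?
ok <- !is.na(Wine$price)                        # skip wines with missing prices
avgPrice <- (t(indicator[ok, ]) %*% Wine$price[ok]) / colSums(indicator[ok, ])
head(sort(avgPrice[, 1], decreasing=TRUE), 10)  # types associated with the priciest wines
head(sort(avgPrice[, 1]), 10)                   # ... and with the cheapest
```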