---
title: "Text as Data: Vector Space Models"
output: html_notebook
author: Robert Stine
date: July 2017
---

This notebook illustrates the use of vector space methods in R. These manipulate the document-term matrix and can be used to find *word embeddings*.

# Setup R

The methods in this notebook add another package to the standard list.

```{r}
require(tm)
require(wordcloud)
require(stringr)          # not in the tidyverse
require(tidyverse)

source("text_utils.R")    # from web page
```

# Prepare document-term matrix

Read the wine data from its CSV file. Rather than do this every time, it is generally a good idea to save the "processed" data in a ".sav" file.

```{r}
Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
```

```{r}
WineCorpus <- Corpus(VectorSource(Wine$description))

replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))

WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.')   # otherwise words run together; dot is special in a regex
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
# WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english"))   # leave for now
```

Now compute the document-term matrix and the row `ni` and column `mj` marginal counts. The DTM is a little smaller here -- 5,488 types -- than in the first slides because of handling the comma differently. We will be making it smaller still.

```{r}
dtm <- DocumentTermMatrix(WineCorpus)
dtm

ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj)     # for convenience and clarity
```

As usual, check the name of the longest type for possible errors. This one is okay.

```{r}
word.types[j <- which.max(str_length(word.types))]
```

The corpus consists of 607,335 tokens.

```{r}
sum(as.matrix(dtm))
sum(mj)
sum(ni)
```

Many of the types are rare, represented by only one or two tokens.

```{r}
sum(mj==1)
sum(mj==2)
sum(mj==3)
```

```{r}
sum(mj[3 < mj])    # tokens from types that appear more than three times
```

Here is a bar chart of the 25 most common types.

```{r}
tibble(word = word.types, frequency = mj) %>%
    top_n(25, frequency) %>%
    mutate(word = reorder(word, frequency)) %>%
    ggplot(aes(word, frequency)) + geom_col() + coord_flip()
```

You can also draw word clouds to summarize the most common types; eye candy can be useful to attract attention (though it makes it difficult to compare the frequencies... quick, which is the 5th most common word?). Don't try to show too many words. Removing stop words would be very useful in this case.

```{r}
require(wordcloud)

set.seed(133)    # random locations; fix the seed to be able to reproduce
wordcloud(names(mj), mj, max.words=50)
```

The function `zipf_plot` from the helper file ${\tt text\_utils.R}$ shows the Zipf plot. By default, it fits a least squares line to the first 250 frequencies.

```{r}
zipf_plot(mj)
```

# Handling rare words

Spectral methods rely on word co-occurrences within some context. The context might be defined by adjacency (bigrams, n-grams) or, in this case, by appearing in the same wine review.
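To make "co-occurrence" concrete, here is a small sketch (my addition; it uses only `dtm` and `mj` from above, and the names `jj` and `X` are just for illustration). Treating each review as the context, the cross-product of a 0/1 indicator matrix counts how often two types appear in the same review.

```{r}
# Sketch: same-document co-occurrence counts for the six most common types.
jj <- order(mj, decreasing=TRUE)[1:6]    # columns of the six most common types
X  <- 1 * (as.matrix(dtm[, jj]) > 0)     # 1 if the type appears in the review, else 0
crossprod(X)                             # t(X) %*% X: reviews containing both types
```

With stop words still in the vocabulary, the most common types are likely function words, but the same calculation applies to any set of columns of the DTM.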
To make it simpler to separate the common types from the rare types, sort the DTM by the type frequencies. This calculation uses the function `order`. Here's an example: `order` returns the indices that will sort an object.

```{r}
x <- c('d','f','a','c','b')
o <- order(x); o
x[o]
```

Now apply `order` to the frequencies of the word types.

```{r}
o <- order(mj, decreasing=TRUE)    # biggest to smallest
dtm <- dtm[,o]                     # permute the columns
mj  <- mj[o]
```

Now the first types are the most common types.

```{r}
mj[1:10]
```

Here are some of the least common types. If you explore these, you will discover a mixture of interesting words (given that these are wine reviews) along with some junk (such as misspelled words, typos, numbers, and dates).

```{r}
mj[length(mj)-9:0]
```

We are not going to learn much about the usage of words that appear so seldom, so set these all to the symbol OOV, short for out-of-vocabulary. We don't need to do that in the text itself, just in the document-term matrix. I will keep the types that appear at least 10 times (as JMP used). That reduces the matrix to 1,742 types. (BTW, `tm` includes the function `removeSparseTerms` that will wipe out the rare terms from the DTM. I don't want to wipe them out; I want to consolidate them.)

```{r}
dtm.oov <- dtm[,10 <= mj]
dtm.oov
```

Now, what to do about the OOVs? Maybe having a lot of them tells us something about the other words. I'll append a column that counts the number of these in each document. Doing so turns the document-term matrix into a regular numerical matrix, but that's okay -- we need such a matrix for computing the SVD.

```{r}
dtm.oov <- cbind(as.matrix(dtm.oov), rowSums(as.matrix(dtm[,mj < 10])))
dim(dtm.oov)
```

```{r}
names.oov <- c(names(mj[10<=mj]), 'OOV')
mj.oov    <- c(mj[10<=mj], sum(mj[mj<10]))
ni.oov    <- ni    # the same as it was

colnames(dtm.oov) <- names.oov
names(mj.oov)     <- names.oov
```

Now check that we have not lost any tokens.

```{r}
sum(dtm.oov)
sum(mj.oov)
sum(ni.oov)
```
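As a quick check of how much the consolidation matters, this small sketch (my addition; it uses only `mj` and `mj.oov` from above) computes the share of the original types that were folded into OOV and the share of all tokens those rare types account for.

```{r}
# Sketch: how much did the OOV consolidation matter?
sum(mj < 10) / length(mj)       # share of the original types folded into OOV
mj.oov["OOV"] / sum(mj.oov)     # share of all tokens now labeled OOV
```

Most of the types are rare, but since each appears fewer than 10 times they can account for only a small fraction of the 607,335 tokens -- the usual Zipf story.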
# Singular value decomposition and latent semantic analysis

We will need a regular R matrix in the following calculations, so convert the special sparse matrix of counts into a regular R matrix using `as.matrix`. The matrix `dtm.oov` is already in the needed form.

The SVD and latent semantic analysis rely on associations of words in documents: which words appear together in the same document. We can look at correlations individually to see what sort of information lies in these associations. We can explore the associations between word types with the `tm` function `findAssocs`.

```{r}
findAssocs(dtm, 'zinfandel', corlimit=0.1)
```

```{r}
findAssocs(dtm, 'pinot', corlimit=0.1)
```

Paso Robles is a place, so the two words "paso" and "robles" always occur together. Hence, the correlation of these types is 1.

```{r}
findAssocs(dtm, 'paso', corlimit=0.1)
```

Before computing the singular value decomposition (SVD), it is a good idea to do some normalization. The "correct" normalization is rather time consuming (requiring the *inverse* of a large matrix, which is tough to do in big-data situations), but the following approximation works nicely in practice.

The calculation of the SVD itself is also rather slow. It's a simple function call in R, but one that takes a very long time to complete. In the interest of time, we'll compute the SVD using a subset of 3,000 documents. In a bit, I will show you a magic trick that scales better.

```{r}
set.seed(2373)    # so we can reproduce the sample
i.svd   <- sample(nrow(dtm.oov), 3000)
dtm.svd <- dtm.oov[i.svd,]

ni.svd <- rowSums(dtm.svd)            # number of words in a document, its length
mj.svd <- pmax(1, colSums(dtm.svd))   # frequency of word type in vocabulary (avoid 0 divisor)
```

The minimum frequency in the original DTM is 10.

```{r}
min(mj.oov)
```

After sampling, however, it sometimes is zero. (That's why `pmax` is used above: we don't want a zero divisor.)

```{r}
min(mj.svd)    # even though we started with at least 10
```

This normalization is *almost* a standardization to correlations... but a lot faster.

```{r}
dtm.svd <- dtm.svd/sqrt(ni.svd)           # divide each row by the square root of its length
dtm.svd <- t( t(dtm.svd)/sqrt(mj.svd) )   # and each column by the square root of its type frequency
```

Now we can compute the SVD of the scaled matrix of counts, $C_{ij}/\sqrt{n_i m_j}$.

```{r}
udv <- svd(dtm.svd)    # returns u, d, v
names(udv)

length(udv$d)    # singular values
udv$d[1:4]       # normalization implies first = 1
```

Remember to combine the variables into a data frame in order to use `ggplot`.

```{r}
tibble(i=1:50, d=udv$d[1:50]) %>%
    ggplot(aes(i,d)) + geom_point() + scale_x_log10() +
    labs(title="Spectrum with CCA Scaling", x="Dimension", y="Singular Value")
```

The rows of the $U$ matrix identify documents, so think of these as "new variables" that describe the documents. The rows of $V$ identify word types and so provide an "interpretation" of the $U_j$ columns. The elements of $U$ sometimes reveal clusters of related documents, whereas the elements of $V$ reveal the components of these new variables.

As in regular principal components analysis, the first component in these data captures the number of words (which happens to be an important variable in the wine analysis).

```{r}
plot(ni[i.svd], udv$u[,1], xlab="Number of Tokens", ylab=expression("U"[1]))
```

In this example, clusters are easy to see. I don't like the various alternatives to `pairs` offered by the Tidy collection, so it's back to regular R graphics. (Why not? Suppose I don't want the colors shown in the matrix of plots, or I'd like to use expressions to label the variables.)

```{r}
set.seed(234)
ii <- sample(nrow(udv$u), 500)    # fewer points
pairs(udv$u[ii,2:5],
      labels=c(expression("U"[2]), expression("U"[3]),    # subscripts in plot labels
               expression("U"[4]), expression("U"[5])) )
```

It would be neat if those clusters were associated with the colors of the wines. The following plot shows that is not the case.

```{r}
color <- tolower(Wine$color[i.svd][ii])
color[color=='white'] <- 'gold'    # white does not show up well!
pairs(udv$u[ii,2:5], col=color,
      labels=c(expression("U"[2]), expression("U"[3]),    # subscripts in plot labels
               expression("U"[4]), expression("U"[5])) )
```

Although the obvious clusters defined by these new variables do not correspond to the wine color, it is easy to see how we can use $U_4$ and perhaps $U_5$ to separate red from white wines. Rather than speculate, we can simply fit a logistic regression.

```{r}
j <- 1:10
U <- udv$u[,j]
y <- ifelse(Wine$color[i.svd] == 'Red', 1, 0)
lregr <- glm(y ~ U, family=binomial, na.action=na.exclude)
summary(lregr)
```

These predictions separate the wine types nicely.

```{r}
data_frame(fit=fitted.values(lregr), color=Wine$color[i.svd]) %>%
    ggplot(aes(x=fit, color=color)) + geom_freqpoly()
```

But what are these variables that so clearly separate the wines? Once you see the names of the components of $V_4$ it becomes clear why this variable works to separate the two types of wines. First attach names. Notice that the names of the *rows* of $V$ match the names of the columns of $C$, the document-term matrix.

```{r}
V <- udv$v[,1:25]
rownames(V) <- colnames(dtm.svd)
```
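Before plotting, it can help to simply list the types with the largest loadings. This short chunk is my addition (it uses only the matrix `V` built above); since `sort` keeps the row names, the output shows which words dominate each end of $V_4$.

```{r}
# Sketch: the types with the most extreme loadings on V4.
head(sort(V[,4]), 10)                     # most negative loadings
head(sort(V[,4], decreasing=TRUE), 10)    # most positive loadings
```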
Now draw a plot of the "loadings" (as they are called in factor analysis).

```{r}
plot(V[,4], V[,5], xlab=expression("V"[4]), ylab=expression("V"[5]), col='gray')
label <- 0.05 < sqrt(V[,4]^2 + V[,5]^2)    # pick those far from the origin
text(V[label,4], V[label,5], rownames(V)[label], cex=0.8, srt=45)
```

The function `plot_loadings` from the collection of R scripts in the file $\tt text\_utils.R$ automates this task. For example, there's little evident reason to think that $V_2$ and $V_3$ would help distinguish the wine's color.

```{r}
plot_loadings(V, 2, 3, threshold=0.05, cex=0.8)
```

Before looking at scaling these methods up to larger amounts of data, what are those obvious clusters in the data? They are not related to color, and, skimming the key words, they don't seem related to the variety either (e.g., cabernet versus zinfandel). This plot reveals the answer. Gee, what do you think happened?

```{r}
plot(i.svd, U[,2], xlab="Document Position", ylab=expression("U"[2]))
```

Here are several descriptions from the "early" phase of the data.

```{r}
Wine$description[i.svd[U[,2] < -0.025]][1:5]
```

And some of those from the other period. Do you notice any differences in the "style" of the reviews?

```{r}
Wine$description[i.svd[U[,2] > 0.025]][1:5]
```

# Random projection

We could use these variables to predict the colors of the other wines, but it would be nice to be able to compute the SVD for all of the data rather than just these 3,000. We can approximate the SVD of the entire document-term matrix using the method known as *random projection*. The key to the method is that if you only want, say, 300 singular vectors, you can find them via random matrix multiplications.

The function `random_projection_svd` (also in the $\tt text\_utils.R$ file) implements this algorithm. To try to convince you it's not magic, let's compute the SVD via random projection and compare the results to what we get from `svd`. Random projection is quite a bit faster than the built-in function. The more power iterations, the better, particularly for the terms associated with smaller singular values.

```{r}
set.seed(4843)    # results depend on the seed
rp_udv <- random_projection_svd(dtm.svd, d=50, power.iter=3, verbose=TRUE)
```

Like `svd`, this function returns a named list of results.

```{r}
names(rp_udv)
```

The larger singular values almost match those from the "exact" calculation.

```{r}
plot(rp_udv$d, udv$d[1:50], log='xy', main="Singular Values",
     xlab="Random Projection", ylab="Exact")
abline(a=0, b=1)
```

As do the principal components.

```{r}
plot(rp_udv$u[,4], udv$u[,4], main="Principal Component",
     xlab="Random Projection", ylab="Exact")
abline(a=0, b=1)
```

And the loadings.

```{r}
plot(rp_udv$v[,4], udv$v[,4], main="Loadings",
     xlab="Random Projection", ylab="Exact")
abline(a=0, b=1)
```

To get the full decomposition, use `dtm.oov` rather than the sampled version (`dtm.svd`).

```{r}
set.seed(4843)    # results depend on the seed
rp_udv <- random_projection_svd(dtm.oov, d=50, power.iter=3, verbose=TRUE)
```
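With singular vectors for every document in hand, we can revisit the color prediction from above. The following chunk is a sketch of that follow-up (my addition, not part of the original analysis); it assumes the rows of `rp_udv$u` line up with the rows of `Wine`, which they should since the corpus was built from `Wine$description`, and the names `U.all`, `y.all`, and `lregr.all` are just for illustration.

```{r}
# Sketch: predict the color of every wine from the full-data singular vectors.
# Assumes one row of rp_udv$u per row of Wine; tolower() guards against
# 'Red' versus 'red' coding in Wine$color.
U.all <- rp_udv$u[, 1:10]
y.all <- ifelse(tolower(Wine$color) == 'red', 1, 0)

lregr.all <- glm(y.all ~ U.all, family=binomial, na.action=na.exclude)
summary(lregr.all)
```

If this fit resembles the one from the 3,000-document sample, the random projection has captured the same structure in a fraction of the time.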