--- title: "Text as Data: Topic Models" output: html_notebook author: Robert Stine date: July 2017 --- # Setup R The methods in this notebook require the `topicmodels` package in addition to the standard list. (A paper accompanied the release of this package; see Grun and Hornik (2011) topicmodels: An R Package for Fitting Topic Models, J of Statistical Software, 40) ```{r} require(tm) require(topicmodels) require(stringr) require(tidyverse) source("text_utils.R") ``` # Simulation You can learn a lot about what a topic model does by using it to simulate text, which we can then study using methods covered previously. Here's a compact summary of the algorithm Given α and K topics, generate each document in the simulated corpus as follows... 1. Draw topic proportions θ | α ∼ Dir(α) for the document These are probabilities for sampling the words in the next step. 2. For each word w_i in the document (a) Draw topic assignment z_i | θ ∼ Mult(θ) z_i indicates topic (b) Draw word w_i|z_i,β_{1:K} ∼ Mult(β_{z_i}) β_{z_i} is topic dist To illustrate the role of the parameter alpha ($\alpha$) and how it affects the probabilities $\theta$ for a document, the function `rdirichlet` (from $\tt text\_utils.R$) simulates these draws. Each draw is a discrete probability distribution over the number of categories. The function has two arguments, alpha and the number of groups. Set the number of topics $K = 10$. If alpha is small, then the probability is concentrated in one topic. In this case, documents will be nearly pure, with all words drawn from a single topic, or perhaps 2. ```{r} alpha <- 0.05 K <- 10 # number of topics par(mfrow=c(2,2)) plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K)) plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K)) ``` As alpha increases, the distribution over topics becomes diffuse. ```{r} alpha <- 0.5 par(mfrow=c(2,2)) plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K)) plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K)) ``` ```{r} alpha <- 5 par(mfrow=c(2,2)) plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K)) plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K)) ``` Consequently, small values of $\alpha$ imply documents are pure, drawn from very few topics, whereas larger values of $\alpha$ indicate documents that are "blurry" and mix the topics. Another important characteristic of the model is how distinct the topics are themselves. Do topics have overlapping words, or are they mutually exclusive? For these simulations, the constant $\alpha_P$ controls this property. Small $\alpha_P$ means essentially distinct topics, larger values imply more common words. ```{r} n.vocab <- 1000 # size of vocabulary P <-matrix(0,nrow=K,ncol=n.vocab) # dist over words for each topic alpha.P <- 0.05 # small alpha implies less overlap [ 0.05 0.10 ] set.seed(6382) for(i in 1:K) P[i,] <- rdirichlet(alpha.P,n.vocab) P <- P[,order(colSums(P), decreasing=TRUE)] # sort so common types are first rowSums(P) # check that each sums to 1 ``` Here are some examples. (Using the square roots of the probabilities shows a bit more of the variation; without it there's a blob near zero.) ```{r} par(mfrow=c(1,2)) plot(P[1,], xlab="Vocabulary", ylab="Probability") # topic dist plot(sqrt(P[1,]),sqrt(P[2,]), # disjoint if alpha.P = 0.01, some common if .1 xlab=expression(sqrt("P"[1])),ylab=expression(sqrt("P"[2]))) ``` Now put labels on the "words". Low entropy words are those that by-and-large appear in only one topic. Higher entropy words appear in several and don't help much to identify the topic. 
```{r}
ent <- apply(P,2,entropy)        # calc entropy for each word
low.ent <- ent < 0.25            # low entropy -> predictable
wordTypeTopic <- rep("Mix", ncol(P))
wordTypeTopic[low.ent] <- paste(apply(P[,low.ent],2,which.max))
# check assignment
plot(ent, xlab="word types", ylab="entropy")
abline(h=0.25, col='gray')
```

```{r}
wordTypeTopic[1:10]
round(P[,1:10],2)
```

Now use these topic distributions to simulate documents from the topic model.

```{r}
alpha <- 0.05                      # mix of topics within documents
n <- 2000                          # number of documents
theta <- matrix(0,nrow=n,ncol=K)   # expected topic mix in each document
Z     <- matrix(0,nrow=n,ncol=K)   # realized words from each topic
avg.len <- 100                     # avg document length
set.seed(6783)
doc.len <- sort(rpois(n,avg.len), decreasing=TRUE)       # Poisson lengths
for(i in 1:n) {
	theta[i,] <- rdirichlet(alpha, K)
	Z[i,] <- as.vector(rmultinom(1,doc.len[i],theta[i,]))  # mix of topics
}
```

Plot the mix of topics in one of the documents.

```{r}
plot(theta[1,], xlab="Topic", ylab="Share of Vocabulary",
     main="Topic Mix for One Document")
```

Compute the document-term matrix.

```{r}
C <- matrix(0,nrow=n, ncol=n.vocab)
C.ev <- C                          # expected value of C
for(i in 1:n) {
	C.ev[i,] <- doc.len[i] * theta[i,] %*% P   # expected counts: doc length times word probabilities
	for(k in 1:K) if(Z[i,k]>0) C[i,] <- C[i,] + rmultinom(1,Z[i,k],P[k,])
}
```

Do we get something like a Zipf distribution? Close, though with a somewhat more extreme concave shape.

```{r}
mj <- colSums(C)
zipf_plot(mj)
```

Let's see what LSA thinks of the simulated document-term matrix, with CCA scaling. Does a plot of the singular values from the wine data resemble this one?

```{r}
mj <- colSums(C)                          # freq of word types
ni <- rowSums(C)                          # doc.len
C.cca <- C / sqrt(ni)
C.cca <- t(t(C.cca) / sqrt(pmax(1,mj)))   # avoid zero divisor
udv.cca <- svd(C.cca)
U <- udv.cca$u
V <- udv.cca$v
plot(udv.cca$d[1:100], log="xy", xlab="Component", ylab="Singular Value")
```

As in the LSA analysis of the wines, the leading singular vector is determined by the number of words in a document.

```{r}
par(mfrow=c(1,2))
plot(sqrt(ni), -udv.cca$u[,1],
     xlab=expression(sqrt("n"["i"])), ylab=expression("U"[1]))
plot(sqrt(mj), -udv.cca$v[,1],
     xlab=expression(sqrt("m"["j"])), ylab=expression("V"[1]))
```

The other components are more interesting and differ from those seen in the LSA of the wine ratings.

```{r}
pairs(U[,2:5], labels=c(expression("U"[2]), expression("U"[3]),
                        expression("U"[4]), expression("U"[5])))
```

```{r}
rownames(V) <- wordTypeTopic
plot_loadings(V, 2, 3, 0.07, cex=1.0)
plot_loadings(V, 4, 5, 0.07, cex=1.0)
```

# Fitting topic models: simulated data

Let's see how well the methods from `topicmodels` recover the structure in the simulated topics. The software is simple to run. The code sometimes runs better if you are willing to tell it what to use for $\alpha$ (the parameter that controls the purity of documents). You need to pick the number of topics. Set the `seed` argument to get reproducibility.

```{r}
n.topics = 10
colnames(C) <- wordTypeTopic
lda <- LDA(C, n.topics, control = list(seed = 1234))   # random initialization
lda
```

The properties of the estimated model are held in `attributes` of the `lda` object.

```{r}
names(attributes(lda))
```

It estimated the value of $\alpha$ quite well (though we don't have a standard error for that estimate).

```{r}
alpha
attributes(lda)$alpha
```

Beta holds *logs* of the probability distributions over the word types.
The first row of beta, for example, estimates the distribution over types that make up the first topic. These rows provide estimates of the probability distributions $P_k$. Notice that the indices are arbitrary: the estimated model may label $P_1$ as the 5th topic.

```{r}
beta <- attributes(lda)$beta
dim(beta)
colnames(beta) <- attributes(lda)$terms   # the types
```

Here's an example. Based on the words with higher probability, the first estimated probability distribution evidently estimates $P_7$ and the second estimates $P_3$.

```{r}
p1 <- exp(beta[1,])
sum(p1)
sort(p1, decreasing=TRUE)[1:7]
```

```{r}
p2 <- exp(beta[2,])
sum(p2)
sort(p2, decreasing=TRUE)[1:7]
```

Those are good estimates of the distributions.

```{r}
par(mfrow=c(1,2))
plot(p1, P[7,]); abline(a=0, b=1, col='gray')
plot(p2, P[3,]); abline(a=0, b=1, col='gray')
```

The component `gamma` estimates the mix of topics in the documents, the parameter denoted $\theta$ in the simulation and slides. It's easy to get confused with so many parameters, so use the shapes of the estimates to keep you on the right track. There has to be a dimension in `gamma` that goes with the number of rows of the data (rows of the DTM). We have 2000 documents and 10 topics -- matching the shape of `gamma`.

```{r}
theta.hat <- attributes(lda)$gamma
dim(theta.hat)        # mix of topics over documents
```

When looking at these, remember that the first topic is "really" the seventh in the simulation ($P_7$), and that the second topic corresponds to $P_3$.

```{r}
round(theta.hat[1:10,],2)    # for first 10 documents
```

How well do the topic assignments for these documents correspond to the simulated topics? To answer, consider document 7 in this table. The simulation parameters for this document show that it's very pure, coming from the third topic -- which is exactly what the topic model estimates. (Again, this is tricky: the second estimated topic is really the third true topic.)

```{r}
round(theta[7,],3)
```

Computing the matrix of correlations shows you which is which: look for a correlation near 1 in a row or column. For instance, $P_1$ is obviously estimated in the 9th position, and $P_2$ in the 8th. $P_3$ is the second estimate, as found previously. $P_4$ is less clear, but evidently estimated by the 7th -- though neither $P_4$ nor $P_{10}$ is well determined.

```{r}
r <- cor(theta, theta.hat)
rownames(r) <- paste("P",1:K)
colnames(r) <- paste("Est",1:K)
round(r,2)
```

The helper function `topics` shows the most likely topics in each document. You pick how many topics to see (but it does not show you the weights on them).

```{r}
x <- topics(lda, 2)
dim(x)
```

```{r}
x[,1:10]
```

The helper `terms` shows the word types in each topic. Again, no weights. This function will be more useful when the word types are more interesting.

```{r}
x <- terms(lda,5)
dim(x)
```

```{r}
x
```

```{r}
names(attributes(lda))
```

```{r}
ll <- attributes(lda)$loglikelihood
str(ll)
```

For this simulated example (in which the topics are well separated), `topicmodels` is able to separate them nicely. How, though, would we know to estimate 10 topics? In practice, you don't know in advance; you have to fit several models and compare the fits using the overall log-likelihood of the estimated model.
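Since that means fitting the model several times, one compact way to organize the comparison is a short loop over candidate values of $K$. This is only a sketch using the same `LDA` call as above; the grid of values is illustrative, and the same fits are run one at a time below.

```{r}
# Sketch: fit several candidate numbers of topics and compare the overall log-likelihoods.
# (Each fit can take a while on 2000 documents; the individual fits also appear below.)
candidate.K <- c(5, 8, 9, 10, 11)
fits <- lapply(candidate.K, function(k) LDA(C, k, control = list(seed = 1234)))
data.frame(K = candidate.K,
           logLik = sapply(fits, function(f) as.numeric(logLik(f))))
```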
The log-likelihood attribute gives you the contribution to the overall log-likelihood from each document. (You can sometimes use this to spot an odd case; none are particularly unusual here because these documents are simulated.)

```{r}
ll <- attributes(lda)$loglikelihood
length(ll)
```

```{r}
hist(ll)
```

Add these up, or just call the function `logLik`. The `logLik` function adds them up for you *and* reports a degrees-of-freedom count that measures the size of the model.

```{r}
sum(ll)
logLik(lda)
```

To find the number of topics, compare the log-likelihoods of several fits with different numbers of topics. Negative log-likelihoods are analogous to residual sums of squares in regression. As in that situation (think of $R^2$), the log-likelihood improves with model size and complexity. The package authors use cross-validation to pick the number of topics. (It was not clear how they did that, however.)

```{r}
n.topics <- 5
lda5 <- LDA(C, n.topics, control = list(seed = 1234))
logLik(lda5)
```

```{r}
n.topics <- 8
lda8 <- LDA(C, n.topics, control = list(seed = 1234))
logLik(lda8)
```

```{r}
n.topics <- 9
lda9 <- LDA(C, n.topics, control = list(seed = 1234))
logLik(lda9)
```

```{r}
n.topics <- 11
lda11 <- LDA(C, n.topics, control = list(seed = 1234))
logLik(lda11)
```

# Fitting topic models: real data

Like LSA, topic models take a bag-of-words approach, so we begin by preparing the document-term matrix.

```{r}
Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
```

This time remove the stop words.

```{r}
WineCorpus <- Corpus(VectorSource(Wine$description))

replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))

WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.')    # avoid run-in words
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english"))
```

Now compute the document-term matrix and the row (`ni`) and column (`mj`) marginal totals. This DTM is a little smaller than those in prior examples computed from the wine reviews, with fewer types (5,412) than in the LSA analysis.

```{r}
dtm <- DocumentTermMatrix(WineCorpus)
dtm

ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj)      # for convenience and clarity
```

The corpus consists of 490,734 tokens (compared to 607,335 tokens with the stop words included).

```{r}
sum(as.matrix(dtm))
```

Rare types remain and will be replaced by the symbol "OOV" as in the LSA analysis. Stop words are generally far too common to fall among these rare types, so these counts are similar to those in the prior analysis that kept the stop words.

```{r}
sum(mj==1)
sum(mj==2)
sum(mj==3)
```

The following commands are the same as in the LSA analysis.

```{r}
o <- order(mj, decreasing=TRUE)    # biggest to smallest
dtm <- dtm[,o]                     # permute the columns
mj  <- mj[o]

dtm.oov <- dtm[,10 <= mj]
dtm.oov <- cbind(as.matrix(dtm.oov), rowSums(as.matrix(dtm[,mj < 10])))

names.oov <- c(names(mj[10 <= mj]), 'OOV')
mj.oov <- c(mj[10 <= mj], sum(mj[mj < 10]))
ni.oov <- ni

colnames(dtm.oov) <- names.oov
names(mj.oov) <- names.oov
dim(dtm.oov)
```

We need to tell the software how many topics to fit. Because the algorithm uses a randomized procedure to initialize estimates (such as for the topic distributions), set the seed to be able to reproduce the results. It may also be useful to set the parameter $\alpha$ that controls the "document complexity".

```{r}
n.topics = 8       # Why 8?  Eight topics format nicely in the output!
lda <- LDA(dtm.oov, n.topics, control = list(seed = 1234))
# lda <- LDA(dtm.oov, n.topics, control = list(alpha=0.1, seed = 1234))   # set alpha
lda
```

This estimate of $\alpha$ seems too large, but let's see what we got. (Setting a smaller value of `alpha` seems a bit artificial and forces the software to fit a model it does not "like" -- and it didn't seem to help much.)

```{r}
attributes(lda)$alpha
```

Use the functions `terms` and `topics` to obtain interpretive clues to what the topic model has found. This shows the top 10 word types in each of the topics.

```{r}
terms(lda,10)
```

And `topics` gives the composition of the reviews, showing the leading topics in each document. One topic seems to be the "everything" topic.

```{r}
x <- t(topics(lda,3))
x[1:10,]
```

It seems that too many "common" words have overwhelmed the topic modeling. One simple remedy is to remove from the analysis the words that show up in almost every document.

```{r}
C <- as.matrix(dtm.oov)
dim(C)
```

```{r}
prop <- colSums(0 < C)/nrow(C)
tibble(prop=prop) %>%
	ggplot(aes(prop)) +
	geom_histogram() +
	scale_x_log10(breaks=c(0.01,0.05,0.25,1))
```

```{r}
sort(prop, decreasing=TRUE)[1:20]
```

Let's try topic modeling without the types that appear in more than 25% of the documents.

```{r}
C <- C[,order(prop, decreasing=TRUE)]
C <- C[,-(1:15)]
colnames(C)[1:4]
```

```{r}
n.topics = 8
lda <- LDA(C, n.topics, control = list(seed = 1234))
lda
```

The estimate of $\alpha$ remains large, but let's see what we got this time.

```{r}
attributes(lda)$alpha
```

Use the functions `terms` and `topics` to obtain interpretive clues to what the topic model has found. This shows the top 10 word types in each of the topics. Again, the topics are dominated by the common words.

```{r}
terms(lda,10)
```

So I will get more aggressive and wipe out more of these common words.

```{r}
prop <- colSums(0 < C)/nrow(C)
sum(0.10 < prop)
```

```{r}
names(prop[0.10 < prop])
```

```{r}
C <- C[,-(1:33)]
```

Check to make sure I've not wiped out an entire document.

```{r}
any(rowSums(C)==0)
```

Okay, now fit the topic model.

```{r}
n.topics = 8
lda <- LDA(C, n.topics, control = list(seed = 1234))
attributes(lda)$alpha
```

```{r}
terms(lda,10)
```

Perhaps better, but it seems there just aren't very many well-separated topics in these reviews. So look for fewer, say two.

```{r}
n.topics = 2
lda <- LDA(C, n.topics, control = list(seed = 1234))
attributes(lda)$alpha
```

```{r}
terms(lda,10)
```

Perhaps we should not be too surprised that topic models don't seem well matched to these data. Recall the difference between the spectrum of the simulated data (where the number of topics was clear) and that of the wine data. There just is not a clearly defined set of well-separated topics.
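To make that comparison concrete, here is a sketch (assuming the wine `dtm.oov` and the simulation's `udv.cca` are still in the workspace) that overlays the CCA-scaled singular values of the wine DTM on those of the simulated corpus; the scaling mirrors the earlier simulated-data code.

```{r}
# Sketch: CCA-scaled spectrum of the wine DTM, overlaid on the simulated spectrum.
# Assumes dtm.oov (wine reviews) and udv.cca (simulated corpus) remain in the workspace.
W <- as.matrix(dtm.oov)
W.cca <- W / sqrt(pmax(1, rowSums(W)))             # scale rows by document length
W.cca <- t(t(W.cca) / sqrt(pmax(1, colSums(W))))   # scale columns by type frequency
d.wine <- svd(W.cca)$d

plot(udv.cca$d[1:100], log="xy", xlab="Component", ylab="Singular Value")   # simulated corpus
points(d.wine[1:100], col="gray")                                           # wine reviews
```

The thing to look for is whether the wine spectrum shows the kind of sharp break after a small number of components that made the number of topics clear in the simulation.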