Text as Data: Topic Models
Robert Stine
July 2017

Setup R

The methods in this notebook require the topicmodels package in addition to the standard list. (A paper accompanied the release of this package; see Grün and Hornik (2011), topicmodels: An R Package for Fitting Topic Models, Journal of Statistical Software, 40.)

require(tm)
Loading required package: tm
Loading required package: NLP
require(topicmodels)
Loading required package: topicmodels
require(stringr)
Loading required package: stringr
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------------------------------
annotate(): ggplot2, NLP
filter():   dplyr, stats
lag():      dplyr, stats
source("text_utils.R")

Simulation

You can learn a lot about what a topic model does by using it to simulate text, which we can then study using the methods covered previously. Here’s a compact summary of the generative algorithm:

Given \(\alpha\) and \(K\) topics, generate each document in the simulated corpus as follows:

1. Draw topic proportions \(\theta \mid \alpha \sim \mathrm{Dir}(\alpha)\) for the document. These are the probabilities used to sample words in the next step.
2. For each word \(w_i\) in the document:
   (a) Draw the topic assignment \(z_i \mid \theta \sim \mathrm{Mult}(\theta)\); \(z_i\) indicates the topic.
   (b) Draw the word \(w_i \mid z_i, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_i})\), where \(\beta_{z_i}\) is the word distribution for topic \(z_i\).

To illustrate the role of the parameter alpha (\(\alpha\)) and how it affects the probabilities \(\theta\) for a document, the function rdirichlet (from \(\tt text\_utils.R\)) simulates these draws.

Each draw is a discrete probability distribution over a given number of categories. The function has two arguments: alpha and the number of groups. Set the number of topics \(K = 10\). If alpha is small, the probability is concentrated on one topic, so documents will be nearly pure, with all words drawn from a single topic, or perhaps two.
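
The rdirichlet used here comes from text_utils.R and is not listed in this notebook. A minimal sketch of such a function, assuming it draws a symmetric Dirichlet by normalizing independent gamma variates, might look like the following (the actual definition may differ).

# Sketch only: symmetric Dirichlet draw via normalized gamma variates.
rdirichlet_sketch <- function(alpha, k) {
    g <- rgamma(k, shape = alpha, rate = 1)
    g / sum(g)                      # probabilities that sum to 1
}
round(rdirichlet_sketch(0.05, 10), 3)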

alpha  <- 0.05
K      <- 10    # number of topics
par(mfrow=c(2,2))
    plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
    plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))

As alpha increases, the distribution over topics becomes diffuse.

alpha  <- 0.5
par(mfrow=c(2,2))
    plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
    plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))

alpha  <- 5
par(mfrow=c(2,2))
    plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
    plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))

Consequently, small values of \(\alpha\) imply documents are pure, drawn from very few topics, whereas larger values of \(\alpha\) indicate documents that are “blurry” and mix the topics.
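
One way to see this numerically (a quick check added here, not part of the original analysis): the mean of \(\theta\) is \(1/K\) no matter what \(\alpha\) is; \(\alpha\) only governs how far individual draws stray from that mean.

draws <- t(replicate(1000, rdirichlet(0.05, K)))   # many draws with a small alpha
round(colMeans(draws), 2)        # each component averages about 1/K = 0.1
round(apply(draws, 2, sd), 2)    # but the spread is large: single draws are nearly pure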

Another important characteristic of the model is how distinct the topics themselves are. Do topics have overlapping words, or are they mutually exclusive? For these simulations, the constant \(\alpha_P\) controls this property: small \(\alpha_P\) produces essentially distinct topics, while larger values produce topics that share more words.

n.vocab <- 1000                     # size of vocabulary
P <-matrix(0,nrow=K,ncol=n.vocab)   # dist over words for each topic
alpha.P <- 0.05                     # small alpha implies less overlap  [ 0.05 0.10 ]
set.seed(6382)
for(i in 1:K) P[i,] <- rdirichlet(alpha.P,n.vocab)
P <- P[,order(colSums(P), decreasing=TRUE)]         # sort so common types are first
rowSums(P)                          # check that each sums to 1
 [1] 1 1 1 1 1 1 1 1 1 1

Here are some examples. (Using the square roots of the probabilities shows a bit more of the variation; without it there’s a blob near zero.)

par(mfrow=c(1,2))
    plot(P[1,], xlab="Vocabulary", ylab="Probability")  # topic dist
    plot(sqrt(P[1,]),sqrt(P[2,]),                       # disjoint if alpha.P = 0.01, some common if .1
        xlab=expression(sqrt("P"[1])),ylab=expression(sqrt("P"[2])))        

Now put labels on the “words”. Low entropy words are those that by-and-large appear in only one topic. Higher entropy words appear in several and don’t help much to identify the topic. (The function entropy is defined in \(\tt text\_utils.R\).)
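
The entropy function itself lives in text_utils.R and is not shown. A plausible sketch, assuming it normalizes a column of \(P\) and computes Shannon entropy (the actual definition and log base may differ), is:

# Sketch only: entropy of one word's distribution across the K topics.
entropy_sketch <- function(p) {
    p <- p / sum(p)                 # normalize the column of P
    p <- p[p > 0]                   # drop zeros to avoid log(0)
    -sum(p * log2(p))
}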

ent <- apply(P,2,entropy)                   # calc entropy for each word
low.ent <- ent < 0.25                       # low entropy -> predictable
wordTypeTopic <- rep("Mix", ncol(P))
wordTypeTopic[low.ent] <- paste(apply(P[,low.ent],2,which.max))
#   check assignment
plot(ent, xlab="word types", ylab="entropy")
abline(h=0.25, col='gray')

wordTypeTopic[1:10]
 [1] "Mix" "Mix" "Mix" "Mix" "Mix" "Mix" "2"   "Mix" "10"  "10" 
round(P[,1:10],2)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] 0.05 0.00 0.07 0.01 0.05 0.01 0.00 0.00 0.00  0.00
 [2,] 0.02 0.01 0.00 0.00 0.00 0.00 0.08 0.00 0.00  0.00
 [3,] 0.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.00
 [4,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.00
 [5,] 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00  0.00
 [6,] 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.00  0.00
 [7,] 0.00 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00  0.00
 [8,] 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.00
 [9,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00  0.00
[10,] 0.00 0.02 0.00 0.00 0.00 0.07 0.00 0.06 0.07  0.07

Now use these topic distributions to simulate documents from the topic model.

alpha <- 0.05                       # mix of topics within documents
n     <- 2000                       # number of documents
theta <- matrix(0,nrow=n,ncol=K)    # expected topic mix in each document
Z     <- matrix(0,nrow=n,ncol=K)    # realized words from each topic
avg.len <- 100                      # avg document length
set.seed(6783)
doc.len <- sort(rpois(n,avg.len), decreasing=TRUE)  # Poisson lengths
for(i in 1:n) {                     
    theta[i,] <- rdirichlet(alpha, K)       
    Z[i,] <- as.vector(rmultinom(1,doc.len[i],theta[i,])) # mix of topics
}

Plot the mix of topics for one of the documents.

plot(theta[1,], xlab="Topic", ylab="Share of Vocabulary", main="Topic Mix for One Document")

Compute the document-term matrix.

C <- matrix(0,nrow=n, ncol=n.vocab)  
C.ev <- C;                              # expected value of C
for(i in 1:n) { 
    C.ev[i,] <- doc.len[i] * theta[i,] %*% P # prob distribution over words for each doc        
    for(k in 1:K) if(Z[i,k]>0) C[i,]<-C[i,]+rmultinom(1,Z[i,k],P[k,])
}
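
The expected counts in C.ev are not examined below; as a quick sanity check (added here), the realized counts for a document should scatter around their expected values.

plot(C.ev[1,], C[1,], xlab="Expected count", ylab="Realized count")
abline(a=0, b=1, col='gray')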

Do we get something like a Zipf distribution? Roughly, though with a somewhat more extreme, concave shape.

mj <- colSums(C)
zipf_plot(mj)
44 y values <= 0 omitted from logarithmic plot

Call:
lm(formula = ly ~ lx, data = df[1:min(n.fit, nrow(df)), ])

Coefficients:
(Intercept)           lx  
     8.5047      -0.4912  
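
The function zipf_plot also comes from text_utils.R. Judging from its output above, it plots frequency against rank on log-log axes and fits a line to the most common types; a rough stand-in, assuming that behavior, is:

# Sketch only: rank-frequency plot with a fitted power law.
# Details of zipf_plot (e.g., how many points it fits) may differ.
freq <- sort(mj, decreasing=TRUE)
rnk  <- seq_along(freq)
plot(rnk, freq, log="xy", xlab="Rank", ylab="Frequency")
keep <- freq > 0                              # the log scale drops zero counts anyway
coef(lm(log(freq[keep]) ~ log(rnk[keep])))    # slope near -1 for ideal Zipf behavior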

Let’s see what LSA thinks of the simulated document-term matrix, with CCA scaling. Does a plot of the singular values from the wine data resemble this one?

mj <- colSums(C)        # freq of word types
ni <- rowSums(C)        # doc.len
    
C.cca  <-  C / sqrt(ni)                     # scale rows by document length
C.cca  <-  t(t(C.cca) / sqrt(pmax(1,mj)))   # then scale columns; pmax avoids a zero divisor
udv.cca <- svd(C.cca)
U <- udv.cca$u  
V <- udv.cca$v
        
plot(udv.cca$d[1:100], log="xy", xlab="Component", ylab="Singular Value")

As in the LSA analysis of the wines, the leading singular vector is determined by the number of words in a document.

par(mfrow=c(1,2))
    plot(sqrt(ni),-udv.cca$u[,1], xlab=expression(sqrt("n"["i"])),ylab=expression("U"[1]))
    plot(sqrt(mj),-udv.cca$v[,1], xlab=expression(sqrt("m"["j"])),ylab=expression("V"[1]))
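
A numeric version of the same point (added here): the leading singular vectors are almost perfectly aligned with the square roots of the margins.

cor(sqrt(ni), udv.cca$u[,1])    # magnitude near 1; the sign of a singular vector is arbitrary
cor(sqrt(mj), udv.cca$v[,1])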

The other components are more interesting and different from those seen in the LSA of the wine ratings.

pairs(U[,2:5], 
      labels=c(expression("U"[2]), expression("U"[3]),  
               expression("U"[4]), expression("U"[5])))

j <- 2; k <- j+1;  
rownames(V) <- wordTypeTopic
plot_loadings(V, 2, 3, 0.07,cex=1.0)

plot_loadings(V, 4, 5, 0.07,cex=1.0)

Fitting topic models: simulated data

Let’s see how well methods from topicmodels recover the structure in the simulated topics. The software is simple to run. The code sometimes runs better if you are willing to tell it what to use for \(\alpha\) (the parameter that controls the purity of documents). You need to pick the number of topics. Set the seed argument to get reproducibility.

n.topics = 10
colnames(C) <- wordTypeTopic
lda <- LDA(C, n.topics, control = list(seed = 1234))
lda
A LDA_VEM topic model with 10 topics.

The properties of the estimated model are held in attributes of the lda object.

names(attributes(lda))
 [1] "control"         "alpha"           "call"            "Dim"             "k"              
 [6] "terms"           "documents"       "beta"            "gamma"           "wordassignments"
[11] "loglikelihood"   "iter"            "logLiks"         "n"               "class"          

It estimated the value of \(\alpha\) quite well (though we don’t have a standard error on that).

alpha
[1] 0.05
attributes(lda)$alpha
[1] 0.05355063

Beta holds logs of the probability distributions over the word types. The first row of beta, for example, estimates the distribution over types that go into the first topic. These provide estimates of the probability distributions \(P_k\). Notice that the indices are arbitrary: the estimated model may label \(P_1\) as the 5th topic.

beta <- attributes(lda)$beta
dim(beta)
[1]   10 1000
colnames(beta) <- attributes(lda)$terms   # the types

Here’s an example. Based on the words with higher probability, the first estimated probability distribution evidently estimates \(P_7\) and the second estimates \(P_3\).

p1 <- exp(beta[1,])
sum(p1)
[1] 1
sort(p1, decreasing=TRUE)[1:7]
       Mix        Mix          7          7          7        Mix        Mix 
0.05832417 0.05818662 0.05565437 0.04518013 0.03825313 0.03473340 0.03080214 
p2 <- exp(beta[2,])
sum(p2)
[1] 1
sort(p2, decreasing=TRUE)[1:7]
       Mix          3        Mix        Mix        Mix          3        Mix 
0.12531119 0.07178982 0.04802098 0.04178264 0.03468565 0.03186862 0.02593108 

Those are good estimates of the distributions.

par(mfrow=c(1,2))
    plot(p1, P[7,]); abline(a=0,b=1,col='gray')
    plot(p2, P[3,]); abline(a=0,b=1,col='gray')
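
Rather than matching topics one at a time, we can correlate every estimated word distribution with every true \(P_k\) (a convenience added here; it relies on the columns of beta being in the same order as the columns of P, as the plots above assume).

match.P <- cor(t(exp(beta)), t(P))   # rows: estimated topics, columns: true topics
apply(match.P, 1, which.max)         # best-matching true topic for each estimated topic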

The component gamma estimates the mix of topics in documents, the parameter denoted \(\theta\) in the simulation and slides. It’s easy to get confused with so many parameters, so use the shapes of the estimates to keep you on the right track. There has to be a dimension in gamma that matches the number of rows of the data (rows of the DTM). We have 2000 documents and 10 topics – matching the shape of gamma.

theta.hat <- attributes(lda)$gamma
dim(theta.hat)              # mix of topics over documents
[1] 2000   10

When looking at these, remember that the first topic is “really” the seventh in the simulation (\(P_7\)), and that the second topic corresponds to \(P_3\).

round(theta.hat[1:10,],2)   # for first 10 documents
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] 0.00 0.00 0.11 0.00 0.00 0.88 0.00 0.00 0.00  0.00
 [2,] 0.24 0.00 0.00 0.34 0.00 0.08 0.13 0.00 0.20  0.00
 [3,] 0.04 0.00 0.00 0.00 0.00 0.00 0.58 0.00 0.29  0.09
 [4,] 0.39 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00  0.58
 [5,] 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.28  0.69
 [6,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.02  0.95
 [7,] 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.00
 [8,] 0.00 0.01 0.00 0.00 0.03 0.00 0.00 0.00 0.00  0.96
 [9,] 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.40
[10,] 0.42 0.00 0.00 0.33 0.01 0.23 0.00 0.00 0.00  0.00

How well do the topic assignments for these documents correspond to the simulated topics? To answer, consider document 7 in this table. The simulation parameters for this document show that it’s very pure, coming from the third topic – which is exactly what the topic model estimates. (Again, this is tricky. The second estimated topic is really the third true topic.)

round(theta[7,],3)
 [1] 0.000 0.000 0.996 0.004 0.000 0.000 0.000 0.000 0.000 0.000

Computing the matrix of correlations shows you which is which: look for a correlation near 1 in a row or column. For instance, \(P_1\) is obviously estimated in the 9th position, and \(P_2\) in the 8th. \(P_3\) is the second estimate, as found previously. \(P_4\) is less clear, but evidently estimated by the 7th – yet neither \(P_4\) nor \(P_{10}\) is well determined.

r <- cor(theta, theta.hat)
rownames(r) <- paste("P",1:K)
colnames(r) <- paste("Est",1:K)
round(r,2)
     Est 1 Est 2 Est 3 Est 4 Est 5 Est 6 Est 7 Est 8 Est 9 Est 10
P 1  -0.12 -0.11 -0.11 -0.07 -0.12 -0.12 -0.17 -0.11  1.00  -0.13
P 2  -0.11 -0.10 -0.11 -0.06 -0.12 -0.11 -0.18  1.00 -0.10  -0.12
P 3  -0.10  1.00 -0.09 -0.03 -0.12 -0.09 -0.15 -0.10 -0.12  -0.10
P 4  -0.11 -0.11 -0.11 -0.07 -0.11 -0.12  0.70 -0.13 -0.11  -0.12
P 5  -0.10 -0.09  1.00 -0.06 -0.10 -0.10 -0.16 -0.11 -0.11  -0.12
P 6  -0.10 -0.10 -0.12 -0.08 -0.11 -0.13 -0.17 -0.13 -0.13   1.00
P 7   0.96 -0.09 -0.10  0.57 -0.10 -0.12 -0.17 -0.11 -0.12  -0.11
P 8  -0.11 -0.10 -0.10 -0.07 -0.11  1.00 -0.17 -0.11 -0.12  -0.13
P 9  -0.09 -0.12 -0.10 -0.06  1.00 -0.11 -0.16 -0.11 -0.12  -0.11
P 10 -0.11 -0.09 -0.11 -0.06 -0.10 -0.10  0.63 -0.11 -0.13  -0.10
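
To read the pairing off programmatically (a small convenience, not in the original analysis), take the largest correlation in each row.

best <- apply(r, 1, which.max)    # estimated topic that best tracks each true topic
best
round(r[cbind(1:K, best)], 2)     # the corresponding correlations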

The helper function topics shows the most likely topics in each document. You pick how many topics to see (but it does not show you the weights on these).

x <- topics(lda, 2)
dim(x)
[1]    2 2000
x[,1:10]

The helper terms shows the word types in each topic. Again, no weights. This function will be more useful when the word types are more interesting.

x <- terms(lda,5)
dim(x)
[1]  5 10
x
     Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
[1,] "Mix"   "Mix"   "Mix"   "Mix"   "9"     "Mix"   "Mix"   "2"     "Mix"   "6"     
[2,] "Mix"   "3"     "5"     "Mix"   "Mix"   "8"     "10"    "Mix"   "1"     "6"     
[3,] "7"     "Mix"   "5"     "7"     "Mix"   "Mix"   "10"    "Mix"   "1"     "Mix"   
[4,] "7"     "Mix"   "5"     "Mix"   "Mix"   "Mix"   "Mix"   "Mix"   "Mix"   "6"     
[5,] "7"     "Mix"   "Mix"   "Mix"   "Mix"   "Mix"   "10"    "2"     "Mix"   "Mix"   
names(attributes(lda))
 [1] "control"         "alpha"           "call"            "Dim"             "k"              
 [6] "terms"           "documents"       "beta"            "gamma"           "wordassignments"
[11] "loglikelihood"   "iter"            "logLiks"         "n"               "class"          
ll <- attributes(lda)$loglikelihood
str(ll)
 num [1:2000] -639 -732 -755 -668 -669 ...

For this simulated example (in which the topics are well separated), topicmodels is able to separate them nicely.

How, though, would we know to estimate 10 topics? In practice, you don’t, and you have to fit several and compare the fits using the overall log-likelihood of the estimated model. The log-likelihood attribute gives the contribution to the overall log-likelihood from each document. (You can sometimes use this to spot an odd case. None are particularly unusual here – because these are simulated.)

ll <- attributes(lda)$loglikelihood
length(ll)
[1] 2000
hist(ll)

Add these up, or just call the function logLik. The logLik function sums them for you and also reports a degrees-of-freedom count that reflects the size of the model.

sum(ll)
[1] -994639
logLik(lda)
'log Lik.' -994639 (df=10001)

To find the number of topics, compare the log-likelihoods for several fits with different numbers of topics. Negative log-likelihoods are analogous to residual sums of squares in regression. As in that setting (think of \(R^2\)), the log-likelihood improves with model size and complexity. The package authors use cross-validation to pick the number of topics. (It was not clear how they did that, however.)

n.topics <- 5
lda5 <- LDA(C, n.topics, control = list(seed = 1234)) 
logLik(lda5)
'log Lik.' -1072540 (df=5001)
n.topics <- 8
lda8 <- LDA(C, n.topics, control = list(seed = 1234)) 
logLik(lda8)
'log Lik.' -1013195 (df=8001)
n.topics <- 9
lda9 <- LDA(C, n.topics, control = list(seed = 1234)) 
logLik(lda9)
'log Lik.' -994021.6 (df=9001)
n.topics <- 11
lda11 <- LDA(C, n.topics, control = list(seed = 1234)) 
logLik(lda11)
'log Lik.' -975162.9 (df=11001)
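
A compact way to organize this comparison (a sketch added here; refitting for each choice can be slow) is to loop over candidate numbers of topics and plot the log-likelihoods.

ks   <- c(5, 8, 9, 10, 11)
fits <- lapply(ks, function(k) LDA(C, k, control = list(seed = 1234)))
lls  <- sapply(fits, function(f) as.numeric(logLik(f)))
plot(ks, lls, type="b", xlab="Number of topics", ylab="Log-likelihood")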

Fitting topic models: real data

Like LSA, topic models take a bag-of-words approach, so we begin by preparing the document-term matrix.

Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
[1] 20508    14

This time remove the stop words.

WineCorpus <- Corpus(VectorSource(Wine$description))
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.')     # avoid run-in words
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english")) 

Now compute the document term matrix and the row (ni) and column (mj) marginal totals. This DTM is a little smaller than prior examples computed from the wine reviews, with fewer types (5,412) than in the LSA analysis.

dtm <- DocumentTermMatrix(WineCorpus)
dtm
<<DocumentTermMatrix (documents: 20508, terms: 5412)>>
Non-/sparse entries: 475966/110513330
Sparsity           : 100%
Maximal term length: 15
Weighting          : term frequency (tf)
ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj)   # for convenience and clarity

The corpus consists of 490,734 tokens (compared to 607,335 tokens with those stop words included).

sum(as.matrix(dtm))
[1] 490734

Rare types remain and will be replaced by the symbol “OOV” as in the LSA analysis. Stop words are generally far too common to fall among these rare types, so these counts are similar to those in the prior analysis that kept the stop words.

sum(mj==1)
[1] 1819
sum(mj==2)
[1] 656
sum(mj==3)
[1] 365
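
For reference (a quick check added here), this counts the types frequent enough to keep; all rarer types get pooled into the single OOV column below.

sum(10 <= mj)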

The following commands are the same as in the LSA analysis.

o <- order(mj, decreasing=TRUE)   # biggest to smallest
dtm <- dtm[,o]                    # permute the columns
mj <- mj[o]
dtm.oov <- dtm[,10 <= mj]
dtm.oov <- cbind(as.matrix(dtm.oov), rowSums(as.matrix(dtm[,mj < 10])))
names.oov  <- c(names(mj[10<=mj]), 'OOV')
mj.oov <- c(mj[10<=mj],sum(mj[mj<10]))
ni.oov <- ni                         
colnames(dtm.oov) <- names.oov
names(mj.oov) <- names.oov
dim(dtm.oov)
[1] 20508  1692

We need to tell the software how many topics to fit. Because the algorithm uses a randomized procedure to initialize estimates (such as the topic distributions), set the seed to be able to reproduce the results. It may also be useful to set the parameter \(\alpha\) that controls the “document complexity”.

n.topics = 8  # Why 8?  Eight format nicely on the output!
lda <- LDA(dtm.oov, n.topics, control = list(seed = 1234))
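# lda <- LDA(dtm.oov, n.topics, control = list(alpha=0.1, seed = 1234))  # set alpha instead of estimating it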
lda
A LDA_VEM topic model with 8 topics.

The estimated \(\alpha\) (below) seems too large, but let’s see what we got. (Setting a smaller value of alpha seems a bit artificial and forces the software to fit a model it does not “like” – and didn’t seem to help much.)

attributes(lda)$alpha
[1] 49.71126

Use the functions terms and topics to obtain interpretive clues to what the topic model has found. This shows the top 10 word types in each topic.

terms(lda,10)

And topics gives the composition of the reviews, showing the leading topics in each document. One topic appears to act as the “everything” topic.

x <- t(topics(lda,3))
x[1:10,]

It would seem that too many “common” words have overwhelmed the topic modeling. One simple remedy in this case is to remove from the analysis the words that show up in almost every document.

C <- as.matrix(dtm.oov)
dim(C)
[1] 20508  1692
prop <- colSums(0 < C)/nrow(C)
tibble(prop=prop) %>%
    ggplot(aes(prop)) + geom_histogram() + scale_x_log10(breaks=c(0.01,0.05,0.25,1))

sort(prop, decreasing=TRUE)[1:20]
   aromas    medium    finish     entry      body     fruit      full    bodied     leads 
0.9166179 0.8060269 0.5691437 0.4498732 0.4443632 0.4083772 0.3791691 0.3772674 0.3201677 
  acidity    fruity       OOV    palate       dry      fade  finishes   tannins    cherry 
0.3039789 0.2927638 0.2888141 0.2728691 0.2705286 0.2618490 0.2496587 0.2477082 0.2396626 
    apple    follow 
0.2253755 0.2106007 

Let’s try topic modeling without those that appear in more than 25% of the documents.
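
The 15 columns dropped below are exactly the types that appear in more than a quarter of the reviews; a quick count (added here) confirms the cutoff.

sum(0.25 < prop)    # 15 word types appear in more than 25% of the documents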

C <- C[,order(prop, decreasing=TRUE)]
C <- C[,-(1:15)]
colnames(C)[1:4]
[1] "finishes" "tannins"  "cherry"   "apple"   
n.topics = 8  
lda <- LDA(C, n.topics, control = list(seed = 1234))
lda
A LDA_VEM topic model with 8 topics.

The estimate of \(\alpha\) remains large, but let’s see what we got this time.

attributes(lda)$alpha
[1] 39.18128

Use the functions terms and topics to obtain interpretive clues to what the topic model has found. This shows the top 10 word types in each topic. Again, the topics are dominated by the common words.

terms(lda,10)
      Topic 1   Topic 2    Topic 3   Topic 4    Topic 5    Topic 6    Topic 7   Topic 8
 [1,] "ripe"    "cherry"   "cherry"  "spice"    "tannins"  "cherry"   "flavors" "apple"
 [2,] "dried"   "apple"    "yet"     "sweet"    "apple"    "tart"     "tannins" "good" 
 [3,] "cherry"  "finishes" "tart"    "finishes" "good"     "sweet"    "oak"     "oak"  
 [4,] "tart"    "follow"   "light"   "tangy"    "light"    "dryish"   "black"   "light"
 [5,] "berry"   "ripe"     "tannins" "berry"    "flavors"  "follow"   "dried"   "soft" 
 [6,] "oak"     "tangy"    "tangy"   "round"    "finishes" "tannins"  "earth"   "pear" 
 [7,] "pepper"  "tart"     "crisp"   "mineral"  "dried"    "finishes" "follow"  "ripe" 
 [8,] "yet"     "pepper"   "spice"   "soft"     "crisp"    "accented" "depth"   "spice"
 [9,] "flavors" "oak"      "lemon"   "flavors"  "oak"      "yet"      "pear"    "dried"
[10,] "apple"   "pear"     "supple"  "nice"     "rich"     "roasted"  "wine"    "lemon"

So I will get more aggressive and wipe out more of these common words.

prop <- colSums(0 < C)/nrow(C)
sum(0.10 < prop)
[1] 33
names(prop[0.10 < prop])
 [1] "finishes" "tannins"  "cherry"   "apple"    "follow"   "flavors"  "yet"      "tart"    
 [9] "dried"    "oak"      "spice"    "good"     "light"    "tangy"    "soft"     "ripe"    
[17] "supple"   "rich"     "round"    "black"    "sweet"    "nice"     "crisp"    "depth"   
[25] "notes"    "pear"     "wine"     "dryish"   "lemon"    "balanced" "earth"    "berry"   
[33] "citrus"  
C <- C[,-(1:33)]

Check to make sure I’ve not wiped out an entire document.

any(rowSums(C)==0)
[1] FALSE

Okay, now fit the topic model.

n.topics = 8  
lda <- LDA(C, n.topics, control = list(seed = 1234))
attributes(lda)$alpha
[1] 31.51749
terms(lda,10)
      Topic 1      Topic 2         Topic 3      Topic 4     Topic 5   Topic 6      Topic 7    
 [1,] "slightly"   "lively"        "pair"       "silky"     "pepper"  "peach"      "pepper"   
 [2,] "moderately" "concentration" "accented"   "spicy"     "earthy"  "silky"      "accented" 
 [3,] "lively"     "cedar"         "pepper"     "mineral"   "cedar"   "moderately" "vanilla"  
 [4,] "chocolate"  "earthy"        "tannin"     "custard"   "skin"    "touch"      "chocolate"
 [5,] "pie"        "tannin"        "long"       "melon"     "mineral" "spicy"      "baked"    
 [6,] "cedar"      "herbal"        "melon"      "baked"     "lively"  "apricot"    "roasted"  
 [7,] "long"       "nicely"        "moderately" "vanilla"   "creamy"  "accented"   "nut"      
 [8,] "red"        "varietal"      "firm"       "roasted"   "baked"   "peel"       "great"    
 [9,] "pineapple"  "touch"         "earthy"     "pineapple" "red"     "baked"      "delicate" 
[10,] "nut"        "drink"         "well"       "green"     "currant" "varietal"   "slightly" 
      Topic 8    
 [1,] "firm"     
 [2,] "chocolate"
 [3,] "spicy"    
 [4,] "accented" 
 [5,] "earthy"   
 [6,] "long"     
 [7,] "peel"     
 [8,] "pie"      
 [9,] "plum"     
[10,] "cedar"    

Perhaps better, but it seems there just aren’t very many well-separated topics in these reviews. So look for fewer, like 2.

n.topics = 2 
lda <- LDA(C, n.topics, control = list(seed = 1234))
attributes(lda)$alpha
[1] 39.9038
terms(lda,10)
      Topic 1      Topic 2        
 [1,] "spicy"      "concentration"
 [2,] "chocolate"  "lively"       
 [3,] "accented"   "earthy"       
 [4,] "baked"      "skin"         
 [5,] "moderately" "cedar"        
 [6,] "table"      "tannin"       
 [7,] "pepper"     "moderate"     
 [8,] "vanilla"    "mineral"      
 [9,] "silky"      "touch"        
[10,] "long"       "pepper"       

Perhaps we should not be too surprised that topic models don’t seem well matched to these data. Recall the difference between the spectrum of the simulated data (where the number of topics was clear) and that of the wine data. There just is not a clearly defined set of well-separated topics.
