The methods in this notebook require the topicmodels package in addition to the standard list. (A paper accompanied the release of this package; see Grün and Hornik (2011), topicmodels: An R Package for Fitting Topic Models, Journal of Statistical Software, 40.)
require(tm)
Loading required package: tm
Loading required package: NLP
require(topicmodels)
Loading required package: topicmodels
require(stringr)
Loading required package: stringr
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------------------------------
annotate(): ggplot2, NLP
filter(): dplyr, stats
lag(): dplyr, stats
source("text_utils.R")
You can learn a lot about what a topic model does by using it to simulate text, which we can then study using methods covered previously. Here's a compact summary of the algorithm.

Given \(\alpha\) and \(K\) topics, generate each document in the simulated corpus as follows:

1. Draw topic proportions \(\theta \mid \alpha \sim \mathrm{Dir}(\alpha)\) for the document. These are the probabilities for sampling the words in the next step.
2. For each word \(w_i\) in the document:
   (a) Draw the topic assignment \(z_i \mid \theta \sim \mathrm{Mult}(\theta)\); \(z_i\) indicates the topic.
   (b) Draw the word \(w_i \mid z_i, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_i})\); \(\beta_{z_i}\) is the word distribution for topic \(z_i\).
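The steps above translate almost directly into R. Here is a minimal sketch of generating a single document, using rdirichlet (introduced next) and assuming a \(K \times V\) matrix beta.topics whose rows hold the topic word distributions; both simulate_doc and beta.topics are placeholder names, not part of the notebook's code.

simulate_doc <- function(n.words, alpha, beta.topics) {
  K <- nrow(beta.topics)
  theta <- rdirichlet(alpha, K)                         # 1. topic proportions for this document
  z <- sample(1:K, n.words, replace=TRUE, prob=theta)   # 2(a). topic assignment for each word
  w <- sapply(z, function(k)                            # 2(b). word drawn from its topic's distribution
              sample(ncol(beta.topics), 1, prob=beta.topics[k,]))
  w                                                     # vector of word-type indices
}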
To illustrate the role of the parameter alpha (\(\alpha\)) and how it affects the probabilities \(\theta\) for a document, the function rdirichlet (from \(\tt text\_utils.R\)) simulates these draws. Each draw is a discrete probability distribution over the categories; the function has two arguments, alpha and the number of groups. Set the number of topics \(K = 10\). If alpha is small, then the probability is concentrated in one topic. In that case documents will be nearly pure, with all of the words drawn from a single topic, or perhaps two.
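In case you are curious how such draws are made, a symmetric Dirichlet draw can be built from independent Gamma draws. A minimal sketch of what rdirichlet might look like (the actual version in \(\tt text\_utils.R\) may differ in details):

rdirichlet_sketch <- function(alpha, n.groups) {
  g <- rgamma(n.groups, shape=alpha, rate=1)   # independent Gamma(alpha, 1) draws
  g / sum(g)                                   # normalize to a probability distribution
}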
alpha <- 0.05
K <- 10 # number of topics
par(mfrow=c(2,2))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
As alpha increases, the distribution over topics becomes diffuse.
alpha <- 0.5
par(mfrow=c(2,2))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
alpha <- 5
par(mfrow=c(2,2))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
Consequently, small values of \(\alpha\) imply documents are pure, drawn from very few topics, whereas larger values of \(\alpha\) indicate documents that are “blurry” and mix the topics.
Another important characteristic of the model is how distinct the topics themselves are. Do topics have overlapping words, or are they mutually exclusive? For these simulations, the constant \(\alpha_P\) controls this property. Small \(\alpha_P\) produces essentially distinct topics; larger values imply topics that share more common words.
n.vocab <- 1000 # size of vocabulary
P <- matrix(0, nrow=K, ncol=n.vocab) # dist over words for each topic
alpha.P <- 0.05 # small alpha implies less overlap [ 0.05 0.10 ]
set.seed(6382)
for(i in 1:K) P[i,] <- rdirichlet(alpha.P,n.vocab)
P <- P[,order(colSums(P), decreasing=TRUE)] # sort so common types are first
rowSums(P) # check that each sums to 1
[1] 1 1 1 1 1 1 1 1 1 1
Here are some examples. (Using the square roots of the probabilities shows a bit more of the variation; without it there’s a blob near zero.)
par(mfrow=c(1,2))
plot(P[1,], xlab="Vocabulary", ylab="Probability") # topic dist
plot(sqrt(P[1,]),sqrt(P[2,]), # disjoint if alpha.P = 0.01, some common if .1
xlab=expression(sqrt("P"[1])),ylab=expression(sqrt("P"[2])))
Now put labels on the “words”. Low-entropy words are those that by and large appear in only one topic. Higher-entropy words appear in several topics and don’t help much to identify the topic. (The function entropy is defined in \(\tt text\_utils.R\).)
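A plausible sketch of such a function, assuming the entropy is computed on the column rescaled to a distribution and normalized so a uniform spread over the K topics gives 1 (the definition in \(\tt text\_utils.R\) may differ):

entropy_sketch <- function(p) {
  K <- length(p)
  p <- p / sum(p)              # treat the column of P as a distribution over topics
  p <- p[p > 0]                # drop zeros to avoid 0 * log(0)
  -sum(p * log(p)) / log(K)    # normalized entropy: uniform -> 1, a single topic -> 0
}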
ent <- apply(P,2,entropy) # calc entropy for each word
low.ent <- ent < 0.25 # low entropy -> predictable
wordTypeTopic <- rep("Mix", ncol(P))
wordTypeTopic[low.ent] <- paste(apply(P[,low.ent],2,which.max))
# check assignment
plot(ent, xlab="word types", ylab="entropy")
abline(h=0.25, col='gray')
wordTypeTopic[1:10]
[1] "Mix" "Mix" "Mix" "Mix" "Mix" "Mix" "2" "Mix" "10" "10"
round(P[,1:10],2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.05 0.00 0.07 0.01 0.05 0.01 0.00 0.00 0.00 0.00
[2,] 0.02 0.01 0.00 0.00 0.00 0.00 0.08 0.00 0.00 0.00
[3,] 0.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[4,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[5,] 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00
[6,] 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00
[7,] 0.00 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[8,] 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[9,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00
[10,] 0.00 0.02 0.00 0.00 0.00 0.07 0.00 0.06 0.07 0.07
Now use these topic distributions to simulate documents from the topic model.
alpha <- 0.05 # mix of topics within documents
n <- 2000 # number of documents
theta <- matrix(0,nrow=n,ncol=K) # expected topic mix in each document
Z <- matrix(0,nrow=n,ncol=K) # realized words from each topic
avg.len <- 100 # avg document length
set.seed(6783)
doc.len <- sort(rpois(n,avg.len), decreasing=TRUE) # Poisson lengths
for(i in 1:n) {
theta[i,] <- rdirichlet(alpha, K)
Z[i,] <- as.vector(rmultinom(1,doc.len[i],theta[i,])) # mix of topics
}
Plot the mix of topics for one of the documents.
plot(theta[1,], xlab="Topic", ylab="Share of Vocabulary", main="Topic Mix for One Document")
Compute the document-term matrix.
C <- matrix(0,nrow=n, ncol=n.vocab)
C.ev <- C; # expected value of C
for(i in 1:n) {
C.ev[i,] <- doc.len[i] * theta[i,] %*% P # prob distribution over words for each doc
for(k in 1:K) if(Z[i,k]>0) C[i,]<-C[i,]+rmultinom(1,Z[i,k],P[k,])
}
Do we get something like a Zipf distribution? The shape is concave, but a bit more extreme than Zipf.
mj <- colSums(C)
zipf_plot(mj)
44 y values <= 0 omitted from logarithmic plot
Call:
lm(formula = ly ~ lx, data = df[1:min(n.fit, nrow(df)), ])
Coefficients:
(Intercept) lx
8.5047 -0.4912
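The helper zipf_plot is also defined in \(\tt text\_utils.R\). Judging from its output, it does something along these lines; this is a hedged sketch, not the actual definition, and n.fit is an assumed argument name.

zipf_sketch <- function(counts, n.fit=100) {
  freq <- sort(counts[counts > 0], decreasing=TRUE)        # drop zero counts before taking logs
  df <- data.frame(lx=log(seq_along(freq)), ly=log(freq))  # log rank and log frequency
  plot(ly ~ lx, data=df, xlab="Log Rank", ylab="Log Frequency")
  fit <- lm(ly ~ lx, data=df[1:min(n.fit, nrow(df)),])     # fit a line to the most common types
  abline(fit, col="red")
  fit
}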
Let’s see what LSA thinks of the simulated document-term matrix, with CCA scaling. Does the plot of the singular values from the wine data resemble this one?
mj <- colSums(C) # freq of word types
ni <- rowSums(C) # doc.len
C.cca <- C / sqrt(ni)                   # scale rows by sqrt of document length
C.cca <- t(t(C.cca) / sqrt(pmax(1,mj))) # scale columns, avoiding a zero divisor
udv.cca <- svd(C.cca)
U <- udv.cca$u
V <- udv.cca$v
plot(udv.cca$d[1:100], log="xy", xlab="Component", ylab="Singular Value")
As in the LSA analysis of the wines, the leading singular vector is determined by the number of words in a document.
par(mfrow=c(1,2))
plot(sqrt(ni),-udv.cca$u[,1], xlab=expression(sqrt("n"["i"])),ylab=expression("U"[1]))
plot(sqrt(mj),-udv.cca$v[,1], xlab=expression(sqrt("m"["j"])),ylab=expression("V"[1]))
The other components are more interesting and different from those seen in the LSA of the wine ratings.
pairs(U[,2:5],
labels=c(expression("U"[2]), expression("U"[3]),
expression("U"[4]), expression("U"[5])))
j <- 2; k <- j+1;
rownames(V) <- wordTypeTopic
plot_loadings(V, 2, 3, 0.07,cex=1.0)
plot_loadings(V, 4, 5, 0.07,cex=1.0)
Let’s see how well methods from topicmodels recover the structure in the simulated topics. The software is simple to run. The estimation sometimes behaves better if you are willing to tell it what to use for \(\alpha\) (the parameter that controls the purity of documents). You need to pick the number of topics, and set the seed argument to get reproducibility.
n.topics = 10
colnames(C) <- wordTypeTopic
lda <- LDA(C, n.topics, control = list(seed = 1234))
lda
A LDA_VEM topic model with 10 topics.
The properties of the estimated model are held in attributes of the lda object.
names(attributes(lda))
[1] "control" "alpha" "call" "Dim" "k"
[6] "terms" "documents" "beta" "gamma" "wordassignments"
[11] "loglikelihood" "iter" "logLiks" "n" "class"
It estimated the value of \(\alpha\) quite well (though we don’t have a standard error on that).
alpha
[1] 0.05
attributes(lda)$alpha
[1] 0.05355063
The component beta holds the logs of the probability distributions over the word types. The first row of beta, for example, estimates the distribution over the types that go into the first topic. These provide estimates of the probability distributions \(P_k\). Notice that the indices are arbitrary: the estimated model may label \(P_1\) as the 5th topic.
beta <- attributes(lda)$beta
dim(beta)
[1] 10 1000
colnames(beta) <- attributes(lda)$terms # the types
Here’s an example. Based on the words with higher probability, the first estimated probability distribution evidently estimates \(P_7\) and the second estimates \(P_3\).
p1 <- exp(beta[1,])
sum(p1)
[1] 1
sort(p1, decreasing=TRUE)[1:7]
Mix Mix 7 7 7 Mix Mix
0.05832417 0.05818662 0.05565437 0.04518013 0.03825313 0.03473340 0.03080214
p2 <- exp(beta[2,])
sum(p2)
[1] 1
sort(p2, decreasing=TRUE)[1:7]
Mix 3 Mix Mix Mix 3 Mix
0.12531119 0.07178982 0.04802098 0.04178264 0.03468565 0.03186862 0.02593108
Those are good estimates of the distributions.
par(mfrow=c(1,2))
plot(p1, P[7,]); abline(a=0,b=1,col='gray')
plot(p2, P[3,]); abline(a=0,b=1,col='gray')
The component gamma estimates the mix of topics in documents, the parameter denoted \(\theta\) in the simulation and slides. It’s easy to get confused with so many parameters, so use the shapes of the estimates to keep yourself on the right track. There has to be a dimension in gamma that matches the number of rows of the data (rows of the DTM). We have 2,000 documents and 10 topics, matching the shape of gamma.
theta.hat <- attributes(lda)$gamma
dim(theta.hat) # mix of topics over documents
[1] 2000 10
When looking at these, remember that the first topic is “really” the seventh in the simulation (\(P_7\)), and that the second topic corresponds to \(P_3\).
round(theta.hat[1:10,],2) # for first 10 documents
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.00 0.00 0.11 0.00 0.00 0.88 0.00 0.00 0.00 0.00
[2,] 0.24 0.00 0.00 0.34 0.00 0.08 0.13 0.00 0.20 0.00
[3,] 0.04 0.00 0.00 0.00 0.00 0.00 0.58 0.00 0.29 0.09
[4,] 0.39 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.58
[5,] 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.28 0.69
[6,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.02 0.95
[7,] 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[8,] 0.00 0.01 0.00 0.00 0.03 0.00 0.00 0.00 0.00 0.96
[9,] 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40
[10,] 0.42 0.00 0.00 0.33 0.01 0.23 0.00 0.00 0.00 0.00
How well do the topic assignments for these documents correspond to the simulated topics? To answer, consider document 7 in this table. The simulation parameters for this document show that it is very pure, coming from the third topic, which is exactly what the topic model estimates. (Again, this is tricky: the second estimated topic is really the third true topic.)
round(theta[7,],3)
[1] 0.000 0.000 0.996 0.004 0.000 0.000 0.000 0.000 0.000 0.000
Computing the matrix of correlations shows you which is which: look for a correlation near 1 in a row or column. For instance, \(P_1\) is obviously estimated in the 9th position, and \(P_2\) in the 8th. \(P_3\) is the second estimate, as found previously. \(P_4\) is less clear; it is evidently estimated by the 7th, but neither \(P_4\) nor \(P_{10}\) is well determined.
r <- cor(theta, theta.hat)
rownames(r) <- paste("P",1:K)
colnames(r) <- paste("Est",1:K)
round(r,2)
Est 1 Est 2 Est 3 Est 4 Est 5 Est 6 Est 7 Est 8 Est 9 Est 10
P 1 -0.12 -0.11 -0.11 -0.07 -0.12 -0.12 -0.17 -0.11 1.00 -0.13
P 2 -0.11 -0.10 -0.11 -0.06 -0.12 -0.11 -0.18 1.00 -0.10 -0.12
P 3 -0.10 1.00 -0.09 -0.03 -0.12 -0.09 -0.15 -0.10 -0.12 -0.10
P 4 -0.11 -0.11 -0.11 -0.07 -0.11 -0.12 0.70 -0.13 -0.11 -0.12
P 5 -0.10 -0.09 1.00 -0.06 -0.10 -0.10 -0.16 -0.11 -0.11 -0.12
P 6 -0.10 -0.10 -0.12 -0.08 -0.11 -0.13 -0.17 -0.13 -0.13 1.00
P 7 0.96 -0.09 -0.10 0.57 -0.10 -0.12 -0.17 -0.11 -0.12 -0.11
P 8 -0.11 -0.10 -0.10 -0.07 -0.11 1.00 -0.17 -0.11 -0.12 -0.13
P 9 -0.09 -0.12 -0.10 -0.06 1.00 -0.11 -0.16 -0.11 -0.12 -0.11
P 10 -0.11 -0.09 -0.11 -0.06 -0.10 -0.10 0.63 -0.11 -0.13 -0.10
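Rather than matching topics by eye, you can let R find, for each true topic, the estimated topic with the largest correlation.

best <- apply(r, 1, which.max)    # best-matching estimate for each true topic
data.frame(true=1:K, estimate=best, r=round(r[cbind(1:K, best)], 2))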
The helper function topics shows the most likely topics in each document. You pick how many topics to see (but it does not show you the weights on these).
x <- topics(lda, 2)
dim(x)
[1] 2 2000
x[,1:10]
The helper terms shows the word types in each topic. Again, no weights. This function will be more useful when the word types are more interesting.
x <- terms(lda,5)
dim(x)
[1] 5 10
x
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
[1,] "Mix" "Mix" "Mix" "Mix" "9" "Mix" "Mix" "2" "Mix" "6"
[2,] "Mix" "3" "5" "Mix" "Mix" "8" "10" "Mix" "1" "6"
[3,] "7" "Mix" "5" "7" "Mix" "Mix" "10" "Mix" "1" "Mix"
[4,] "7" "Mix" "5" "Mix" "Mix" "Mix" "Mix" "Mix" "Mix" "6"
[5,] "7" "Mix" "Mix" "Mix" "Mix" "Mix" "10" "2" "Mix" "Mix"
ll <- attributes(lda)$loglikelihood
str(ll)
num [1:2000] -639 -732 -755 -668 -669 ...
For this simulated example (in which the topics are well separated), topicmodels is able to separate them nicely.

How, though, would we know to estimate 10 topics? In practice you don’t know, so you have to fit several models and compare the fits using the overall log-likelihood of the estimated model. The log-likelihood attribute gives the contribution to the overall log-likelihood from each document. (You can sometimes use this to spot the odd case. None are particularly unusual here, because these are simulated.)
ll <- attributes(lda)$loglikelihood
length(ll)
[1] 2000
hist(ll)
Add these up, or just call the function logLik, which sums them for you and reports a version of the degrees of freedom that counts the size of the model.
sum(ll)
[1] -994639
logLik(lda)
'log Lik.' -994639 (df=10001)
To find the number of topics, compare the log-likelihoods of several fits with different numbers of topics. Negative log-likelihoods are analogous to residual sums of squares in regression. As in that situation (think of \(R^2\)), the log-likelihood improves with model size and complexity. The package authors use cross-validation to pick the number of topics. (It was not clear how they did that, however.)
n.topics <- 5
lda5 <- LDA(C, n.topics, control = list(seed = 1234))
logLik(lda5)
'log Lik.' -1072540 (df=5001)
n.topics <- 8
lda8 <- LDA(C, n.topics, control = list(seed = 1234))
logLik(lda8)
'log Lik.' -1013195 (df=8001)
n.topics <- 9
lda9 <- LDA(C, n.topics, control = list(seed = 1234))
logLik(lda9)
'log Lik.' -994021.6 (df=9001)
n.topics <- 11
lda11 <- LDA(C, n.topics, control = list(seed = 1234))
logLik(lda11)
'log Lik.' -975162.9 (df=11001)
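To make the comparison systematic, fit a sequence of models in a loop and plot the log-likelihood against the number of topics. (Each call refits the model from scratch, so this takes a little while.)

k.values <- c(5, 8, 9, 10, 11)
ll.k <- sapply(k.values, function(k)     # refit the LDA model for each choice of K
               as.numeric(logLik(LDA(C, k, control=list(seed=1234)))))
plot(k.values, ll.k, type="b", xlab="Number of Topics", ylab="Log-Likelihood")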
Now turn to the wine reviews. Like LSA, topic models take a bag-of-words approach, so we begin by preparing the document-term matrix.
Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
[1] 20508 14
This time remove the stop words.
WineCorpus <- Corpus(VectorSource(Wine$description))
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.') # avoid run-in words
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english"))
Now compute the document-term matrix and the row (ni) and column (mj) marginal totals. This DTM is a little smaller than the prior examples computed from the wine reviews, with fewer types (5,412) than in the LSA analysis.
dtm <- DocumentTermMatrix(WineCorpus)
dtm
<<DocumentTermMatrix (documents: 20508, terms: 5412)>>
Non-/sparse entries: 475966/110513330
Sparsity : 100%
Maximal term length: 15
Weighting : term frequency (tf)
ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj) # for convenience and clarity
The corpus consists of 490,734 tokens (compared to 607,335 tokens with those stop words included).
sum(as.matrix(dtm))
[1] 490734
Rare types remain and will be replaced by the symbol “OOV” as in the LSA analysis. Stop words are generally far too common to fall among these rare types, so the counts here are similar to those in the prior analysis that kept the stop words.
sum(mj==1)
[1] 1819
sum(mj==2)
[1] 656
sum(mj==3)
[1] 365
The following commands are the same as in the LSA analysis.
o <- order(mj, decreasing=TRUE) # biggest to smallest
dtm <- dtm[,o] # permute the columns
mj <- mj[o]
dtm.oov <- dtm[,10 <= mj]
dtm.oov <- cbind(as.matrix(dtm.oov), rowSums(as.matrix(dtm[,mj < 10])))
names.oov <- c(names(mj[10<=mj]), 'OOV')
mj.oov <- c(mj[10<=mj],sum(mj[mj<10]))
ni.oov <- ni
colnames(dtm.oov) <- names.oov
names(mj.oov) <- names.oov
dim(dtm.oov)
[1] 20508 1692
We need to tell the software how many topics to fit. Because the algorithm uses a randomized procedure to initialize its estimates (such as those for the topic distributions), set the seed to be able to reproduce the results. It may also be useful to set the parameter \(\alpha\) that controls the “document complexity”.
n.topics = 8 # Why 8? Eight topics format nicely in the output!
lda <- LDA(dtm.oov, n.topics, control = list(seed = 1234))
lda
A LDA_VEM topic model with 8 topics.
This estimate seems too large, but let’s see what we got. (Setting a smaller value of alpha seems a bit artificial and forces the software to fit a model it does not “like”, and it didn’t seem to help much.)
attributes(lda)$alpha
[1] 49.71126
Use the functions terms and topics to obtain interpretive clues about what the topic model has found. This shows the 10 most probable word types in each of the 8 topics.
terms(lda,10)
And topics gives the composition of the reviews, showing the leading topics in each document. One topic seems to be the “everything” topic, appearing near the top for most of the reviews.
x <- t(topics(lda,3))
x[1:10,]
It would seem that too many “common” words have overwhelmed the topic modeling. One simple remedy in this case is to remove from the analysis the words that show up in almost every document.
C <- as.matrix(dtm.oov)
dim(C)
[1] 20508 1692
prop <- colSums(0 < C)/nrow(C)
tibble(prop=prop) %>%
ggplot(aes(prop)) + geom_histogram() + scale_x_log10(breaks=c(0.01,0.05,0.25,1))
sort(prop, decreasing=TRUE)[1:20]
aromas medium finish entry body fruit full bodied leads
0.9166179 0.8060269 0.5691437 0.4498732 0.4443632 0.4083772 0.3791691 0.3772674 0.3201677
acidity fruity OOV palate dry fade finishes tannins cherry
0.3039789 0.2927638 0.2888141 0.2728691 0.2705286 0.2618490 0.2496587 0.2477082 0.2396626
apple follow
0.2253755 0.2106007
Let’s try topic modeling without those that appear in more than 25% of the documents.
C <- C[,order(prop, decreasing=TRUE)]
C <- C[,-(1:15)]
colnames(C)[1:4]
[1] "finishes" "tannins" "cherry" "apple"
n.topics = 8
lda <- LDA(C, n.topics, control = list(seed = 1234))
lda
A LDA_VEM topic model with 8 topics.
The estimate of \(\alpha\) remains large, but let’s see what we got this time.
attributes(lda)$alpha
[1] 39.18128
Use the functions terms and topics to obtain interpretive clues about what the topic model has found. This shows the 10 most probable word types in each topic. Again, the topics are dominated by the common words.
terms(lda,10)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "ripe" "cherry" "cherry" "spice" "tannins" "cherry" "flavors" "apple"
[2,] "dried" "apple" "yet" "sweet" "apple" "tart" "tannins" "good"
[3,] "cherry" "finishes" "tart" "finishes" "good" "sweet" "oak" "oak"
[4,] "tart" "follow" "light" "tangy" "light" "dryish" "black" "light"
[5,] "berry" "ripe" "tannins" "berry" "flavors" "follow" "dried" "soft"
[6,] "oak" "tangy" "tangy" "round" "finishes" "tannins" "earth" "pear"
[7,] "pepper" "tart" "crisp" "mineral" "dried" "finishes" "follow" "ripe"
[8,] "yet" "pepper" "spice" "soft" "crisp" "accented" "depth" "spice"
[9,] "flavors" "oak" "lemon" "flavors" "oak" "yet" "pear" "dried"
[10,] "apple" "pear" "supple" "nice" "rich" "roasted" "wine" "lemon"
So I will get more aggressive and wipe out more of these common words.
prop <- colSums(0 < C)/nrow(C)
sum(0.10 < prop)
[1] 33
names(prop[0.10 < prop])
[1] "finishes" "tannins" "cherry" "apple" "follow" "flavors" "yet" "tart"
[9] "dried" "oak" "spice" "good" "light" "tangy" "soft" "ripe"
[17] "supple" "rich" "round" "black" "sweet" "nice" "crisp" "depth"
[25] "notes" "pear" "wine" "dryish" "lemon" "balanced" "earth" "berry"
[33] "citrus"
C <- C[,-(1:33)]
Check to make sure I’ve not wiped out an entire document.
any(rowSums(C)==0)
[1] FALSE
Okay, now fit the topic model.
n.topics = 8
lda <- LDA(C, n.topics, control = list(seed = 1234))
attributes(lda)$alpha
[1] 31.51749
terms(lda,10)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
[1,] "slightly" "lively" "pair" "silky" "pepper" "peach" "pepper"
[2,] "moderately" "concentration" "accented" "spicy" "earthy" "silky" "accented"
[3,] "lively" "cedar" "pepper" "mineral" "cedar" "moderately" "vanilla"
[4,] "chocolate" "earthy" "tannin" "custard" "skin" "touch" "chocolate"
[5,] "pie" "tannin" "long" "melon" "mineral" "spicy" "baked"
[6,] "cedar" "herbal" "melon" "baked" "lively" "apricot" "roasted"
[7,] "long" "nicely" "moderately" "vanilla" "creamy" "accented" "nut"
[8,] "red" "varietal" "firm" "roasted" "baked" "peel" "great"
[9,] "pineapple" "touch" "earthy" "pineapple" "red" "baked" "delicate"
[10,] "nut" "drink" "well" "green" "currant" "varietal" "slightly"
Topic 8
[1,] "firm"
[2,] "chocolate"
[3,] "spicy"
[4,] "accented"
[5,] "earthy"
[6,] "long"
[7,] "peel"
[8,] "pie"
[9,] "plum"
[10,] "cedar"
Perhaps better, but it seems there just aren’t very many well-separated topics in these reviews. So look for fewer, like 2.
n.topics = 2
lda <- LDA(C, n.topics, control = list(seed = 1234))
attributes(lda)$alpha
[1] 39.9038
terms(lda,10)
Topic 1 Topic 2
[1,] "spicy" "concentration"
[2,] "chocolate" "lively"
[3,] "accented" "earthy"
[4,] "baked" "skin"
[5,] "moderately" "cedar"
[6,] "table" "tannin"
[7,] "pepper" "moderate"
[8,] "vanilla" "mineral"
[9,] "silky" "touch"
[10,] "long" "pepper"
Perhaps we should not be too surprised that topic models don’t seem well matched to these data. Recall the difference between the spectrum of the simulated data (where the number of topics was clear) and that of the wine data. There just is not a clearly defined set of well-separated topics.
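To check that claim directly, you can repeat the CCA-scaled SVD from the simulation on the wine DTM and compare the two spectra. (A sketch, assuming dtm.oov, ni, and mj.oov from above are still in the workspace; the SVD of a 20,508 by 1,692 matrix takes a little while.)

W <- as.matrix(dtm.oov)
W.cca <- W / sqrt(pmax(1, ni))                # scale rows by sqrt of document length
W.cca <- t(t(W.cca) / sqrt(pmax(1, mj.oov)))  # scale columns by sqrt of type frequency
udv.wine <- svd(W.cca)
plot(udv.wine$d[1:100], log="xy", xlab="Component", ylab="Singular Value")

If the wine spectrum decays smoothly with no visible elbow, that supports the conclusion above.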