The methods in this notebook require the `topicmodels` package in addition to the standard list. (A paper accompanied the release of this package; see Grün, B. and Hornik, K. (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1–30.)

`require(tm)`

```
Loading required package: tm
Loading required package: NLP
```

`require(topicmodels)`

`Loading required package: topicmodels`

`require(stringr)`

`Loading required package: stringr`

`require(tidyverse)`

```
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------------------------------
annotate(): ggplot2, NLP
filter(): dplyr, stats
lag(): dplyr, stats
```

`source("text_utils.R")`

You can learn a lot about what a topic model does by using it to simulate text, which we can then study with the methods covered previously. Here is a compact summary of the algorithm.

Given \(\alpha\) and \(K\) topics, generate each document in the simulated corpus as follows:

1. Draw topic proportions \(\theta \mid \alpha \sim \mathrm{Dir}(\alpha)\) for the document. These are the probabilities used to sample a topic for each word in the next step.
2. For each word \(w_i\) in the document:
    (a) Draw the topic assignment \(z_i \mid \theta \sim \mathrm{Mult}(\theta)\); \(z_i\) indicates the topic.
    (b) Draw the word \(w_i \mid z_i, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_i})\), where \(\beta_{z_i}\) is the distribution over the vocabulary for topic \(z_i\).
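The two-step draw above can be sketched directly in base R. This is a minimal illustration of the generative process, not the notebook's own code; the topic–word matrix `beta` is assumed to have one row of word probabilities per topic, and the Dirichlet draw is built from normalized gamma variates.

```
# Simulate one document of n words from the LDA generative process.
# alpha: Dirichlet concentration; beta: K x V matrix, row k holding the
# word probabilities for topic k. Returns a vector of word indices in 1..V.
simulate_document <- function(n, alpha, beta) {
  K <- nrow(beta)
  g <- rgamma(K, shape=alpha)                  # Dir(alpha) via gammas...
  theta <- g / sum(g)                          # step 1: topic proportions
  z <- sample(K, n, replace=TRUE, prob=theta)  # step 2a: topic of each word
  sapply(z, function(k)                        # step 2b: word given its topic
    sample(ncol(beta), 1, prob=beta[k, ]))
}
```

Repeated calls with the same `beta` but fresh `theta` produce a corpus in which documents share topics but mix them in different proportions.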

To illustrate the role of the parameter \(\alpha\) and how it affects the probabilities \(\theta\) for a document, the function `rdirichlet` (from `text_utils.R`) simulates these draws.

Each draw is a discrete probability distribution over the categories. The function takes two arguments: alpha and the number of groups. Set the number of topics \(K = 10\). If alpha is small, the probability concentrates in a single topic; documents will then be nearly pure, with all words drawn from one topic, or perhaps two.
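For reference, a symmetric Dirichlet draw with this two-argument signature can be generated from gamma variates. This is only a sketch of what such a function might look like; the actual implementation in `text_utils.R` may differ, so it is named differently here to avoid masking the sourced version.

```
# One draw from Dir(alpha, ..., alpha) over n.groups categories.
rdirichlet_sketch <- function(alpha, n.groups) {
  g <- rgamma(n.groups, shape=alpha)
  g / sum(g)   # normalized independent gammas are Dirichlet distributed
}
```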

```
alpha <- 0.05
K <- 10 # number of topics
par(mfrow=c(2,2))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
```

As alpha increases, the distribution over topics becomes more diffuse.

```
alpha <- 0.5
par(mfrow=c(2,2))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
```

```
alpha <- 5
par(mfrow=c(2,2))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
plot(rdirichlet(alpha,K)); plot(rdirichlet(alpha,K))
```

Consequently, small values of \(\alpha\) imply pure documents drawn from very few topics, whereas larger values of \(\alpha\) produce "blurry" documents that mix the topics.
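This purity claim can be checked numerically by averaging the weight of the dominant topic over many draws. A quick sketch, using `rdirichlet` from `text_utils.R`; the helper `purity` is ad hoc, not part of the notebook's utilities:

```
# Average weight of the largest topic proportion over B draws.
# Values near 1 indicate nearly pure documents; values near 1/K
# indicate documents that mix all the topics.
purity <- function(alpha, K=10, B=1000)
  mean(replicate(B, max(rdirichlet(alpha, K))))
purity(0.05)   # near 1: one topic dominates
purity(5)      # much smaller: topics are mixed
```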

Another important characteristic of the model is how distinct the topics themselves are. Do topics have overlapping words, or are they mutually exclusive? For these simulations, the constant \(\alpha_P\) controls this property: small \(\alpha_P\) produces essentially distinct topics, whereas larger values yield more words shared across topics.

```
n.vocab <- 1000 # size of vocabulary
P <- matrix(0, nrow=K, ncol=n.vocab) # dist over words for each topic
alpha.P <- 0.05 # small alpha implies less overlap [ 0.05 0.10 ]
set.seed(6382)
for(i in 1:K) P[i,] <- rdirichlet(alpha.P,n.vocab)
P <- P[,order(colSums(P), decreasing=TRUE)] # sort so common types are first
rowSums(P) # check that each sums to 1
```

` [1] 1 1 1 1 1 1 1 1 1 1`

Here are some examples. (Using the square roots of the probabilities shows a bit more of the variation; without it there’s a blob near zero.)

```
par(mfrow=c(1,2))
plot(P[1,], xlab="Vocabulary", ylab="Probability") # topic dist
plot(sqrt(P[1,]),sqrt(P[2,]), # disjoint if alpha.P = 0.01, some common if .1
xlab=expression(sqrt("P"[1])),ylab=expression(sqrt("P"[2])))
```
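The overlap visible in the scatterplot can also be quantified as the shared probability mass between two topic distributions. A small sketch, applied to the matrix `P` constructed above; the helper `overlap` is ad hoc, not from `text_utils.R`:

```
# Shared mass between two discrete distributions over the same vocabulary:
# 0 means disjoint support, 1 means identical distributions.
overlap <- function(p, q) sum(pmin(p, q))
overlap(P[1, ], P[2, ])   # near 0 when alpha.P is small
```

Computing this for every pair of rows of `P` would show how the choice of \(\alpha_P\) moves the whole set of topics between distinct and overlapping.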