This notebook illustrates the use of vector space methods in R. These methods manipulate the document-term matrix and can be used to find word embeddings.
The methods in this notebook add another package to the standard list.
require(tm)
Loading required package: tm
Loading required package: NLP
require(wordcloud)
Loading required package: wordcloud
Loading required package: RColorBrewer
require(stringr) # not in the tidyverse
Loading required package: stringr
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------------------------------
annotate(): ggplot2, NLP
filter(): dplyr, stats
lag(): dplyr, stats
source("text_utils.R") # from web page
Read the wine data from its CSV file. Rather than do this every time, it is generally a good idea to save the “processed” data in a “.sav” file.
Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
[1] 20508 14
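As mentioned above, one way to cache the processed data so later sessions can skip parsing the CSV is to save the data frame and reload it. This is just a sketch; the file name is illustrative.
save(Wine, file="../data/Wine.sav")   # write once after processing
# load("../data/Wine.sav")            # reload in later sessions instead of read_csv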
WineCorpus <- Corpus(VectorSource(Wine$description))
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.') # otherwise runs together; dot is special regex
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
# WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english")) # leave for now
Now compute the document term matrix and the row \(n_i\) and column \(m_j\) marginal counts. The DTM is a little smaller here, with fewer types (5,488) than in the first slides because of handling the comma differently. We will be making it smaller still.
dtm <- DocumentTermMatrix(WineCorpus)
dtm
<<DocumentTermMatrix (documents: 20508, terms: 5488)>>
Non-/sparse entries: 545777/112002127
Sparsity : 100%
Maximal term length: 15
Weighting : term frequency (tf)
ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj) # for convenience and clarity
As usual, check the name of the longest type for possible errors. This one is okay.
word.types[j <- which.max(str_length(word.types))]
[1] "extraordinarily"
The corpus consists of 607,355 tokens.
sum(as.matrix(dtm))
[1] 607355
sum(mj)
[1] 607355
sum(ni)
[1] 607355
Many types are rare, represented by only a handful of tokens.
sum(mj==1)
[1] 1827
sum(mj==2)
[1] 660
sum(mj==3)
[1] 367
sum(mj[3<mj])
[1] 603107
sum(mj[mj<=3])
[1] 4248
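A quick way to see the whole frequency-of-frequencies distribution at once is to tabulate the column counts (a one-line sketch):
table(mj)[1:5]   # how many types occur with each of the 5 smallest frequencies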
tm has the function findFreqTerms to extract the most frequent terms in the DTM (not that this is hard to do directly). Start with a high threshold to avoid getting too many.
findFreqTerms(dtm,lowfreq=5000)
[1] "and" "aromas" "bodied" "dry" "finish" "medium" "palate" "with"
[9] "acidity" "cherry" "entry" "fruit" "full" "leads" "tannins" "this"
[17] "apple" "fruity" "finishes" "body" "fade"
Bar charts are easy to construct. This one shows the “Zipf” relationship rather clearly (at least when the stop words have been included). The function tibble constructs a tidy data frame.
tibble(word=names(mj), frequency=mj) %>%
top_n(25,frequency) %>%
mutate(word=reorder(word, frequency)) %>%
ggplot(aes(word,frequency)) +
geom_col() + coord_flip()
You can also draw word clouds to summarize the most common types; eye candy can be useful to attract attention (though it makes it difficult to compare the frequencies… quick, which is the 5th most common word?). Don’t try to show too many words. Removing stop words would be very useful in this case.
require(wordcloud)
set.seed(133) # random locations; fix the seed to be able to reproduce
wordcloud(names(mj), mj, max.words=50)
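As suggested, dropping the stop words gives the more interesting types room. A sketch, assuming the standard English stop-word list from tm:
keep <- !(names(mj) %in% stopwords("english"))
wordcloud(names(mj)[keep], mj[keep], max.words=50)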
The function zipf_plot from the helper file \({\tt text\_utils.R}\) shows the Zipf plot. By default, it fits a least squares line to the first 250 frequencies.
zipf_plot(mj)
Call:
lm(formula = ly ~ lx, data = df[1:min(n.fit, nrow(df)), ])
Coefficients:
(Intercept) lx
11.2897 -0.9475
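The helper is just a convenience. A minimal sketch of the underlying calculation (not necessarily the helper’s exact code) plots log frequency against log rank and fits a least squares line to the leading frequencies:
freq <- sort(as.numeric(mj), decreasing=TRUE)
df <- tibble(lx=log(seq_along(freq)), ly=log(freq))
plot(df$lx, df$ly, xlab="Log Rank", ylab="Log Frequency")
abline(lm(ly ~ lx, data=df[1:250,]))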
Spectral methods rely on word co-occurrences within some context. The context might be defined by adjacency (bigrams, n-grams) or, in this case, appearing in the same wine review. To make it simpler to separate the common types from the rare types, sort the DTM by the type frequencies. This calculation uses the function order. Here’s an example: order returns the indices that will sort an object.
x <- c('d','f','a','c','b')
o <- order(x); o
[1] 3 5 4 1 2
x[o]
[1] "a" "b" "c" "d" "f"
Now apply order to the frequencies of the word types.
o <- order(mj, decreasing=TRUE) # biggest to smallest
dtm <- dtm[,o] # permute the columns
mj <- mj[o]
Now the first types are the most common types.
mj[1:10]
and with aromas medium finish entry fruit body full bodied
53906 28665 18956 16635 11835 9235 9136 9117 7950 7741
Here are some of the least common types. If you explore these rare types, you will discover a mixture of interesting words (given that these are wine reviews) along with some junk (such as misspelled words, typos, numbers, and dates).
mj[length(mj)-9:0]
uniquely covereed picatta sodas withered verde vinho colossally utility
1 1 1 1 1 1 1 1 1
mary
1
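To browse a few of the singletons at random (a sketch; the seed is arbitrary):
set.seed(6)
sample(names(mj)[mj == 1], 15)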
We are not going to learn much about the usage of words that appear so seldom, so set these all to the symbol OOV, short for out-of-vocabulary. We don’t need to do that in the text itself, just in the document-term matrix. I will keep the types that appear at least 10 times (as JMP used). That reduces the matrix to 1,742 types. (BTW, tm includes the function removeSparseTerms that will wipe out the rare terms from the DTM. I don’t want to wipe them out; I want to consolidate them.)
dtm.oov <- dtm[,10 <= mj]
dtm.oov
<<DocumentTermMatrix (documents: 20508, terms: 1742)>>
Non-/sparse entries: 536404/35188532
Sparsity : 98%
Maximal term length: 15
Weighting : term frequency (tf)
Now, what to do about the OOVs? Maybe having a lot of them tells us something about the other words, so I’ll append a column that counts the number of these in each document. Doing so turns the document-term matrix into a regular numerical matrix, but that’s okay – we need such a matrix for computing the SVD.
dtm.oov <- cbind(as.matrix(dtm.oov), rowSums(as.matrix(dtm[,mj < 10])))
dim(dtm.oov)
[1] 20508 1743
names.oov <- c(names(mj[10<=mj]), 'OOV')
mj.oov <- c(mj[10<=mj],sum(mj[mj<10]))
ni.oov <- ni # the same as it was
colnames(dtm.oov) <- names.oov
names(mj.oov) <- names.oov
Now check that we have not lost any tokens.
sum(dtm.oov)
[1] 607355
sum(mj.oov)
[1] 607355
sum(ni.oov)
[1] 607355
We will need a regular R matrix in the following calculations, so convert the special sparse matrix of counts into a regular R matrix using as.matrix. The matrix dtm.oov is already in the needed form.
The SVD and latent semantic analysis rely on associations of words in documents: which words appear together in the same document. We can look at correlations individually to see what sort of information lies in these associations. We can explore the associations between word types with the tm function findAssocs.
findAssocs(dtm,'zinfandel', corlimit=0.1)
$zinfandel
creek pushed unashamedly heavenly slathered fetzer mesquite wire
0.25 0.14 0.14 0.14 0.14 0.14 0.14 0.14
marsala spareribs raspberry textbook accompany
0.12 0.12 0.11 0.11 0.10
findAssocs(dtm,'pinot', corlimit=0.1)
$pinot
noir grigio gris oregon duck cardamom russian fin noirs found breast
0.64 0.33 0.29 0.17 0.15 0.13 0.13 0.12 0.12 0.12 0.11
tuna coq
0.10 0.10
Paso Robles is a place, so the two words “paso” and “robles” always occur together. Hence, the correlation of these types is 1.
findAssocs(dtm,'paso', corlimit=0.1)
$paso
robles central venison casserole spareribs south underrated such monster
1.00 0.29 0.22 0.22 0.21 0.20 0.18 0.17 0.16
breast california accompany region blended goose from cheddar marsala
0.14 0.13 0.13 0.12 0.12 0.12 0.11 0.11 0.10
buco
0.10
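A direct check of that claim is to pull the two term columns out of the DTM and correlate them (a sketch):
jj <- which(colnames(dtm) %in% c("paso","robles"))
cor(as.matrix(dtm[,jj]))   # off-diagonal entries should be 1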
Before computing the singular value decomposition (SVD), it is a good idea to do some normalization. The “correct” normalization is rather time consuming (requiring the inverse of a large matrix, which is tough to do in big-data situations), but the following approximation works nicely in practice.
The calculation of the SVD itself is also rather slow. It’s a simple function call in R, but one that takes a very long time to complete. In the interest of time, we’ll compute the SVD using a subset of 3,000 documents. In a bit, I will show you a magic trick that scales better.
set.seed(2373) # so can reproduce
i.svd <- sample(nrow(dtm.oov), 3000)
dtm.svd <- dtm.oov[i.svd,]
ni.svd <- rowSums(dtm.svd) # number of words in a document, its length
mj.svd <- pmax(1,colSums(dtm.svd)) # frequency of word type in vocabulary (avoid 0 divisor)
The minimum frequency in the original DTM is 10.
min(mj.oov)
[1] 10
After sampling, however, the count for a type can be zero. (That’s why pmax is used above… we don’t want a zero divisor.)
min(mj.svd) # even though we started with a minimum count of 10
[1] 1
This normalization is almost a standardization to correlations… but a lot faster.
dtm.svd <- dtm.svd/sqrt(ni.svd) # recycling divides each row i by sqrt(ni)
dtm.svd <- t( t(dtm.svd)/sqrt(mj.svd) ) # and each column j by sqrt(mj)
Now we can compute the SVD of the scaled matrix of counts, \(C_{ij}/\sqrt{n_i m_j}\).
udv <- svd(dtm.svd) # returns u, d, v
names(udv)
[1] "d" "u" "v"
length(udv$d) # singular values
[1] 1743
udv$d[1:4] # normalization implies first = 1
[1] 1.0000000 0.6448068 0.5553221 0.4768119
Remember to combine the variables into a data frame (here, a tibble) in order to use ggplot.
tibble(i=1:50, d=udv$d[1:50]) %>%
ggplot(aes(i,d)) +
geom_point() + scale_x_log10() +
labs(title="Spectrum with CCA Scaling", x="Dimension", y="Singular Value")
The rows of the \(U\) matrix identify documents, so think of these as “new variables” that describe the documents. The rows of \(V\) identify word types and so provide an “interpretation” of the \(U_j\) columns. The elements of \(U\) sometimes reveal clusters of related documents, whereas the elements of \(V\) reveal the components of these new variables.
As in regular principal components analysis, the first component in these data captures the number of words (which happens to be an important variable in the wine analysis).
plot(ni[i.svd], udv$u[,1], xlab="Number of Tokens", ylab=expression("U"[1]))
In this example, clusters are easy to see. I don’t like the various alternatives to pairs offered by the Tidy collection, so it’s back to regular R graphics. (Why not? Suppose I don’t want the colors to be shown in the matrix of plots, or I’d like to use expressions to label the variables.)
set.seed(234)
ii <- sample(nrow(udv$u), 500) # fewer points
pairs(udv$u[ii,2:5],
labels=c(expression("U"[2]), expression("U"[3]), # subscripts in plot label
expression("U"[4]), expression("U"[5])) )
It would be neat if those clusters were associated with the colors of the wines. The following plot shows that is not the case.
color <- tolower(Wine$color[i.svd][ii])
color[color=='white'] <- 'gold' # white does not show up well!
pairs(udv$u[ii,2:5], col=color,
labels=c(expression("U"[2]), expression("U"[3]), # subscripts in plot label
expression("U"[4]), expression("U"[5])) )
Although the obvious clusters defined by these new variables do not correspond to the wine color, it is easy to see how we can use \(U_4\) and perhaps \(U_5\) to separate red from white wines. Rather than speculate, we can simply fit a logistic regression.
j <- 1:10
U <- udv$u[,j]
y <- ifelse(Wine$color[i.svd] == 'Red',1,0)
lregr <- glm(y ~ U, family=binomial, na.action=na.exclude)
summary(lregr)
Call:
glm(formula = y ~ U, family = binomial, na.action = na.exclude)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1499 -0.0343 0.0038 0.0726 3.2551
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6021 1.3334 -0.452 0.65161
U1 90.0483 74.8393 1.203 0.22889
U2 -2.3739 7.9956 -0.297 0.76655
U3 -29.9632 6.7501 -4.439 9.04e-06 ***
U4 -383.3268 23.3412 -16.423 < 2e-16 ***
U5 81.7788 14.9954 5.454 4.94e-08 ***
U6 157.3792 10.9946 14.314 < 2e-16 ***
U7 -18.7985 7.2497 -2.593 0.00951 **
U8 89.7537 11.2374 7.987 1.38e-15 ***
U9 59.7435 8.9418 6.681 2.37e-11 ***
U10 -16.9630 12.4151 -1.366 0.17184
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3441.04 on 2539 degrees of freedom
Residual deviance: 421.57 on 2529 degrees of freedom
(460 observations deleted due to missingness)
AIC: 443.57
Number of Fisher Scoring iterations: 9
These predictions separate the wine types nicely.
data_frame(fit=fitted.values(lregr), color=Wine$color[i.svd]) %>%
ggplot(aes(x=fit,color=color)) + geom_freqpoly()
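A cruder summary of the same separation is a confusion table at a 0.5 cutoff (a sketch; documents with a missing fit or color drop out of the table):
pred <- ifelse(0.5 < fitted.values(lregr), "Red", "White")
table(predicted=pred, actual=Wine$color[i.svd])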
But what are these variables that so clearly separate the wines? Once you see the names of the components of \(V_4\) it becomes clear why this variable works to separate the two types of wines.
First attach names. Notice that the names of the rows of \(V\) match the names of the columns of \(C\), the document term matrix.
V <- udv$v[,1:25]
rownames(V) <- colnames(dtm.svd)
Now draw a plot of the “loadings” (as they are called in factor analysis).
plot(V[,4], V[,5], xlab= expression("V"[4]), ylab=expression("V"[5]), col='gray')
label <- 0.05 < sqrt(V[,4]^2 + V[,5]^2) # pick those far from the origin
text(V[label,4], V[label,5], rownames(V)[label], cex=0.8, srt=45)
The function plot_loadings from the collection of R scripts in the file \(\tt text\_utils.R\) automates this task. For example, there’s little evident reason to think that \(V_2\) and \(V_3\) would help distinguish the wine’s color.
plot_loadings(V, 2, 3, threshold=0.05, cex=0.8)
Before looking at scaling these methods up to larger amounts of data, what are those obvious clusters in the data? They are not related to color and, skimming the key words, they don’t seem related to the variety either (e.g., cabernet versus zinfandel).
This plot reveals the answer. Gee, what do you think happened?
plot(i.svd, U[,2], xlab="Document Position", ylab=expression("U"[2]))
Here are several descriptions in the “early” phase of the data.
Wine$description[i.svd[U[,2]< -0.025]][1:5]
[1] "Aromas of banana cream pie, praline, and spicy poached pear follow through on a soft, supple entry to a dryish medium body with soft, quince and Meyer lemon cream notes. Finishes with a soft, melon and apple fade. Nice purity fruity and elegance, in an understated table friendly style."
[2] "Interesting aromas of buttercream, circus peanut, and pistachio brittle follow through on a very soft, gentle entry to a dryish medium body with cherry and green apple core notes. Finishes with a crisp, lemon and mild daikon radish accented fade. Different."
[3] "Toasty rice pudding and baked nectarine aromas follow through on a soft supple entry to a dry-yet-fruity medium body with honeydew lime and smoky mineral notes. Finishes with a tangy toasted rice pilaf and like fade. Interesting."
[4] "Aromas of cotton candy and strawberries on angel food cake follow through on a lightly spritzy entry to a dryish medium body with tart green apples and peach skin notes. Finishes with a very crisp stony mineral, lemon, and praline accented fade. Pair with sautéed scallops with brown butter sauce."
[5] "Cocoa dusted craisins, potter's clay, and raisin-fig chutney follow through on a lively, smooth entry to a tangy medium-to-full body with tangerine peel, menthol, and pomegranate accents. Finishes with a tart, refreshing cranberry chutney-like fade. A fun exciting to pair with spicy Asian pork or tun dishes."
And some of those from the other period. Do you notice any differences in the “style” of the review?
Wine$description[i.svd[U[,2]> 0.025]][1:5]
[1] "Subdued, minerally nose. A lean entry leads to a drying, medium-bodied palate with grippy tannins. Structured and austere. Drink now."
[2] "Ripe black fruit aromas have a lavish-but-integrated chocolatey oak accent. A lush entry leads a rounded, full-bodied palate with velvety tannins. Drying, lengthy, balanced finish. An extremely well made PV that will work beautifully with roasted meats."
[3] "Perfumed, gamey dried fruit and leather aromas show a traditional range of complexities. A rich entry leads to a supple, moderately full-bodied palate with chewy tannins. A big, rounded style with a touch of wood on the grippy finish. Near- to mid-term cellar."
[4] "Austere, high toned mineral and citrus peel nose. A rich entry leads to a weighty, drying, moderately full-bodied palate with a ripe center. A very traditional style with good cut. Will blossom with mid-term cellaring."
[5] "Heavily Botrytised, heady caramel and molasses nose. A sharp entry leads to a racy, medium-bodied palate with intense sweetness and extremely concentrated dried fruit flavors. Quite youthful. This really needs mid-term cellaring and has the structure to age for two or three decades. Drink in small measures."
We could use these variables to predict the colors of the other wines, but it would be nice to be able to compute the SVD for all of the data rather than just these 3000.
We can approximate the SVD of the entire document-term matrix using the method known as random projection. The key to the method is that if you only want, say, 300 singular vectors, you can find them via random matrix multiplications. The function random_projection_svd (also in the \(\tt text\_utils.R\) file) implements this algorithm.
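The idea is simple enough to sketch in a few lines. This is illustrative only; the helper in \(\tt text\_utils.R\) may differ in details such as scaling and convergence checks.
rp_svd_sketch <- function(A, d, power.iter=3) {
  A <- as.matrix(A)
  Omega <- matrix(rnorm(ncol(A)*d), ncol(A), d)  # random test matrix
  Y <- A %*% Omega                               # sample the column space of A
  for(k in seq_len(power.iter))                  # power iterations sharpen the subspace
    Y <- A %*% (t(A) %*% Y)
  Q <- qr.Q(qr(Y))                               # orthonormal basis for the range of A
  B <- t(Q) %*% A                                # small d x p problem
  s <- svd(B)
  list(d=s$d, u=Q %*% s$u, v=s$v)                # rotate back to the document space
}
# e.g., rp <- rp_svd_sketch(dtm.svd, d=50)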
To try to convince you it’s not magic, let’s compute the SVD via random projection and compare the results to what we get from svd. Random projection is quite a bit faster than the built-in function. The more power iterations, the better, particularly for the terms associated with smaller singular values.
set.seed(4843) # results depend on seed
rp_udv <- random_projection_svd(dtm.svd, d=50, power.iter=3, verbose=TRUE)
Starting power iterations...
Completed power iteration...
Completed power iteration...
Completed power iteration...
Completed QR factorization...
Like svd, this function returns a named list of results.
names(rp_udv)
[1] "d" "u" "v"
The larger singular values almost match those from the “exact” calculation.
plot(rp_udv$d, udv$d[1:50], log='xy',main="Singular Values",
xlab="Random Projection", ylab="Exact")
abline(a=0,b=1)
As do the principal components.
plot(rp_udv$u[,4], udv$u[,4], main="Principal Component",
xlab="Random Projection", ylab="Exact")
abline(a=0,b=1)
And the loadings.
plot(rp_udv$v[,4], udv$v[,4], main="Loadings",
xlab="Random Projection", ylab="Exact")
abline(a=0,b=1)
To get the full decomposition, use dtm.oov rather than the sampled version (dtm.svd).
set.seed(4843) # results depend on seed
rp_udv <- random_projection_svd(dtm.oov, d=50, power.iter=3, verbose=TRUE)
Starting power iterations...
Completed power iteration...
Completed power iteration...
Completed power iteration...
Completed QR factorization...