This notebook illustrates the use of vector space methods in R. These methods manipulate the document-term matrix and can be used to find word embeddings.

Setup R

The methods in this notebook add another package to the standard list.

require(tm)
Loading required package: tm
Loading required package: NLP
require(wordcloud)
Loading required package: wordcloud
Loading required package: RColorBrewer
require(stringr)        # not in the tidyverse
Loading required package: stringr
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------------------------------
annotate(): ggplot2, NLP
filter():   dplyr, stats
lag():      dplyr, stats
source("text_utils.R")  # from web page

Prepare document-term matrix

Read the wine data from its CSV file. Rather than do this every time, it is generally a good idea to save the “processed” data in a “.sav” file.

Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
[1] 20508    14
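
As suggested above, one can cache the processed data and skip the CSV parsing in later sessions; a minimal sketch using base R's save and load (the Wine.sav path is illustrative):

save(Wine, file = "../data/Wine.sav")   # write the processed data once
load("../data/Wine.sav")                # later sessions restore Wine directly

Next, build the corpus and normalize the text.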
WineCorpus <- Corpus(VectorSource(Wine$description))
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.')     # otherwise words run together; dot is a regex metacharacter
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
# WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english"))  # leave for now

Now compute the document-term matrix and the row \(n_i\) and column \(m_j\) marginal counts. The DTM is a little smaller here, with fewer types (5,488) than in the first slides, because of the different handling of commas. We will be making it smaller still.

dtm <- DocumentTermMatrix(WineCorpus)
dtm
<<DocumentTermMatrix (documents: 20508, terms: 5488)>>
Non-/sparse entries: 545777/112002127
Sparsity           : 100%
Maximal term length: 15
Weighting          : term frequency (tf)
ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj)   # for convenience and clarity
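
Converting the sparse DTM to a dense matrix works at this scale but is wasteful of memory for larger corpora. A sketch of a sparse alternative, assuming the slam package (which tm uses internally) is available:

require(slam)                # sparse matrix utilities used by tm
ni <- row_sums(dtm)          # same counts, without densifying the DTM
mj <- col_sums(dtm)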

As usual, check the name of the longest type for possible errors. This one is okay.

word.types[j <- which.max(str_length(word.types))]
[1] "extraordinarily"

The corpus consists of 607,355 tokens.

sum(as.matrix(dtm))
[1] 607355
sum(mj)
[1] 607355
sum(ni)
[1] 607355

Many of the types are rare, appearing only a handful of times; together these rare types account for a tiny share of the tokens.

sum(mj==1)
[1] 1827
sum(mj==2)
[1] 660
sum(mj==3)
[1] 367
sum(mj[3<mj])
[1] 603107
sum(mj[mj<=3])
[1] 4248
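
When it is time to shrink the vocabulary, these rare types are the natural candidates to drop. A minimal sketch that removes types appearing three or fewer times (leaving dtm itself unchanged here):

dtm.trimmed <- dtm[, which(3 < mj)]   # keep only types appearing four or more times
dim(dtm.trimmed)

This keeps 2,634 of the 5,488 types. tm's removeSparseTerms offers similar pruning, by sparsity rather than by raw count.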

tm has the function findFreqTerms to extract the most frequent terms in the DTM (not that this is hard to do directly). Start with a high threshold to avoid listing too many.

findFreqTerms(dtm,lowfreq=5000)
 [1] "and"      "aromas"   "bodied"   "dry"      "finish"   "medium"   "palate"   "with"    
 [9] "acidity"  "cherry"   "entry"    "fruit"    "full"     "leads"    "tannins"  "this"    
[17] "apple"    "fruity"   "finishes" "body"     "fade"    

Bar charts are easy to construct. This one shows the “Zipf” relationship rather clearly (at least when the stop words have been included). The function tibble constructs a tidy data frame.

tibble(word=names(mj), frequency=mj)      %>%
    top_n(25,frequency)                   %>%
    mutate(word=reorder(word, frequency)) %>%
    ggplot(aes(word,frequency)) +
    geom_col() + coord_flip()

You can also draw word clouds to summarize the most common types; eye candy can be useful to attract attention (though it makes it difficult to compare the frequencies… quick, which is the fifth most common word?). Don’t try to show too many words. Removing stop words would be very useful in this case.

require(wordcloud)
set.seed(133)  # random locations; fix the seed to be able to reproduce
wordcloud(names(mj), mj, max.words=50)
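
A sketch of the same cloud with the English stop words from tm removed:

keep <- !(names(mj) %in% stopwords("english"))   # drop common function words
wordcloud(names(mj)[keep], mj[keep], max.words=50)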

The function zipf_plot from the helper file \({\tt text\_utils.R}\) draws the Zipf plot. By default, it fits a least squares line to the first 250 frequencies.

zipf_plot(mj)

Call:
lm(formula = ly ~ lx, data = df[1:min(n.fit, nrow(df)), ])

Coefficients:
(Intercept)           lx  
    11.2897      -0.9475
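
The fitted slope of \(-0.9475\) is close to the value \(-1\) predicted by Zipf’s law. For reference, here is a minimal sketch of the idea behind the plot (the helper itself may differ in detail): plot log frequency against log rank and fit a line to the 250 most frequent types.

freq <- sort(mj, decreasing = TRUE)
df   <- tibble(lx = log(seq_along(freq)), ly = log(freq))   # log rank, log frequency
ggplot(df, aes(lx, ly)) + geom_point() +
    geom_smooth(method = "lm", data = df[1:250, ], se = FALSE)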