This notebook illustrates the use of vector space methods in R. These manipulate the document-term matrix and can be used to find word embeddings.
The methods in this notebook add another package to the standard list.
source("text_utils.R") # from web page
Read the wine data from its CSV file. Rather than do this every time, it is generally a good idea to save the “processed” data in a “.sav” file.
Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
[1] 20508 14
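As a minimal sketch of the caching idea mentioned above, the following uses base R's `saveRDS` and `readRDS`. The file name and the small stand-in data frame are hypothetical, and the notebook itself may well use `save`/`load` with a ".sav" file instead.

```r
# Hypothetical stand-in for the processed wine data
processed <- data.frame(description = c("ripe cherry fruit", "crisp and dry"),
                        alcohol = c(13.5, 12.0),
                        stringsAsFactors = FALSE)

# Hypothetical cache location; a temporary file keeps the sketch self-contained
cache_file <- file.path(tempdir(), "wine_processed.rds")

if (!file.exists(cache_file)) {
  saveRDS(processed, cache_file)   # write the processed data once
}
restored <- readRDS(cache_file)    # later runs read the cached copy
```

In a real session the cache would live next to the data rather than in `tempdir()`, so that it survives between R sessions.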
WineCorpus <- Corpus(VectorSource(Wine$description))
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.') # otherwise runs together; dot is special regex
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
# WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english")) # leave for now
Now compute the document-term matrix (DTM) and the row \(n_i\) and column \(m_j\) marginal counts. The DTM is a little smaller, with fewer types – 5,488 – here than in the first slides because of handling the comma differently. We will be making it smaller still.
dtm <- DocumentTermMatrix(WineCorpus)
ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj) # for convenience and clarity
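To make the two margins concrete, here is a small sketch on a made-up matrix. The toy counts and the names `toy_dtm`, `toy_ni`, and `toy_mj` are illustrative assumptions, not part of the wine data.

```r
# Toy document-term matrix: 3 documents (rows) by 4 word types (columns)
toy_dtm <- matrix(c(2, 0, 1, 0,
                    0, 1, 1, 3,
                    1, 1, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Docs  = c("1", "2", "3"),
                                  Terms = c("cherry", "dry", "oak", "wine")))

toy_ni <- rowSums(toy_dtm)   # tokens in each document
toy_mj <- colSums(toy_dtm)   # occurrences of each type over all documents

# Both margins add up to the total number of tokens
total_tokens <- sum(toy_dtm)
```

For a large sparse DTM, `slam::row_sums` and `slam::col_sums` compute the same margins without first converting to a dense matrix, which may be worth considering if memory is tight.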
As usual, check the name of the longest type for possible errors. This one is okay.
word.types[j <- which.max(str_length(word.types))]
[1] "extraordinarily"
The tm package has the function findFreqTerms to extract the most frequent terms in the DTM (not that this is hard to do directly). Start with a high threshold to avoid too many.
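The "direct" route can be sketched in base R by thresholding a vector of column counts such as `mj`. The counts and the threshold below are made up for illustration; with the real DTM, a call along the lines of `findFreqTerms(dtm, lowfreq = ...)` would play the same role.

```r
# Hypothetical column counts, standing in for mj
toy_counts <- c(the = 120, wine = 95, cherry = 40, oak = 12, flinty = 3)

# Keep the types whose total count meets a (hypothetical) threshold
threshold <- 40
frequent <- names(toy_counts)[toy_counts >= threshold]

# Show the surviving types from most to least frequent
sort(toy_counts[frequent], decreasing = TRUE)
```

Lowering `threshold` admits more types, which is why starting high and working down keeps the output manageable.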
Bar charts are easy to construct. This one shows the “Zipf” relationship rather clearly (at least when the stop words have been included). The function tibble constructs a tidy data frame.
tibble(word=names(mj), frequency=mj) %>%
top_n(25,frequency) %>%
mutate(word=reorder(word, frequency)) %>%
ggplot(aes(word,frequency)) +
geom_col() + coord_flip()
You can also draw word clouds to summarize the most common types; eye candy can be useful to attract attention (though it makes it difficult to compare the frequencies… quick, which is the 5th most common word?). Don’t try to show too many words. Removing stop words would be very useful in this case.
set.seed(133) # random locations; fix the seed to be able to reproduce
wordcloud(names(mj), mj, max.words=50)
The function zipf_plot from the helper file \({\tt text\_utils.R}\) shows the Zipf plot. By default, it fits a least squares line to the first 250 frequencies.
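The helper's source is not shown here, so the following is only a rough sketch of what such a plot might involve: sort the counts, plot log frequency against log rank, and fit a least squares line to the leading ranks. The function name `zipf_sketch`, its arguments, and the toy counts are all hypothetical and are not the code in the helper file.

```r
# Hypothetical type counts following an exact power law: count = 1000 / rank
toy_freq <- 1000 / (1:500)

# Sketch of a Zipf plot; assumes all counts are positive
zipf_sketch <- function(counts, n_fit = 250, draw = interactive()) {
  freq <- sort(counts, decreasing = TRUE)
  d <- data.frame(log_rank = log(seq_along(freq)),
                  log_freq = log(freq))
  k <- min(n_fit, nrow(d))
  # Least squares line on the log-log scale, using only the first k ranks
  fit <- lm(log_freq ~ log_rank, data = d[seq_len(k), ])
  if (draw) {
    plot(d$log_rank, d$log_freq, xlab = "log rank", ylab = "log frequency")
    abline(fit, col = "red")
  }
  invisible(coef(fit))
}

zipf_coef <- zipf_sketch(toy_freq)
```

With the wine data one would pass `mj` in place of `toy_freq`. A fitted slope near -1 corresponds to the classical form of Zipf's law; the toy counts are constructed to give that slope.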
