This notebook describes some basics of text modeling in R. In particular, the example highlights building a document-term matrix using the package tm. The example concludes by finding words associated with low and high values of a numerical variable, in this case, prices of wine.
I use these packages so frequently that I load them into R up front. Others that are occasionally useful will be loaded as needed. (Unlike library, require loads a package only if it was not already present in the working environment.)
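Because require returns TRUE or FALSE, it also enables the common install-if-missing idiom; a minimal sketch (using tm just for illustration):
# require() returns TRUE/FALSE, so it can guard a conditional install
if (!require(tm)) {
  install.packages("tm")
  library(tm)
}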
require(tm)
Loading required package: tm
Loading required package: NLP
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -------------------------------------------------------------------------
annotate(): ggplot2, NLP
filter(): dplyr, stats
lag(): dplyr, stats
require(stringr)
Loading required package: stringr
Start by reading the wine data. It is in a CSV file. The function read_csv creates a “tidy” data frame that, for example, does not convert text to factors by default. It comes from the readr package, part of the tidyverse. (You will need to download these data to your computer and use the appropriate path.)
Wine <- read_csv("../data/Wine.csv") # I capitalize the names of data frames
Parsed with column specification:
cols(
review = col_integer(),
id = col_integer(),
label = col_character(),
description = col_character(),
type = col_character(),
alcohol = col_integer(),
location = col_character(),
date = col_character(),
rating = col_character(),
variety = col_character(),
vintage = col_integer(),
color = col_character(),
points = col_integer(),
price = col_double()
)
Warning: 4945 parsing failures.
 row     col               expected actual               file
3786 alcohol no trailing characters     .4 '../data/Wine.csv'
3787 alcohol no trailing characters     .2 '../data/Wine.csv'
3788 alcohol no trailing characters     .5 '../data/Wine.csv'
3857 alcohol no trailing characters     .2 '../data/Wine.csv'
3858 alcohol no trailing characters     .4 '../data/Wine.csv'
See problems(...) for more details.
dim(Wine)
[1] 20508 14
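The failures remain available after reading; problems() from readr returns them as a data frame. A quick look (a sketch):
# each row records where the parser stumbled, what it expected, and what it saw
head(problems(Wine))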
The function read_csv gets upset because it expects the alcohol variable to be integer valued, but then it discovers some decimal points. We can get read_csv to ignore this problem by telling it that values of the alcohol variable are doubles (or, perhaps better, fix those data values to remove the decimals).
Wine <- read_csv("../data/Wine.csv", col_types=cols(alcohol='d'))
dim(Wine)
[1] 20508 14
summary(Wine)
review id label description type
Min. : 1 Min. :163522 Length:20508 Length:20508 Length:20508
1st Qu.: 5335 1st Qu.:171463 Class :character Class :character Class :character
Median :10572 Median :179542 Mode :character Mode :character Mode :character
Mean :10538 Mean :180035
3rd Qu.:15758 3rd Qu.:188588
Max. :20888 Max. :198253
alcohol location date rating variety
Min. : 1.00 Length:20508 Length:20508 Length:20508 Length:20508
1st Qu.:13.00 Class :character Class :character Class :character Class :character
Median :13.50 Mode :character Mode :character Mode :character Mode :character
Mean :13.39
3rd Qu.:14.00
Max. :41.00
NA's :537
vintage color points price
Min. :1969 Length:20508 Min. :79.0 Min. : 1.99
1st Qu.:2001 Class :character 1st Qu.:85.0 1st Qu.: 11.80
Median :2004 Mode :character Median :87.0 Median : 16.00
Mean :2004 Mean :86.8 Mean : 21.30
3rd Qu.:2007 3rd Qu.:89.0 3rd Qu.: 25.00
Max. :2011 Max. :99.0 Max. :2006.00
NA's :2075 NA's :179 NA's :1923
Now focus on the column description that holds the tasting notes.
Wine$description[1:4]
[1] "Lemon oil and grapefruit aromas follow through on a medium-bodied palate with impressive wieght and a dry, tart finish."
[2] "Bacon fat, black cherry, dill, oak aromas. A rich entry leads to a moderately full-bodied palate with forward fruit and a finish that offers sleek tannins and fine acidity. A more subtle style of California Syrah."
[3] "Earthy, herbal, slightly herbaceous aromas. A medium-bodied palate leads to a short finish that is earthy, tart and has limited fruit."
[4] "Cedar, cherry tomato, and herbal aromas. A rich entry leads to a moderately full-bodied palate with forward fruit and a big finish that offers ripe fruit, moderate tannins and acidity."
The prices are a bit skewed.
Wine %>%
ggplot(aes(x=price)) + geom_histogram() + scale_x_log10()
And they differ slightly between red and white wines. The comparison is easier to visualize with frequency polygons that are not filled in.
Wine %>%
filter(!is.na(color)) %>%
ggplot(aes(x=price, ..density.., color=color)) + geom_freqpoly() + scale_x_log10()
As usual with R, there are lots of alternative displays, such as side-by-side boxplots that tell a similar story. (ggplot2 generates “pretty” graphs, but you might find the syntax overwhelming at first.)
boxplot(price ~ color, data=Wine, log='y')
The next task seems obvious, but has a lasting impact: What’s a word? Is a number a word? What about punctuation? Are “subject” and “subjects” different words? tm has a collection of tools for taking care of these tasks, but you need to decide which to use.
To get started, put the text into a corpus object. When using tm, a corpus is usually created as a collection of documents is read into R. tm nicely handles many different types of documents, including PDF and Word files.
WineCorpus <- Corpus(VectorSource(Wine$description))
WineCorpus
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 20508
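The same constructor handles files on disk. For instance, a corpus of plain-text files could be built from a directory (a sketch; the path "../data/notes" is hypothetical):
# DirSource treats each matching file in the directory as a separate document
NotesCorpus <- Corpus(DirSource("../data/notes", pattern = "\\.txt$"))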
A corpus object in tm is a decorated list, a list that has been adorned with special attributes when created. That means we can peek into the corpus as if it were a list by referring to the numbered elements of the list (which are the documents).
is.list(WineCorpus)
[1] TRUE
WineCorpus[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 119
To see the text itself, use the inspect command.
inspect(WineCorpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 119
Lemon oil and grapefruit aromas follow through on a medium-bodied palate with impressive wieght and a dry, tart finish.
Now comes the fun: tokenizing the text of the corpus. How should this text be represented as words? There are many choices. (BTW, removeWords only removes selected words, not all of them!)
getTransformations() # defined in tm
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument"
[5] "stripWhitespace"
If something you want to do is not among these – or if you want finer control – define your own content transformation by mimicking the following style.
# to convert a misspelled word
toCorrect <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
# to convert some pattern of text to a space
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
# to convert text to lower case
toLower <- content_transformer(function(text) tolower(text))
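Before applying a transformation to all 20,508 documents, it can be worth trying it on a tiny throwaway corpus (a sketch):
# check that the hyphen becomes a space and that case conversion works
tmp <- Corpus(VectorSource("Lemon-oil and GRAPEFRUIT aromas"))
inspect(tm_map(tmp, toSpace, '-')[[1]])
inspect(tm_map(tmp, toLower)[[1]])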
Removing common stop words (such as “the”, “a”, “an”, …) is often done as well, using the function removeWords. Here’s a sample of the stopwords that are defined in tm. (tm has collections of stopwords for other languages, so you have to pick “english” in this case.)
length(stopwords('english'))
[1] 174
stopwords('english') # the full list
[1] "i" "me" "my" "myself" "we" "our" "ours"
[8] "ourselves" "you" "your" "yours" "yourself" "yourselves" "he"
[15] "him" "his" "himself" "she" "her" "hers" "herself"
[22] "it" "its" "itself" "they" "them" "their" "theirs"
[29] "themselves" "what" "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are" "was" "were"
[43] "be" "been" "being" "have" "has" "had" "having"
[50] "do" "does" "did" "doing" "would" "should" "could"
[57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're"
[64] "they're" "i've" "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll" "you'll" "he'll"
[78] "she'll" "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
[85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" "won't"
[92] "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't"
[99] "let's" "that's" "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an" "the" "and"
[113] "but" "if" "or" "because" "as" "until" "while"
[120] "of" "at" "by" "for" "with" "about" "against"
[127] "between" "into" "through" "during" "before" "after" "above"
[134] "below" "to" "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again" "further" "then"
[148] "once" "here" "there" "when" "where" "why" "how"
[155] "all" "any" "both" "each" "few" "more" "most"
[162] "other" "some" "such" "no" "nor" "not" "only"
[169] "own" "same" "so" "than" "too" "very"
Stemming converts words to a base form, removing variations produced by plurals or by adding tense to a verb (trimming off a trailing ‘s’ or ‘es’), resulting in a smaller vocabulary.
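Stemming is not applied below, but here is what it would look like (a sketch; stemDocument relies on the SnowballC package):
# stemDocument maps inflected forms to their stems, e.g. "aromas" to "aroma"
stemDocument(c("aromas", "finishes", "tannins"))
# to stem every document in the corpus:
# WineCorpus <- tm_map(WineCorpus, stemDocument)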
Typically many of these transformations are applied, often with certain words in mind.
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, toCorrect, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/') # otherwise runs together
WineCorpus <- tm_map(WineCorpus, removePunctuation) # might not be right (!)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
# WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english")) # leave for now
# WineCorpus <- tm_map(WineCorpus, removeNumbers) # not many around
# WineCorpus <- tm_map(WineCorpus, removeWords, c('yuck')) # specific word(s)
inspect(WineCorpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 117
lemon oil and grapefruit aromas follow through on a medium bodied palate with impressive weight and a dry tart finish
The key object for our analysis is known as a document term matrix (or, when transposed, a term document matrix). It contains the counts of every word type in each document. Each of the 20,508 rows represents a document, and each of the 5,641 columns identifies a word type. Because most of the matrix entries are zeros, it is held in “sparse” format. (Notice that you cannot recover the source corpus from the document term matrix. This matrix represents each document as a “bag of words”.)
dtm <- DocumentTermMatrix(WineCorpus)
dim(dtm)
[1] 20508 5641
We can start to do statistics now – counting. For example, all but 545,707 elements of the \(20,508 \times 5,641 = 115,685,628 \approx 116\) million counts in the document term matrix are zero. (This is the count if you have not removed the stopwords; the number of types is smaller with the stopwords removed.)
dtm
<<DocumentTermMatrix (documents: 20508, terms: 5641)>>
Non-/sparse entries: 545707/115139921
Sparsity : 100%
Maximal term length: 20
Weighting : term frequency (tf)
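You can peek at a corner of the sparse matrix with inspect (a sketch; the term columns chosen here are arbitrary):
inspect(dtm[1:3, 10:15]) # three documents by six terms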
It is now simple to use matrix functions from R to find the number of words in each document and the number of times each type appears (albeit at the cost of converting the sparse matrix into a dense matrix in order to use rowSums and colSums).
ni <- rowSums(as.matrix(dtm)) # tokens in each document
mj <- colSums(as.matrix(dtm)) # columns are named by the word types; frequency of each
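If that dense conversion strains memory, the slam package (which tm uses internally to store the sparse matrix) computes the same sums without densifying; a sketch:
# row_sums and col_sums operate directly on the sparse representation
ni <- slam::row_sums(dtm)
mj <- slam::col_sums(dtm)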
Check a few of the terms to make sure that the data appear okay. If you don’t spend time getting the data ready, you will find lots of issues.
j <- which.max(str_length(names(mj)))
j
[1] 4149
names(mj)[j]
[1] "blackberrypeppercorn"
To see whether this is real text, you have to find the document that contains it. The count for this type shows that it appears in just one document.
mj[j]
blackberrypeppercorn
1
which(0 != as.vector(dtm[,4149]))
[1] 12333
Here’s the relevant portion of the original source:
“Creme brulee, blackberry,peppercorn, and mocha aromas. A soft, silky…”
There’s no space around that comma between “blackberry” and “peppercorn”, and tm has collapsed the two words together. We could fix this by adding the comma to the list of punctuation characters that are turned into spaces rather than simply removed.
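That fix might look like the following (a sketch; it belongs before removePunctuation when rebuilding the corpus from the raw descriptions):
# convert commas as well as hyphens and slashes into spaces so that
# adjoining words remain separate tokens
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,')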
It is hard to imagine a distribution of counts that is more skewed than the counts of the word types (left).
par(mfrow=c(1,2))
hist(mj, breaks=50, main="Counts of Word Types")
hist(ni, breaks=50, main="Words per Document")
Even after taking logs, the counts remain skewed! This is common in text. “Tokens are common, but types are rare.”
hist(log(mj), breaks=50, main="Counts of Word Types")
The frequency counts in mj are named and in alphabetical order. We can use these names to produce a bar graph of the most common words with ggplot. Stopwords such as “and” and “this” are common.
Freq <- data_frame(type = names(mj), count = mj) # ggplot and dplyr want data frames
Freq %>%
top_n(25, count) %>%
mutate(type=reorder(type,count)) %>% # rather than alphabetical order
ggplot(aes(type,count)) + geom_col() + coord_flip()
Let’s see what happens without the stop words. (The following code should run, but dies a horrible death! I suspect R ran out of memory along the way. Instead, to remove the stopwords, use tm_map as shown above.)
Freq %>%
filter(!type %in% stopwords('english')) %>% # be careful about syntax here or it will crash!
top_n(25, count) %>%
mutate(type=reorder(type,count)) %>% # rather than alphabetical order
ggplot(aes(type,count)) + geom_col() + coord_flip()
This is a good chance to check whether the frequencies of the word types match a Zipf distribution, which is commonly associated with text.
A Zipf distribution is characterized by a power law: the frequency of word types is inversely proportional to rank, \(f_k \propto 1/k\). Said differently, the frequency of the second most common word is half that of the most common, the frequency of the third is one-third that of the most common, etc. A little algebra shows why: if \(f_k = c/k\), then \(\log f_k = \log c - \log k\). That is, a plot of the log of the frequencies should be linear in the log of the rank \(k\), with slope near \(-1\).
Freq %>%
arrange(desc(count)) %>% # decreasing by count
mutate(rank=row_number()) %>% # add row number
ggplot(aes(x=log(rank), y=log(count))) +
geom_point() +
# geom_smooth(method='lm', se=FALSE) +
geom_abline(slope=-1, intercept=11, color='red')
The least squares slope (from the geom_smooth that is commented out above; if drawn, it appears in blue) is steeper, being dominated by the many less common words. You can mitigate that effect by weighting the regression by the counts.
Temp <- Freq %>% mutate(rank=row_number())
lm(log(count) ~ log(rank), data=Temp, weights=sqrt(count))
Call:
lm(formula = log(count) ~ log(rank), data = Temp, weights = sqrt(count))
Coefficients:
(Intercept) log(rank)
12.104 -1.166
Before concluding this short introduction, let’s relate the word types to prices. Which word types are associated with pricey wines, and which with cheaper wines?
To find out, combine the information in the document term matrix with prices from the Wine data frame. First fix that weirdo price found using JMP earlier.
max(Wine$price, na.rm=TRUE) # missing values are contagious in R
[1] 2006
i <- which.max(Wine$price) # handles the NA by default
i
[1] 12508
Wine$price[i] <- NA
max(Wine$price, na.rm=TRUE)
[1] 571
To keep things manageable, consider words that appear, say, at least 10 times in the corpus. That’s still 1,742 types. The matrix counts has the counts for each word type. We can find the word with the highest average price by turning these into 0/1 indicators and using a matrix product.
counts <- as.matrix(dtm[,9<mj])
dim(counts)
[1] 20508 1742
count.names <- colnames(counts) # save for later
count.names[1:10]
[1] "and" "aromas" "bodied" "dry" "finish" "follow" "grapefruit"
[8] "impressive" "lemon" "medium"
The word “and” appears many times, so turn these integers into indicators (1 if the word type appears in a note at least once and 0 if not present).
counts[1:10]
[1] 2 2 1 3 2 3 1 1 2 4
pmin does the element-by-element comparison, but it may return the result as a plain vector. So, make sure the 0/1s end up back in a matrix.
counts <- as.matrix(pmin(counts,1),nrow=20508)
counts[1:10]
[1] 1 1 1 1 1 1 1 1 1 1
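An equivalent one-liner keeps the matrix shape by using a logical comparison instead of the pmin/as.matrix pair (a sketch):
# TRUE/FALSE become 1/0 under multiplication; the dim attributes survive
counts <- 1 * (counts > 0)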
A quick check of the prices, identifying those not missing.
not.missing <- ! is.na(Wine$price)
min(Wine$price[not.missing]) # Two-buck chuck?
[1] 1.99
Finally, find the average price of the wines whose descriptions mention each word type.
avg.price <- (Wine$price[not.missing] %*% counts[not.missing,])/colSums(counts[not.missing,])
Note that some averages are ‘NaN’ because a word type does not appear in any wine with a known price. For example, the word promising is common, but it is not found in the descriptions of wines with known prices.
min(avg.price)
[1] NaN
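sort drops these undefined values automatically (its na.last argument defaults to NA), which is why the rankings below work. To count them first (a sketch):
# number of word types never seen in a wine with a known price
sum(is.nan(avg.price))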
Which words come with the high prices?
names(avg.price) <- count.names
sort(avg.price, decreasing=TRUE)[1:20]
incredibly potential finest measures decade requires mignon
82.12375 79.06857 70.57333 69.14000 69.08652 67.20000 67.12108
exquisite filet class spinach champagne barrique underlying
66.09000 62.21878 61.85643 61.73389 60.68243 60.47250 59.61385
rancio unctuous underneath proportioned stream outstanding
59.20750 58.39800 55.92455 55.17400 54.81385 54.52350
And which with low prices?
sort(avg.price)[1:20]
money aromatically dishwater reductive burger chemical carefree
9.035909 9.450000 9.796000 10.323333 10.543548 10.807778 10.902105
quaffer picnic everyday summer value burst meager
11.030993 11.120179 11.190667 11.271585 11.516320 11.656528 11.661429
sherbet price quaffing moscato shirt leaning
11.666667 11.893312 11.988636 12.069231 12.171818 12.328333
These notes continue with this analysis, considering the singular value decomposition of the counts in the document term matrix.