This notebook provides an example of sentiment analysis, a dictionary-based method that is easily computed from the document-term matrix.
The methods in this notebook add another package to the standard list.
require(tm)
Loading required package: tm
Loading required package: NLP
require(tidytext) # has dictionaries for sentiment analysis
Loading required package: tidytext
require(stringr) # not in the tidyverse
Loading required package: stringr
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------------------------------
annotate(): ggplot2, NLP
filter(): dplyr, stats
lag(): dplyr, stats
The tidytext package has 4 sentiment dictionaries. You can get a list of these from get_sentiments.
# get_sentiments("")
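For reference, the four lexicon names are 'bing', 'afinn', 'nrc', and 'loughran'; a quick way to see their sizes (note that recent tidytext releases fetch some of these through the textdata package on first use):
sapply(c('bing', 'afinn', 'nrc', 'loughran'),
       function(lex) nrow(get_sentiments(lex)))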
View is a very nice function to use in RStudio. You cannot edit the data, just view it; it does, however, have some very nice sorting features. (If the option eval=FALSE is set in the header of an “R chunk”, RStudio won’t evaluate that chunk unless you do it manually.)
View(get_sentiments('bing'))
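In the R Markdown source, such a chunk would look like this (a minimal illustration of the header syntax; the body is whatever code you want to hold back):
```{r, eval=FALSE}
View(get_sentiments('bing'))
```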
The sorting is particularly handy in dictionaries with a variety of codings (Loughran and NRC).
View(get_sentiments('loughran'))
Negative words tend to dominate sentiment dictionaries. The Bing lexicon, for example, has more than twice as many negative words as positive ones.
table(get_sentiments('bing')$sentiment)
negative positive
4782 2006
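prop.table makes the imbalance explicit; about 70% of the Bing words are negative:
round(prop.table(table(get_sentiments('bing')$sentiment)), 2)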
Read the wine data from its CSV file. If this corpus were larger and the calculations took longer, it would be a good idea to save the “processed” data in a “.sav” file rather than process it again here.
Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
[1] 20508 14
WineCorpus <- Corpus(VectorSource(Wine$description))
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to)) # substitute one pattern for another
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " ")) # map a pattern to a space
toLower <- content_transformer(function(text) tolower(text)) # wrap tolower for tm_map
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight") # fix a misspelling in the reviews
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.') # otherwise words run together; the dot is special in a regex, so escape it
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english"))
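As noted above, with a larger corpus one could save the processed corpus and restore it in later runs instead of repeating these steps; a sketch (the file path is hypothetical):
save(WineCorpus, file = "../data/WineCorpus.sav") # write once after processing
# load("../data/WineCorpus.sav")                  # restore in later sessions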
Sentiment analysis can be done efficiently from the document-term matrix because the DTM has the counts of all the words in the documents. (See the tidytext book for a different approach.)
Now compute the document-term matrix, along with the row and column marginal counts. (The DTM is a little smaller here, with fewer types (5,412) than in the first slides, because of handling the comma differently and removing the stop words.)
dtm <- DocumentTermMatrix(WineCorpus)
dtm
<<DocumentTermMatrix (documents: 20508, terms: 5412)>>
Non-/sparse entries: 475966/110513330
Sparsity : 100%
Maximal term length: 15
Weighting : term frequency (tf)
ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj) # for convenience and clarity
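As a quick sanity check on the marginal counts, peek at the most common word types (illustrative):
head(sort(mj, decreasing = TRUE), 10)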
Looks like I have finally gotten rid of the run-together words (at least those that lead to long word types).
word.types[j <- which.max(str_length(word.types))]
[1] "extraordinarily"
I’ll do the sentiment analysis of the wines using the Bing lexicon of positive and negative words. (As some practice, you should try to do this with the AFINN lexicon, the one that assigns numeric weights rather than just positive or negative; a sketch appears after the Bing scores below.)
Bing <- get_sentiments('bing')
dim(Bing)
[1] 6788 2
Which of these words appear in the wine corpus? Pick out the word types that appear both in the wine corpus and in the Bing lexicon. The others are not relevant for the sentiment calculations.
keep.types <- intersect(Bing$word, word.types)
length(keep.types)
[1] 722
dplyr is handy for filtering the Bing data frame, picking the word types that meet this condition. Also add a numeric score to simplify a later calculation, and keep the positive and negative terms separated.
Bing <- Bing %>%
filter(word %in% keep.types) %>%
mutate(pos = ifelse(sentiment=='positive',+1,0),
neg = ifelse(sentiment=='negative',+1,0),
score = pos - neg)
dim(Bing)
[1] 722 5
Have a peek, noticing that the words are sorted alphabetically.
View(Bing)
Now filter the DTM. The dtm object behaves like a matrix, so use indices.
bing.dtm <- dtm[,word.types %in% keep.types]
bing.dtm
<<DocumentTermMatrix (documents: 20508, terms: 722)>>
Non-/sparse entries: 74785/14731991
Sparsity : 99%
Maximal term length: 15
Weighting : term frequency (tf)
Get the columns lined up. We want to make sure that the columns of the DTM align with the elements of the sentiment dictionary. (Yes, it’s a little tricky to manage matrices mixed with data frames, but not that hard since we have one of each.)
counts <- bing.dtm[,Bing$word]
any(colnames(counts) != Bing$word)
[1] FALSE
Now counting the number of positive and negative words is easy (easy if you recognize the matrix connection). It’s a matrix multiplication; the call to as.vector converts the 1-column matrix into a vector.
rating.sentiment <- as.vector(as.matrix(counts) %*% Bing$score) # counts has the columns aligned with Bing
Since longer reviews accompany better wines, it is not too surprising that the typical net sentiment is also positive.
summary(rating.sentiment)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-7.000 0.000 2.000 1.796 3.000 12.000
hist(rating.sentiment)
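For the AFINN exercise mentioned earlier, the same pattern applies; a minimal sketch, assuming the weight column is named value (older tidytext releases called it score):
AFINN <- get_sentiments('afinn') %>% filter(word %in% word.types)
afinn.dtm <- dtm[, AFINN$word] # columns in lexicon order, as with the Bing counts above
afinn.sentiment <- as.vector(as.matrix(afinn.dtm) %*% AFINN$value) # weighted sum of counts
summary(afinn.sentiment)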
Are these sentiments related to the number of points assigned to the wine?
plot(rating.sentiment, Wine$points, xlab="Sentiment", ylab="Points")
Some “dithering” (adding random variation – a bit of fuzz – to avoid over-printing) improves the plot, but it’s still overwhelmed by the volume of points. So I just drew a subset of the points.
dither <- function(x) return (x + rnorm(length(x),sd=0.05*sd(x, na.rm=TRUE)))
dither(1:10)
[1] 1.418513 2.111122 2.927421 3.993517 5.222360 5.870933 6.940083 8.066629 9.233551
[10] 10.081846
i <- sample(1:length(rating.sentiment), 5000)
plot(dither(rating.sentiment[i]), dither(Wine$points[i]))
And if you make a data frame, ggplot is prettier still. It’s also then easy to add the regression line. There’s association, but it is not very strong.
tibble(points = Wine$points, sentiment=rating.sentiment) %>%
ggplot(aes(sentiment,points)) +
geom_jitter(alpha=0.1) + # alpha determines the opacity
geom_smooth(method='lm')
The fit is significant, but explains very little variation.
summary(lm(Wine$points ~ rating.sentiment))
Call:
lm(formula = Wine$points ~ rating.sentiment)
Residuals:
Min 1Q Median 3Q Max
-7.7359 -2.1830 -0.0652 1.9348 12.2762
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 86.39450 0.02878 3001.73 <2e-16 ***
rating.sentiment 0.22356 0.01064 21.01 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.07 on 20327 degrees of freedom
(179 observations deleted due to missingness)
Multiple R-squared: 0.02126, Adjusted R-squared: 0.02121
F-statistic: 441.6 on 1 and 20327 DF, p-value: < 2.2e-16
Who says negative words and positive words should have the same weight?
rating.pos <- as.vector(as.matrix(counts) %*% Bing$pos)
rating.neg <- as.vector(as.matrix(counts) %*% Bing$neg)
any(rating.pos-rating.neg != rating.sentiment) # all should match!
[1] FALSE
It seems that negative words have a much stronger connection to the points than those effusive positive words, with some noticeable nonlinearity too. The interaction is not significant, but the nonlinear terms (squares) are, particularly for the count of negative words.
summary(lm(Wine$points ~ rating.pos * rating.neg + I(rating.pos^2) + I(rating.neg^2)))
Call:
lm(formula = Wine$points ~ rating.pos * rating.neg + I(rating.pos^2) +
I(rating.neg^2))
Residuals:
Min 1Q Median 3Q Max
-11.1680 -1.9064 0.0936 1.8084 12.1369
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 84.800892 0.063968 1325.670 < 2e-16 ***
rating.pos 0.582570 0.032998 17.655 < 2e-16 ***
rating.neg 0.293899 0.054162 5.426 5.82e-08 ***
I(rating.pos^2) -0.014903 0.004193 -3.554 0.00038 ***
I(rating.neg^2) 0.096777 0.011990 8.072 7.31e-16 ***
rating.pos:rating.neg 0.019317 0.011319 1.707 0.08791 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.883 on 20323 degrees of freedom
(179 observations deleted due to missingness)
Multiple R-squared: 0.137, Adjusted R-squared: 0.1368
F-statistic: 645.4 on 5 and 20323 DF, p-value: < 2.2e-16
Because we have a response (namely, the points awarded to a wine), we can get a hint of how to build a task-specific dictionary. Let’s focus for this example on the words that show up frequently, those that appear at least 500 times in the tasting notes. (Note: If this were “for real”, I would need to reserve a substantial portion of the tasting notes for a test data set.)
sum(500 <= mj)
[1] 175
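A peek at a few of these frequent types (illustrative):
head(word.types[500 <= mj], 10)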
Now regress the points on the number of times these words appear.
X <- as.matrix(dtm[,500 <= mj])
Most only occur once in a document.
sum(X)
[1] 360936
sum(1<X)
[1] 12645
Now do the big regression. Most of the wines have a value for the points variable.
i <- which(!is.na(Wine$points))
length(i)
[1] 20329
y <- Wine$points[i]
X <- X[i,]
Run the regression of the points on these counts of words.
regr <- lm(y ~ X)
summary(regr)$r.squared
[1] 0.5755442
When looking at the estimates, it seems “lemon” is indeed a negative word when it comes to wine tasting points, but the effect is very small.
b <- coef(regr)[-1] # drop the intercept
b <- b[order(abs(b), decreasing=TRUE)]
b[1:20] # big ones
Xgreat Xlong Xquick Xlengthy Xdelicious Xexcellent Xlush Xelegant
1.7001641 1.6775833 -1.6437744 1.6311694 1.5958857 1.3278610 1.0956059 1.0868234
Xdrink Xshort Xstructured Xfull Xvibrant Xrich Xwill Xmodest
-1.0501087 -0.9903800 0.9339117 0.9140514 0.8840144 0.8488617 0.8426404 -0.8184256
Xflavorful Xzesty Xhoneyed Xsilky
0.8154570 0.8080806 0.7904664 0.7619647
tail(b, n=20) # small ones
Xlemon Xthyme Xcedar Xround Xfollow Xcherry Xlightly
-0.035476250 -0.035210841 -0.033114671 -0.031758445 -0.028855524 -0.026968057 -0.025587830
Xberries Xwhite Xplum Xgrilled Xdry Xoak Xmedium
-0.025236926 -0.021022113 0.020827858 -0.016786299 0.015965913 -0.014241626 -0.012870548
Xpleasant Xacidity Xbodied Xdried Xblackberry Xpeppery
0.011626942 0.009080301 0.007850889 -0.006091515 0.002402311 0.001943533
Just because a word has a large estimated coefficient doesn’t mean it should be scored with that weight; you should also check for statistical precision. Some rounding, and perhaps shrinkage toward zero by ridge or lasso methods, would also make sense. You don’t see weights with several digits of precision in published lexicons.
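For instance, a sketch of the lasso with the glmnet package (assuming it is installed; alpha = 1 selects the lasso penalty):
require(glmnet)
cv <- cv.glmnet(X, y, alpha = 1)      # cross-validation picks the penalty weight
b.lasso <- coef(cv, s = "lambda.1se") # shrunken coefficients; many are exactly zero
sum(b.lasso != 0)                     # how many words survive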
Finally, R has “messed up” the names by prefixing them with the name of the matrix, “X”. We can fix that easily with the string tools: use a regular expression to replace the first character with nothing. (The pattern "." matches any single character, and str_replace changes only the first match.)
str_replace(names(b),".","")[1:10]
[1] "great" "long" "quick" "lengthy" "delicious" "excellent" "lush"
[8] "elegant" "drink" "short"
names(b) <- str_replace(names(b), ".", "")
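An anchored pattern would be more targeted, removing only a literal leading "X"; an alternative form of the same fix:
names(b) <- str_replace(names(b), "^X", "")
Either works here, since every name begins with the prefix.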