Text as Data: Vector Space Models
Robert Stine, July 2017

This notebook provides an example of sentiment analysis, a dictionary-based method that is easily computed from the document-term matrix.

Setup R

The methods in this notebook add another package to the standard list.

require(tm)
Loading required package: tm
Loading required package: NLP
require(tidytext)   # has dictionaries for sentiment analysis
Loading required package: tidytext
require(stringr)    # not in the tidyverse
Loading required package: stringr
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------------------------------
annotate(): ggplot2, NLP
filter():   dplyr, stats
lag():      dplyr, stats

Sentiment analysis

The tidytext package includes four sentiment dictionaries (AFINN, Bing, NRC, and Loughran); get_sentiments fetches any one of them by name.

# get_sentiments("")
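
For a quick look at the relative sizes of these lexicons, something like the following works (a sketch; in newer releases of tidytext, the AFINN, NRC, and Loughran lexicons are downloaded via the textdata package the first time they are requested):

sapply(c('afinn', 'bing', 'nrc', 'loughran'), function(lex) nrow(get_sentiments(lex)))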

View is a very nice function to use in RStudio. You cannot edit the data, just view it; it does, however, have some handy sorting features. (If the option eval=FALSE is set in the header of an "R chunk", RStudio won't evaluate that chunk unless you do it manually.)

View(get_sentiments('bing'))

The sorting is particularly handy in dictionaries with a variety of codings (Loughran and NRC).

View(get_sentiments('loughran'))

Negative words tend to dominate sentiment dictionaries. The Bing dictionary, for example, has more than twice as many negative words as positive ones.

table(get_sentiments('bing')$sentiment)

negative positive 
    4782     2006 

Read the wine data from its CSV file. If this corpus were larger and the calculations took longer, it would be a good idea to save the "processed" data in a file (for example, with saveRDS) rather than rebuild it here each time; a sketch follows the text-processing code below.

Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
[1] 20508    14
WineCorpus <- Corpus(VectorSource(Wine$description))
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.')     # otherwise runs together; dot is special regex
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english")) 
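
Here is a minimal sketch of the caching idea mentioned above: save the processed corpus once and reload it in later sessions instead of repeating the tm_map steps. (The file name is only an example.)

saveRDS(WineCorpus, file = "wine_corpus.rds")   # cache the processed corpus
# WineCorpus <- readRDS("wine_corpus.rds")      # reload in a later session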

Sentiment analysis can be done efficiently from the document-term matrix because the DTM has the counts of all the words in the documents. (See the tidytext book for a different approach.)

Now compute the document-term matrix, along with the row and column marginal counts. (The DTM here has fewer types, 5,412, than in the first slides because commas are handled differently and the stop words have been removed.)

dtm <- DocumentTermMatrix(WineCorpus)
dtm
<<DocumentTermMatrix (documents: 20508, terms: 5412)>>
Non-/sparse entries: 475966/110513330
Sparsity           : 100%
Maximal term length: 15
Weighting          : term frequency (tf)
ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))
word.types <- names(mj)   # for convenience and clarity
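
Converting the full DTM to a dense matrix just to get these marginal counts is wasteful for larger corpora; the slam package, which tm uses for its sparse matrices, computes the same sums without densifying. A sketch of the alternative:

ni <- slam::row_sums(dtm)   # same values as rowSums(as.matrix(dtm)), without the dense copy
mj <- slam::col_sums(dtm)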

Looks like I have finally gotten rid of the run-together words (at least those that lead to long word types).

word.types[j <- which.max(str_length(word.types))]
[1] "extraordinarily"

I'll do the sentiment analysis of the wines using the Bing lexicon of positive and negative words. (For practice, try doing this with the AFINN lexicon, which assigns numeric weights rather than just positive or negative labels; a sketch appears after the Bing results below.)

Bing <- get_sentiments('bing')
dim(Bing)
[1] 6788    2

Which of these words appear in the wine corpus? Pick out the word types that appear both in the wine corpus and in the Bing lexicon; the others are not relevant for the sentiment calculations.

keep.types <- intersect(Bing$word, word.types)
length(keep.types)
[1] 722

dplyr is handy for filtering the Bing data frame, picking the word types that meet this condition. Also add a numeric score to simplify a later calculation and keep the positive and negative terms separated.

Bing <- Bing %>%
    filter(word %in% keep.types) %>%
    mutate(pos = ifelse(sentiment=='positive',+1,0),
           neg = ifelse(sentiment=='negative',+1,0),
           score = pos - neg)
dim(Bing)
[1] 722   5

Have a peek, noticing that the words are sorted alphabetically.

View(Bing)
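
Outside of RStudio (or in a rendered copy of this notebook), head gives a quick look as well:

head(Bing)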

Now filter the DTM. The dtm object indexes like a matrix, so use column indices.

bing.dtm <- dtm[,word.types %in% keep.types]
bing.dtm
<<DocumentTermMatrix (documents: 20508, terms: 722)>>
Non-/sparse entries: 74785/14731991
Sparsity           : 99%
Maximal term length: 15
Weighting          : term frequency (tf)

Get the columns lined up. We want to make sure that the columns of the DTM align with the elements of the sentiment dictionary. (Yes, it's a little tricky to manage matrices mixed with data frames, but not that hard since we have one of each.)

counts <- bing.dtm[,Bing$word]
any(colnames(counts) != Bing$word)
[1] FALSE

Now counting the positive and negative words is easy, at least once you recognize the matrix connection: it's a matrix multiplication. Use the aligned counts matrix so the scores multiply the right columns; the call to as.vector converts the one-column matrix into a vector.

rating.sentiment <- as.vector(as.matrix(counts) %*% Bing$score)
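
As a quick sanity check on the matrix formulation, the same net sentiment can be computed with row sums over the positive and negative columns; the two versions agree:

counts.m <- as.matrix(counts)
check <- rowSums(counts.m[, Bing$score == 1]) - rowSums(counts.m[, Bing$score == -1])
all(check == rating.sentiment)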

Since longer reviews accompany better wines, it is not too surprising that the typical net sentiment is also positive.

summary(rating.sentiment)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -7.000   0.000   2.000   1.796   3.000  12.000 
hist(rating.sentiment)
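
Here is a sketch of the AFINN exercise suggested earlier. AFINN scores words from -5 to +5, so the same matrix product now sums weights rather than just counting words. (The numeric column is called score in older releases of tidytext and value in newer ones, hence the defensive rename.)

Afinn <- get_sentiments('afinn')
names(Afinn)[2] <- 'score'                       # 'score' or 'value', depending on the tidytext version
Afinn <- Afinn %>% filter(word %in% word.types)
afinn.counts <- as.matrix(dtm[, Afinn$word])     # align the DTM columns with the lexicon rows
afinn.sentiment <- as.vector(afinn.counts %*% Afinn$score)
summary(afinn.sentiment)

The rest of the notebook continues with the Bing scores.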

Are these sentiments related to the number of points assigned to the wine?

plot(rating.sentiment, Wine$points, xlab="Sentiment", ylab="Points")

Some "dithering" (adding a bit of random fuzz to avoid overplotting) improves the plot, but it is still overwhelmed by the sheer number of reviews, so I just drew a subset of the points.

dither <- function(x) return (x + rnorm(length(x),sd=0.05*sd(x, na.rm=TRUE)))
dither(1:10)
 [1]  1.418513  2.111122  2.927421  3.993517  5.222360  5.870933  6.940083  8.066629  9.233551
[10] 10.081846
i <- sample(1:length(rating.sentiment), 5000)
plot(dither(rating.sentiment[i]), dither(Wine$points[i]))
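
Base R's jitter function adds similar fuzz if you would rather not write your own; for example:

plot(jitter(rating.sentiment[i], amount = 0.3), jitter(Wine$points[i], amount = 0.3),
     xlab = "Sentiment", ylab = "Points")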

And if you make a data frame, ggplot is prettier still. It’s also then easy to add the regression line. There’s association, but it is not very strong.

data_frame(points = Wine$points, sentiment=rating.sentiment) %>%
    ggplot(aes(sentiment,points)) +
    geom_jitter(alpha=0.1) +        # alpha determines the opacity
    geom_smooth(method='lm')

The fit is significant, but explains very little variation.

summary(lm(Wine$points ~ rating.sentiment))

Call:
lm(formula = Wine$points ~ rating.sentiment)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7359 -2.1830 -0.0652  1.9348 12.2762 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      86.39450    0.02878 3001.73   <2e-16 ***
rating.sentiment  0.22356    0.01064   21.01   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.07 on 20327 degrees of freedom
  (179 observations deleted due to missingness)
Multiple R-squared:  0.02126,   Adjusted R-squared:  0.02121 
F-statistic: 441.6 on 1 and 20327 DF,  p-value: < 2.2e-16

Who says negative words and positive words should have the same weight?

rating.pos <- as.vector(as.matrix(counts) %*% Bing$pos)
rating.neg <- as.vector(as.matrix(counts) %*% Bing$neg)
any(rating.pos - rating.neg != rating.sentiment)   # all should match!
[1] FALSE

It seems that negative words have a much stronger connection to the points than those effusive positive words, with some noticeable nonlinearity too. The interaction is not significant, but the nonlinear (squared) terms are, particularly for the count of negative words.

summary(lm(Wine$points ~ rating.pos * rating.neg + I(rating.pos^2) + I(rating.neg^2)))

Call:
lm(formula = Wine$points ~ rating.pos * rating.neg + I(rating.pos^2) + 
    I(rating.neg^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-11.1680  -1.9064   0.0936   1.8084  12.1369 

Coefficients:
                       Estimate Std. Error  t value Pr(>|t|)    
(Intercept)           84.800892   0.063968 1325.670  < 2e-16 ***
rating.pos             0.582570   0.032998   17.655  < 2e-16 ***
rating.neg             0.293899   0.054162    5.426 5.82e-08 ***
I(rating.pos^2)       -0.014903   0.004193   -3.554  0.00038 ***
I(rating.neg^2)        0.096777   0.011990    8.072 7.31e-16 ***
rating.pos:rating.neg  0.019317   0.011319    1.707  0.08791 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.883 on 20323 degrees of freedom
  (179 observations deleted due to missingness)
Multiple R-squared:  0.137, Adjusted R-squared:  0.1368 
F-statistic: 645.4 on 5 and 20323 DF,  p-value: < 2.2e-16

Make your own dictionary

Because we have a response (namely, the points awarded to a wine), we can get a hint of how to build a task-specific dictionary. Let's focus for this example on the words that appear frequently, at least 500 times in the tasting notes. (Note: if this were "for real", I would need to reserve a substantial portion of the tasting notes as a test set; a sketch of such a split follows below.)

sum(500 <= mj)
[1] 175
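
For completeness, a minimal sketch of the holdout split mentioned above; the split is illustrative only and is not used in what follows:

set.seed(2017)                                              # any fixed seed, for reproducibility
test.rows  <- sample(nrow(Wine), round(0.2 * nrow(Wine)))   # hold out 20% of the tasting notes
train.rows <- setdiff(seq_len(nrow(Wine)), test.rows)
# build the dictionary from train.rows only, then score the reviews in test.rows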

Now regress the points on the number of times these words appear.

X <- as.matrix(dtm[,500 <= mj])

Most of these words occur at most once in any given document; counts larger than 1 are relatively rare.

sum(X)
[1] 360936
sum(1<X)
[1] 12645

Now set up the big regression. Most of the wines have a value for the points variable; keep only those rows.

i <- which(!is.na(Wine$points))
length(i)
[1] 20329
y <- Wine$points[i]
X <- X[i,]

Run the regression of the points on these counts of words.

regr <- lm(y ~ X)
summary(regr)$r.squared
[1] 0.5755442

Looking at the estimates, "lemon" does indeed seem to be a negative word when it comes to wine-tasting points, but its effect is very small.

b <- coef(regr)[-1]  # drop the intercept
b <- b[order(abs(b), decreasing=TRUE)]
b[1:20]  #  big ones
     Xgreat       Xlong      Xquick    Xlengthy  Xdelicious  Xexcellent       Xlush    Xelegant 
  1.7001641   1.6775833  -1.6437744   1.6311694   1.5958857   1.3278610   1.0956059   1.0868234 
     Xdrink      Xshort Xstructured       Xfull    Xvibrant       Xrich       Xwill     Xmodest 
 -1.0501087  -0.9903800   0.9339117   0.9140514   0.8840144   0.8488617   0.8426404  -0.8184256 
 Xflavorful      Xzesty    Xhoneyed      Xsilky 
  0.8154570   0.8080806   0.7904664   0.7619647 
tail(b, n=20)  # small ones
      Xlemon       Xthyme       Xcedar       Xround      Xfollow      Xcherry     Xlightly 
-0.035476250 -0.035210841 -0.033114671 -0.031758445 -0.028855524 -0.026968057 -0.025587830 
    Xberries       Xwhite        Xplum     Xgrilled         Xdry         Xoak      Xmedium 
-0.025236926 -0.021022113  0.020827858 -0.016786299  0.015965913 -0.014241626 -0.012870548 
   Xpleasant     Xacidity      Xbodied       Xdried  Xblackberry     Xpeppery 
 0.011626942  0.009080301  0.007850889 -0.006091515  0.002402311  0.001943533 

Just because a word has a large estimated coefficient does not mean it should be scored with that weight; you should also check the statistical precision of the estimate. Some rounding, and perhaps shrinkage toward zero by ridge or lasso methods, would also make sense. You don't see lexicon weights quoted to several digits of precision.
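
A minimal sketch of the lasso idea, using the glmnet package (assuming it is installed):

require(glmnet)
cv <- cv.glmnet(X, y, alpha = 1)          # alpha = 1 is the lasso; alpha = 0 gives ridge
b.lasso <- coef(cv, s = "lambda.1se")     # shrunken weights at a conservative choice of penalty
sum(b.lasso != 0)                         # how many words keep a nonzero weight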

Finally, R has "messed up" the names by prefixing each coefficient name with the name of the matrix, "X". We can fix that easily with the string tools: use a regular expression to replace the first character with nothing.

str_replace(names(b),".","")[1:10]
 [1] "great"     "long"      "quick"     "lengthy"   "delicious" "excellent" "lush"     
 [8] "elegant"   "drink"     "short"    
