---
title: "Text as Data: Vector Space Models"
output: html_notebook
author: Robert Stine
date: July 2017
---

This notebook provides an example of a  *sentiment analysis*, a dictionary based method that is easy computed from the document term matrix.


# Setup R

The methods in this notebook add another package to the standard list.

```{r}
require(tm)

require(tidytext)   # has dictionaries for sentiment analysis
require(stringr)    # not in the tidyverse
require(tidyverse)
```

# Sentiment analysis

The `tidytext` package has 4 sentiment dictionaries.  You can get a list of these from `get_sentiments`.

```{r}
# get_sentiments("")
```

`View` is a very nice function to use in R-Studio.  You cannot edit the data, just view it; it does, however, have some very nice sorting features. (If the option eval=FALSE is set in the header to an "R chunk", R-Studio won't evaluate that chunk unless you do it manually.)

```{r eval=FALSE}
View(get_sentiments('bing'))
```

The sorting is particularly handy in dictionaries with a variety of codings (Loughran and NRC).

```{r eval=FALSE}
View(get_sentiments('loughran'))
```

*Negative* words tend to dominate dictionaries.  For example the Bing dictionary is dominated by negative words.

```{r}
table(get_sentiments('bing')$sentiment)
```

Read the wine data from its CSV file.  If this corpus were larger and the calculations took longer, it would be a good idea to save the "processed" data in a ".sav" file rather than process it again here.

```{r}
Wine <- read_csv("../data/Wine.csv", col_types = cols(alcohol = col_double()))
dim(Wine)
```

```{r}
WineCorpus <- Corpus(VectorSource(Wine$description))

replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))

WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, replace, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,|\\.')     # otherwise runs together; dot is special regex
WineCorpus <- tm_map(WineCorpus, removePunctuation)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english")) 
```

Sentiment analysis can be done efficiently from the document-term matrix because the DTM has the counts of all the words in the documents.  (See the tidytext book for a different approach.) 

Now compute the document term matrix, along with the row and column marginal counts. (The DTM is a little smaller, with fewer types -- 5,412 -- here than in the first slides because of handling the comma differently and removing the stop words.)

```{r}
dtm <- DocumentTermMatrix(WineCorpus)
dtm

ni <- rowSums(as.matrix(dtm))
mj <- colSums(as.matrix(dtm))

word.types <- names(mj)   # for convenience and clarity
```

Looks like I have finally gotten rid of the run-together words (at least those that lead to long word types).

```{r}
word.types[j <- which.max(str_length(word.types))]
```

I'll do the sentiment analysis of the wines using the Bing lexicon of positive and negative words. (As some practice you should try do do this with the AFINN lexicon, the lexicon that assigns weights rather than just positive or negative.)

```{r}
Bing <- get_sentiments('bing')
dim(Bing)
```

Which of these words appear in the wine corpus?  Pick out the word types that both appear in the wine corpus *and* in the Bing lexicon. The other are not relevant for the sentiment calculations.

```{r}
keep.types <- intersect(Bing$word, word.types)
length(keep.types)
```

`dplyr` is handy for filtering the Bing data frame, picking the word types that meet this condition.  Also add a numeric score to simplify a later calculation and keep the positive and negative terms separated.
 
```{r}
Bing <- Bing %>%
    filter(word %in% keep.types) %>%
    mutate(pos = ifelse(sentiment=='positive',+1,0),
           neg = ifelse(sentiment=='negative',+1,0),
           score = pos - neg)
dim(Bing)
```

Have a peek, noticing that the words are sorted alphabetically.

```{r eval=FALSE}
View(Bing)
```

Now filter the DTM.  the `dtm` object is a matrix, so use indices. 

```{r}
bing.dtm <- dtm[,word.types %in% keep.types]
bing.dtm
```

Get the columns lined up. We want to make sure that the columns of the DTM are align with the elements of the sentiment dictionary. (Yes, its a little tricky to manage matrices mixed with data frames, but not that hard since we have one of each.)

```{r}
counts <- bing.dtm[,Bing$word]
```

```{r}
any(colnames(counts) != Bing$word)
```

Now counting the number of positive and negative words is easy (easy if you recognize the matrix connection).  It's a matrix multiplication; the call to `as.vector` converts the 1-column matrix into a vector.

```{r}
rating.sentiment <- as.vector(as.matrix(bing.dtm) %*% Bing$score)
```

Since longer reviews accompany better wines, it is not too surprising that the typical net sentiment is also positive. 

```{r}
summary(rating.sentiment)
```

```{r}
hist(rating.sentiment)
```

Are these sentiments related to the number of points assigned to the wine?

```{r}
plot(rating.sentiment, Wine$points, xlab="Sentiment", ylab="Points")
```

Some "dithering" (adding random variation -- a bit of fuzz -- to avoid over-printing) improves the plot, but it's still overwhelmed by the volume of points.  So I just drew a subset of the points.

```{r}
dither <- function(x) return (x + rnorm(length(x),sd=0.05*sd(x, na.rm=TRUE)))
dither(1:10)
```

```{r}
i <- sample(1:length(rating.sentiment), 5000)
plot(dither(rating.sentiment[i]), dither(Wine$points[i]))
```

And if you make a data frame, `ggplot` is prettier still.  It's also then easy to add the regression line.  There's association, but it is not very strong.

```{r}
data_frame(points = Wine$points, sentiment=rating.sentiment) %>%
    ggplot(aes(sentiment,points)) +
    geom_jitter(alpha=0.1) +        # alpha determines the opacity
    geom_smooth(method='lm')
```

The fit is significant, but explains very little variation.

```{r}
summary(lm(Wine$points ~ rating.sentiment))
```

Who says negative words and positive words should have the same weight?

```{r}
rating.pos <- as.vector(as.matrix(bing.dtm) %*% Bing$pos)
rating.neg <- as.vector(as.matrix(bing.dtm) %*% Bing$neg)

any(rating.pos-rating.neg != rating.sentiment)   # all should match!
```

It seems like negative words have a much stronger connection to the points than those effusive positive words, with some noticeable nonlinearity too.  The interaction is not significant, but the nonlinear terms (squares) are, particularly the for the count of negative words.

```{r}
summary(lm(Wine$points ~ rating.pos * rating.neg + I(rating.pos^2) + I(rating.neg^2)))
```


# Make your own dictionary

Because we have a response (namely, the points awarded to a wine), we can get a hint of how to make your own task-specific dictionary.  Let's focus for this example on the words that show up frequently, those that show up in at least 500 times in the tasting notes.  (Note: If this were "for real", I would need to reserve a substantial portion of the tasting notes for a test data set.)

```{r}
sum(500 <= mj)
```

Now regress the points on the number of times these words appear.

```{r}
X <- as.matrix(dtm[,500 <= mj])
```

Most only occur once in a document.

```{r}
sum(X)
sum(1<X)
```

Now do the big regression.  Most of the wines have the points variable.

```{r}
i <- which(!is.na(Wine$points))
length(i)
```

```{r}
y <- Wine$points[i]
X <- X[i,]
```

Run the regression of the points on these counts of words.

```{r}
regr <- lm(y ~ X)
summary(regr)$r.squared
```

When looking at the estimates, it seems "lemon" is indeed a negative word when it comes to wine tasting points, but the effect is very small.

```{r}
b <- coef(regr)[-1]  # drop the intercept
b <- b[order(abs(b), decreasing=TRUE)]
b[1:20]  #  big ones
tail(b, n=20)  # small ones
```

Just because you have a large estimated coefficient doesn't mean that the word should be scored with that weight -- you should also check for statistical precision.  Some rounding and perhaps shrinkage toward zero would also make sense, perhaps by ridge or lasso methods.  You don't see weights in lexicons with several digits of precision.

Finally, R has "messed up" the names by prefixing the name of the matrix "X".  We can fix that easily with the string tools: use a regular expression to replace the first character by nothing.

```{r}
str_replace(names(b),".","")[1:10]
```

names