---
title: "Text as Data: Naive Bayes"
output: html_notebook
author: Robert Stine
date: July 2017
---

# Setup R

I use these packages so frequently I will just load them all up front.  Others that are occasionally useful will be loaded as needed.  `require` statements generally precede the use of less common functions from these packages so you have an idea where they are from.

```{r}
require(tm)
require(stringr)    
require(tidyverse)

source("text_utils.R")   # good turing, zipf plot
```

Notice that in the following code the option "eval=FALSE" is often set in the header to a chunk of R code, typically for `View` commands.  This option tells R-Studio to skip that chunk when the "Run all chunks" or "Run all chunks above" commands are executed.  You can always run these directly.


# Getting the source data: The Federalist Papers

Start by obtaining the original source from the collection of open-source text offered by Project Gutenberg. 

```{r}
require(gutenbergr)

content <- gutenberg_works()
names(content)
```

There are too many titles to search manually, so use R.

```{r}
length(content$title)
```

Where are the Federalist papers in this collection?  As in other situations, *regular expressions* work are very useful if you're dealing with text data.

```{r}
require(stringr)
b <- str_detect(content$title, ".*Federalist.*")
i <- which(b)
i
```

```{r}
content$title[i]
content$author[i]        # What... no author?
content$gutenberg_id[i]
```

Now that we have the ID, we can get the text itself.  (I routinely use capital letters for the names of data frames.)

```{r}
TheFederalist <- gutenberg_download(18)
```

The result is a table of 25,563 rows and two columns.

```{r}
dim(TheFederalist)
```

It is important to take a look at the "raw" data in order to appreciate the following steps.

```{r}
View(TheFederalist)
```

Only the second column is of interest to us.  It has the actual text lines, many of which are blank. (Why two columns? `gutenberg_download` allows you to extract several documents at once; the first column would be used to separate these. Because all of these lines in this example come from the same document, the first column is constant.  It would be more common to find these data in separate files, one for each paper.)

Pull out the text column, but leave the one column in a data frame to simplify the subsequent processing.  (`select` comes from the `tidyverse` collection.)

```{r}
TheFederalist <- select(TheFederalist, text)
dim(TheFederalist)
```

[Just in case, I saved these data in case there's a problem with internet access during class.
```{r eval=FALSE}
save(TheFederalist, file="federalist.sav")
```
You can recover the file using
```{r eval=FALSE}
load("federalist.sav")
```
Hopefully, I won't need to use these commands in class.]


# Preparing the text

The first tasks are to remove these blank lines (assuming we're not interested in, say, counting those) and then group the text into the separate papers.  `dplyr` is convenient for for this task, filter (selecting) rows from this one-column *data frame* to obtain a new data frame with fewer rows.  The data contained about 7,000 blank lines.

```{r}
FedPapers <- TheFederalist %>% filter(text != "")
dim(FedPapers)
```

Now divide the `text` column into the separate papers, collecting the lines for each paper into one.

Here's a trick I learned from the "Tidy Text" book to handle this task.  The idea is to use the counter feature of `dplyr` to add a paper number.  Then I can join the lines with these numbers.  Once again, a regular expression is useful. (Each paper is spread over several lines that we'd like to join together as a single document.)

```{r}
pattern <- "^FEDERALIST[. ]*No\\."   # ^ denotes start of line
FedPapers <- FedPapers %>% mutate(paper = cumsum(str_detect(text,pattern)))

head(FedPapers)
```

```{r}
tail(FedPapers)
```

```{r eval=FALSE}
View(FedPapers)
```

After skimming the file, one discovers this...

```{r}
as.character(FedPapers[14847,'text'])
```

I will remove the second, somewhat manually here from the source data and then assign paper numbers again.  (Alternately, you could have used the assigned paper number, but that leaves a straggling line and messes up the numbers of the following papers.)

```{r}
FedPapers <- TheFederalist %>% 
    filter(text != "")     %>%
    filter(row_number() < 14847 | 15155 < row_number()) %>%
    mutate(paper = cumsum(str_detect(text,pattern)))

head(FedPapers)
```

BTW, the leading numbers are *footnotes*.

```{r}
tail(FedPapers)
```

Now pick out the author and build a data frame with each paper as a document.  The author of the first paper (Alexander Hamilton) is listed on the 4th line.  Let's see if that pattern holds for other papers.  String matching with a basic regular expression makes this easy. 

Unfortunately, the pattern varies when topics are continued or dates get added.

```{r}
Temp <-FedPapers                  %>% 
    mutate(line=row_number())     %>%   
    filter(str_detect(text,"HAMILTON|MADISON|JAY|FEDERALIST"))
Temp
```

```{r}
FedPapers[321:325,]
```

```{r}
FedPapers[1611:1617,]
```

It is useful to skim this temporary data frame to see the other differences, such as what happens when there are multiple authors (#20) or unknown authorship (#50)

```{r eval=FALSE}
View(Temp)
```

To keep track of the authorship of the 85 papers,  remove the author names to a separate data frame for later use in labeling the Federalist Papers with a *join* operation.  

```{r}
Authors <- Temp %>% filter(str_detect(text,"HAMILTON|MADISON|JAY"))
dim(Authors)
```

It is useful for our later classification task to take a look at the names of the authors.  

```{r eval=FALSE}
View(Authors)
```

According to the Gutenberg version, papers 49-57 and 62-63 have "disputed" authorship: either Hamilton or Madison.  (Wikipedia has slightly different list of the papers of of disputed authorship, namely 49-58 and 62-63.)

Now remove the author names from the data.  (You could remove the numbering line as well.)
```{r}
FedPapers <- FedPapers %>% filter(!str_detect(text,"HAMILTON|MADISON|JAY"))
dim(FedPapers)
```
```{r}
head(FedPapers)
```

At this point, the `tidytext` approach to text analysis diverges from the `tm` approach.  (I have alternating opinions, but will use the "tm" approach in these lectures.) For `tm`, pull the text for each document together rather than being spread over several lines.  `tapply` is a very useful function for jobs like this.

```{r}
FedPapers <- tapply(FedPapers$text,         # data on separate lines
                    FedPapers$paper,        # grouping variable
                    str_c, collapse=' ')    # function applied to group   (aka, "paste")
```

```{r}
length(FedPapers)
```

Now label these papers by the author by making a data frame.  

```{r}
head(Authors$text)
```

```{r}
FederalistPapers <- tibble(author=Authors$text, text=FedPapers)
```

Have a look at the result.

```{r}
head(FederalistPapers)
```

Save this version too (though it is not too hard to recreate).

```{r eval=FALSE}
save(FederalistPapers, file="~/data/text/federalist/FederalistPapers.sav")
```

Now I can "rejoin" the analysis from this point by running this command.

```{r eval=FALSE}
load("~/data/text/federalist/FederalistPapers.sav")
```


# Tokenization

The next task seems obvious, but has a lasting impact:  What's a word?  Is a number a word?  What about punctuation?  Are "subject" and "subjects" different words?  `tm` has a collection of tools for taking care of these tasks, but you need to decide which to use. 

As in other examples, put the text into a `corpus` object.  A corpus is usually created when using `tm` when a collection of separte documents is read into R.  `tm` nicely handles many different types of documents, including PDF and Word files.  In this example, the text is already in a variable, so use `VectorSource` to convert the data into a corpus.

```{r}
FederalistCorpus <- Corpus(VectorSource(FederalistPapers$text))
FederalistCorpus
```

A corpus object in `tm` is a decorated list, a list that has been adorned with special attributes when created.  That means we can peek into the corpus as if it were a list by referring to the number elements of the list (which are the documents).

To see the text itself, use the `inspect` method

```{r}
inspect(FederalistCorpus[[1]])
```

Now tokenize the text.  This is the same "script" used in other examples.  Be careful with the order of the operations; you cannot replace "FEDERALIST" after moving to lower case, for example.  Notice that the stopwords remain since the use of these may indicate the style of an author.

```{r}
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))

FederalistCorpus <- tm_map(FederalistCorpus, removeWords, c('FEDERALIST', 'No', 'No.'))
FederalistCorpus <- tm_map(FederalistCorpus, toLower)
FederalistCorpus <- tm_map(FederalistCorpus, toSpace, '-|/|,|\\.') 
FederalistCorpus <- tm_map(FederalistCorpus, removePunctuation)
FederalistCorpus <- tm_map(FederalistCorpus, removeNumbers)
FederalistCorpus <- tm_map(FederalistCorpus, stripWhitespace)
```

```{r}
inspect(FederalistCorpus[[1]])
```


# Document term matrix

We can start to do statistics now.  The document term matrix contains the counts of every word time in each document. Each row represents a Federalist Paper, and each column is a word type. Because most of the matrix entries are zeros, it is held in "sparse" format. Only 8% of the elements in the document term matrix are not zero.  (Notice that you *cannot* recover the source corpus from the document term matrix. This matrix represents each document as a "bag of words".)

```{r}
dtm <- DocumentTermMatrix(FederalistCorpus)
dtm
```

It is now simple to use matrix functions from R to find the number of words in each document and the number of times each type appears (albeit at the cost of converting the sparse matrix into a dense matrix).  

```{r}
ni <- rowSums(as.matrix(dtm))  # tokens in each document
mj <- colSums(as.matrix(dtm))  # columns are named by the word types; frequency of each
```

Lots more words in these papers than in wine review.  (`qplot` is the `ggplot` version of `plot`.)

```{r}
qplot(1:85, ni, xlab="Federalist Paper", ylab="Word Count")
```

The frequency counts in `mj` are named and in alphabetical order.  We can use these names to produce a nice bar graph with `ggplot`.  

```{r}
Freq <- tibble(type = names(mj), count = mj)     # ggplot and dplyr want data frames

Freq %>%
    top_n(25, count)                         %>%
    mutate(type=reorder(type,count))         %>%     # rather than alphabetical
    ggplot(aes(type,count)) + geom_col() + coord_flip()
```

Let's see what happens without the stop words.

```{r}
Freq %>%
    filter (!type %in% stopwords('english')) %>%     # syntax of %in% resembles %>%
    top_n(25, count)                         %>%
    mutate(type=reorder(type,count))         %>%     # rather than alphabetical
    ggplot(aes(type,count)) + geom_col() + coord_flip()
```

This is a good chance to check out whether this text matches a Zipf distribution.  

Recall that a Zipf distribution is characterized by a power law:  the frequency of the second most common word is inversely proportional to its rank, $p_k \propto 1/k$.  Said differently, the frequency of the second most common word is half that of the most common, the frequency of the third is one-third the most common, etc.  A little algebra shows that for this to occur, then $\log p_k \approx b_0 - \log k$.  That is, a plot of the log of the frequencies should be linear in the log of the rank $k$, with slope -1.  In  this example, the slope is larger than -1, but the plot is quite linear.

```{r}
zipf_plot(mj)
```

The least squares slope for all of the data would besteeper, being dominated by the many less common words.

# Comparing vocabularies

We can compare vocabulary used by Hamilton to that used by Madison by picking out the papers known to have been written by one or the other.  These counts form the basis of naive Bayes model.  It is of some concern that we have so many fewer papers written by Madison.

```{r}
dtm.madison <- dtm[FederalistPapers$author=='MADISON',]
dim(dtm.madison)

dtm.hamilton <- dtm[FederalistPapers$author=='HAMILTON',]
dim(dtm.hamilton)
```

```{r}
mj.madison  <- colSums(as.matrix(dtm.madison))       # type frequencies for each
mj.hamilton <- colSums(as.matrix(dtm.hamilton))

Counts <- bind_rows(                                 # dplyr style stacks these
            data_frame(author="Madison",  word=names(mj.madison),  count=mj.madison),
            data_frame(author="Hamilton", word=names(mj.hamilton), count=mj.hamilton))

Counts
```

Counts of words by author are confounded with the frequency of authorship: Hamilton wrote more of the Federalist Papers.  (We have too few for John Jay.  Plus he is not considered in the running for writing the papers of unknown authorship.)  The Tidy Text book has many examples of this style of plotting produced by `ggplot`. 

```{r}
Counts %>%
    filter(300 < count) %>%
    ggplot(aes(word,count)) + geom_col() + coord_flip() +
    facet_wrap(~author)
```                 

Proportions make more sense.

```{r}
Counts %>%
    group_by(author) %>%
    mutate(proportion = count / sum(count)) %>%
    filter(0.005 < proportion)        %>%
    ggplot(aes(word,proportion)) + geom_col() + coord_flip() +
    facet_wrap(~author)
```

A scatterplot offers yet a different way to view these data.  Rather than look at bar charts, plot the proportions for Hamilton versus those for Madison.  A scatterplot requires *two* variables... one for the Hamilton proportions and another for the Madison proportions.  Our data so far has just *one* column to facilitate using `ggplot`. The function `spread` in `dplyr` splits a column into two.  

```{r}
Proportions <- Counts %>%
    group_by(author)                        %>%
    mutate(proportion = count / sum(count)) %>%
    select(-count)                          %>%   # messed up without this
    spread(key=author, value=proportion)

Proportions
```

A quick check that these are indeed probability distributions.

```{r}
colSums(Proportions[,2:3], na.rm=TRUE)
```


```{r}
Proportions %>% 
    filter(0.001 < Hamilton & 0.001 < Madison) %>%
    ggplot(aes(x=Hamilton, y=Madison))  +
    geom_abline(color = "gray40", lty = 2) +
    geom_point(color='lightgray') +
    geom_text(aes(label = word), check_overlap=TRUE, vjust=1.5) + 
    scale_y_log10() + scale_x_log10()
```


# Building the naive Bayes classifier

We can use these distributions of word types to classify the documents of disputed authorship.

Naive Bayes allows us to convert these distributions over the word types into a classifier.  The idea is intuitive and works like this.  Ignoring issues of sampling variation (as if we had computed the word frequencies from a *very* large corpus), for every word type $W_j$ we "know" the probability $P_{author}(W_j)$ for $author \in \{Hamilton, Madison\}$.  Here comes two big assumptions: we're going to ignore the order of the words and then pretend that they occur independently *conditionally* on knowing the author.

There's a rationale that supports this approach, and this rationale uses Bayes Theorem (hence the name).  Given a document $D = \{w_1, w_2, \ldots, w_i, \ldots, w_n\}$ -- a sequence of word tokens -- we want to assign the document to a *class*, namely identify the author as either Hamilton or Madison.  The optimal solution is to assign based on the maximal probability, $P(author|D)$.  But how can we find that conditional probability?  Bayes Rule:  $P(author|D) = P(D|author)P(author)/P(D)$.  The normalizing factor $P(D)$ is constant (does not depend on the author), so we need to find the author that maximizes $P(D|author)P(author)$.  The prior probability is something we can defer to historians (or just set to 1/2), but the other probability is harder.

What should we use for $P(D|author) = P(\{w_1, w_2, \ldots, w_i, \ldots, w_n\}|author)$?  That's easy *if* we're willing to assume the word choices are independent *given* the author:
$$ P(\{w_1, w_2, \ldots, w_n\}|author) = \prod_{i=1}^n P(w_i|author)$$

This expression explains why this is called "naive" Bayes! Do you really think the choice of the next word is independent given you know the author.  That said, this assumption makes it easy to compute because we have both the proportions and the counts for the various documents.  

The only catch is what to do if, say, Paper #49 has a word that, say, Hamilton never used in the papers he is known to have written.  Should this make the probability of Hamilton being the author zero?  There are an elaborate collection of ways to handle such *out-of-vocabulary* words.  Good-Turing smoothing replaces the zero (and shifts other small probabilities as well). (Yes, this is the same Alan Turing as in the recent movie.) The function `good_turing_probababilities` (from the $\tt text_files.R$ collection) does the needed adjustment.

```{r}
prob.madison  <- good_turing_probabilities(mj.madison )
prob.hamilton <- good_turing_probabilities(mj.hamilton)
```

The log-probability is now easy to compute.  (Be careful... we want larger values, but log probabilities are negative)  Let's start with papers of known authorship.  Paper #1 is by Hamilton, and the naive Bayes agrees.

```{r}
C <- as.matrix(dtm)

paper <- 1
sum(log(prob.hamilton)*C[paper,])
sum(log(prob.madison) *C[paper,])
```

Madison wrote Paper #10, and again naive Bayes agrees.

```{r}
paper <- 10
sum(log(prob.hamilton)*C[paper,])
sum(log(prob.madison) *C[paper,])
```

For paper 49 (of debated authorship) naive Bayes gives the authorship nod to Madison, albeit by a much closer margin than the others of known authorship (which were used to build the probabilities used by naive Bayes).

```{r}
paper <- 49
sum(log(prob.hamilton)*C[paper,])
sum(log(prob.madison) *C[paper,])
```

We can make a nice plot that summarizes these results for all of the papers.  Matrix multiplication avoids looping over the papers.

```{r}
dim(dtm)
length(prob.hamilton)
```

```{r}

lp.hamilton <- C %*% log(prob.hamilton)
lp.madison  <- C %*% log(prob.madison)

diff <- lp.hamilton - lp.madison
diff[c(1,10,49)]
```

Naive Bayes assigns most -- but not all -- of the disputed papers to Madison.  The Wiki would differ!  What would LSA do?

```{r}
tibble(paper=1:85, author=FederalistPapers$author, diff=as.vector(diff)) %>%
    ggplot(aes(paper,diff,color=author)) +
    geom_point() + labs(y="Log Likelihood Ratio")
    
```