I use these packages so frequently that I load them all up front. Others that are only occasionally useful are loaded as needed. A require statement generally precedes the first use of a less common function from one of these packages, so you have an idea where it comes from.
require(tm)
Loading required package: tm
Loading required package: NLP
require(stringr)
Loading required package: stringr
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------------------------------
annotate(): ggplot2, NLP
filter(): dplyr, stats
lag(): dplyr, stats
source("text_utils.R") # good turing, zipf plot
Notice that in the following code the option “eval=FALSE” is often set in the header of a chunk of R code, typically for View commands. This option tells RStudio to skip that chunk when the “Run all chunks” or “Run all chunks above” commands are executed. You can always run these chunks directly.
Start by obtaining the original source from the collection of open-source text offered by Project Gutenberg.
require(gutenbergr)
Loading required package: gutenbergr
content <- gutenberg_works()
names(content)
[1] "gutenberg_id" "title" "author" "gutenberg_author_id"
[5] "language" "gutenberg_bookshelf" "rights" "has_text"
There are too many titles to search manually, so use R.
length(content$title)
[1] 40737
Where are the Federalist Papers in this collection? As in other situations, regular expressions are very useful when you’re dealing with text data.
require(stringr)
b <- str_detect(content$title, ".*Federalist.*")
i <- which(b)
i
[1] 18
content$title[i]
[1] "The Federalist Papers"
content$author[i] # What... no author?
[1] NA
content$gutenberg_id[i]
[1] 18
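The `.*` on either side of the pattern is not needed, because a regular expression matches anywhere in a string unless it is anchored. Here is a minimal sketch with a made-up vector of titles (the titles are illustrative, not taken from the Gutenberg metadata). It uses base R's grepl rather than str_detect so that it runs without extra packages.

```r
# Hypothetical titles, for illustration only
titles <- c("Moby Dick", "The Federalist Papers", "Walden", NA)

b1 <- grepl(".*Federalist.*", titles)  # pattern as written above
b2 <- grepl("Federalist", titles)      # unanchored pattern selects the same rows
identical(b1, b2)

which(b2)                              # position of the matching title
```

One difference to keep in mind: grepl returns FALSE for a missing title, whereas str_detect returns NA. Either way, which() skips that element.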
Now that we have the ID, we can get the text itself. (I routinely use capital letters for the names of data frames.)
TheFederalist <- gutenberg_download(18)
Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org
The result is a table of 25,563 rows and two columns.
dim(TheFederalist)
[1] 25563 2
It is important to take a look at the “raw” data in order to appreciate the following steps.
View(TheFederalist)
Only the second column is of interest to us. It has the actual text lines, many of which are blank. (Why two columns? gutenberg_download
allows you to extract several documents at once; the first column would be used to separate these. Because all of these lines in this example come from the same document, the first column is constant. It would be more common to find these data in separate files, one for each paper.)
Pull out the text column, but leave the one column in a data frame to simplify the subsequent processing. (select comes from the tidyverse collection.)
TheFederalist <- select(TheFederalist, text)
dim(TheFederalist)
[1] 25563 1
[Just in case there’s a problem with internet access during class, I saved these data.
save(TheFederalist, file="federalist.sav")
You can recover the file using
load("federalist.sav")
Hopefully, I won’t need to use these commands in class.]
The first tasks are to remove these blank lines (assuming we’re not interested in, say, counting those) and then group the text into the separate papers. dplyr is convenient for this task: filter selects rows from this one-column data frame to obtain a new data frame with fewer rows. The data contained about 7,000 blank lines.
FedPapers <- TheFederalist %>% filter(text != "")
dim(FedPapers)
[1] 18790 1
Now divide the text
column into the separate papers, collecting the lines for each paper into one.
Here’s a trick I learned from the “Tidy Text” book to handle this task. The idea is to use the counter feature of dplyr
to add a paper number. Then I can join the lines with these numbers. Once again, a regular expression is useful. (Each paper is spread over several lines that we’d like to join together as a single document.)
pattern <- "^FEDERALIST[. ]*No\\." # ^ denotes start of line
FedPapers <- FedPapers %>% mutate(paper = cumsum(str_detect(text,pattern)))
head(FedPapers)
tail(FedPapers)
View(FedPapers)
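To see why the cumulative sum works as a paper counter, here is a small sketch on made-up lines (the lines are mine, not copied from the download). It uses base R's grepl in place of str_detect. Each header line contributes a TRUE, and the running total of those TRUEs is the paper number of every line that follows.

```r
# Made-up lines standing in for the text column
lines <- c("FEDERALIST. No. 1", "General Introduction", "HAMILTON",
           "FEDERALIST No. 2", "Concerning Dangers", "JAY", "more text")
pattern <- "^FEDERALIST[. ]*No\\."   # same pattern as above

starts <- grepl(pattern, lines)      # TRUE on the first line of each paper
paper  <- cumsum(starts)             # running count of headers seen so far
data.frame(paper, lines)
```

Any line that appears before the first header would get paper number 0, which is a handy way to spot front matter.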
After skimming the file, one discovers this…
as.character(FedPapers[14847,'text'])
[1] "*There are two slightly different versions of No. 70 included here."
I will remove the second version somewhat manually from the source data and then assign paper numbers again. (Alternatively, you could have used the assigned paper number, but that leaves a straggling line and messes up the numbers of the following papers.)
FedPapers <- TheFederalist %>%
filter(text != "") %>%
filter(row_number() < 14847 | 15155 < row_number()) %>%
mutate(paper = cumsum(str_detect(text,pattern)))
head(FedPapers)
BTW, the leading numbers are footnotes.
tail(FedPapers)
Now pick out the author and build a data frame with each paper as a document. The author of the first paper (Alexander Hamilton) is listed on the 4th line. Let’s see if that pattern holds for other papers. String matching with a basic regular expression makes this easy.
Unfortunately, the pattern varies when topics are continued or dates get added.
Temp <-FedPapers %>%
mutate(line=row_number()) %>%
filter(str_detect(text,"HAMILTON|MADISON|JAY|FEDERALIST"))
Temp
FedPapers[321:325,]
FedPapers[1611:1617,]
It is useful to skim this temporary data frame to see the other differences, such as what happens when there are multiple authors (#20) or unknown authorship (#50).
View(Temp)
To keep track of the authorship of the 85 papers, remove the author names to a separate data frame for later use in labeling the Federalist Papers with a join operation.
Authors <- Temp %>% filter(str_detect(text,"HAMILTON|MADISON|JAY"))
dim(Authors)
[1] 85 3
It is useful for our later classification task to take a look at the names of the authors.
View(Authors)
According to the Gutenberg version, papers 49-57 and 62-63 have “disputed” authorship: either Hamilton or Madison. (Wikipedia has a slightly different list of the papers of disputed authorship, namely 49-58 and 62-63.)
Now remove the author names from the data. (You could remove the numbering line as well.)
FedPapers <- FedPapers %>% filter(!str_detect(text,"HAMILTON|MADISON|JAY"))
dim(FedPapers)
[1] 18396 2
head(FedPapers)
At this point, the tidytext approach to text analysis diverges from the tm approach. (My opinion alternates, but I will use the tm approach in these lectures.) For tm, pull the text for each document together into a single string rather than leaving it spread over several lines. tapply is a very useful function for jobs like this.
FedPapers <- tapply(FedPapers$text, # data on separate lines
FedPapers$paper, # grouping variable
str_c, collapse=' ') # function applied to group (aka, "paste")
length(FedPapers)
[1] 85
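A toy version of the same call may make the mechanics clearer. The lines and paper numbers below are made up, and paste stands in for str_c (with collapse set, the two behave the same way here).

```r
# Made-up lines and their paper numbers
lines <- c("To the People", "of the State", "of New York",
           "AFTER an", "unequivocal experience")
paper <- c(1, 1, 1, 2, 2)

# paste(..., collapse = " ") joins the lines within each group into one string
docs <- tapply(lines, paper, paste, collapse = " ")
docs
length(docs)   # one element per paper
```

The result is a named character array rather than a data frame, with the names given by the grouping variable.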
Now label these papers by the author by making a data frame.
head(Authors$text)
[1] "HAMILTON" "JAY" "JAY" "JAY" "JAY" "HAMILTON"
FederalistPapers <- tibble(author=Authors$text, text=FedPapers)
Have a look at the result.
head(FederalistPapers)
Save this version too (though it is not too hard to recreate).
save(FederalistPapers, file="~/data/text/federalist/FederalistPapers.sav")
Now I can “rejoin” the analysis from this point by running this command.
load("~/data/text/federalist/FederalistPapers.sav")
The next task seems obvious, but has a lasting impact: What’s a word? Is a number a word? What about punctuation? Are “subject” and “subjects” different words? tm
has a collection of tools for taking care of these tasks, but you need to decide which to use.
As in other examples, put the text into a corpus object. With tm, a corpus is usually created when a collection of separate documents is read into R. tm nicely handles many different types of documents, including PDF and Word files. In this example, the text is already in a variable, so use VectorSource to convert the data into a corpus.
FederalistCorpus <- Corpus(VectorSource(FederalistPapers$text))
FederalistCorpus
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 85
A corpus object in tm is a decorated list, a list that has been adorned with special attributes when created. That means we can peek into the corpus as if it were a list by referring to the numbered elements of the list (which are the documents).
To see the text itself, use the inspect method.
inspect(FederalistCorpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 9543
FEDERALIST. No. 1 General Introduction For the Independent Journal. To the People of the State of New York: AFTER an unequivocal experience of the inefficacy of the subsisting federal government, you are called upon to deliberate on a new Constitution for the United States of America. The subject speaks its own importance; comprehending in its consequences nothing less than the existence of the UNION, the safety and welfare of the parts of which it is composed, the fate of an empire in many respects the most interesting in the world. It has been frequently remarked that it seems to have been reserved to the people of this country, by their conduct and example, to decide the important question, whether societies of men are really capable or not of establishing good government from reflection and choice, or whether they are forever destined to depend for their political constitutions on accident and force. If there be any truth in the remark, the crisis at which we are arrived may with propriety be regarded as the era in which that decision is to be made; and a wrong election of the part we shall act may, in this view, deserve to be considered as the general misfortune of mankind. This idea will add the inducements of philanthropy to those of patriotism, to heighten the solicitude which all considerate and good men must feel for the event. Happy will it be if our choice should be directed by a judicious estimate of our true interests, unperplexed and unbiased by considerations not connected with the public good. But this is a thing more ardently to be wished than seriously to be expected. The plan offered to our deliberations affects too many particular interests, innovates upon too many local institutions, not to involve in its discussion a variety of objects foreign to its merits, and of views, passions and prejudices little favorable to the discovery of truth. 
Among the most formidable of the obstacles which the new Constitution will have to encounter may readily be distinguished the obvious interest of a certain class of men in every State to resist all changes which may hazard a diminution of the power, emolument, and consequence of the offices they hold under the State establishments; and the perverted ambition of another class of men, who will either hope to aggrandize themselves by the confusions of their country, or will flatter themselves with fairer prospects of elevation from the subdivision of the empire into several partial confederacies than from its union under one government. It is not, however, my design to dwell upon observations of this nature. I am well aware that it would be disingenuous to resolve indiscriminately the opposition of any set of men (merely because their situations might subject them to suspicion) into interested or ambitious views. Candor will oblige us to admit that even such men may be actuated by upright intentions; and it cannot be doubted that much of the opposition which has made its appearance, or may hereafter make its appearance, will spring from sources, blameless at least, if not respectable--the honest errors of minds led astray by preconceived jealousies and fears. So numerous indeed and so powerful are the causes which serve to give a false bias to the judgment, that we, upon many occasions, see wise and good men on the wrong as well as on the right side of questions of the first magnitude to society. This circumstance, if duly attended to, would furnish a lesson of moderation to those who are ever so much persuaded of their being in the right in any controversy. And a further reason for caution, in this respect, might be drawn from the reflection that we are not always sure that those who advocate the truth are influenced by purer principles than their antagonists. 
Ambition, avarice, personal animosity, party opposition, and many other motives not more laudable than these, are apt to operate as well upon those who support as those who oppose the right side of a question. Were there not even these inducements to moderation, nothing could be more ill-judged than that intolerant spirit which has, at all times, characterized political parties. For in politics, as in religion, it is equally absurd to aim at making proselytes by fire and sword. Heresies in either can rarely be cured by persecution. And yet, however just these sentiments will be allowed to be, we have already sufficient indications that it will happen in this as in all former cases of great national discussion. A torrent of angry and malignant passions will be let loose. To judge from the conduct of the opposite parties, we shall be led to conclude that they will mutually hope to evince the justness of their opinions, and to increase the number of their converts by the loudness of their declamations and the bitterness of their invectives. An enlightened zeal for the energy and efficiency of government will be stigmatized as the offspring of a temper fond of despotic power and hostile to the principles of liberty. An over-scrupulous jealousy of danger to the rights of the people, which is more commonly the fault of the head than of the heart, will be represented as mere pretense and artifice, the stale bait for popularity at the expense of the public good. It will be forgotten, on the one hand, that jealousy is the usual concomitant of love, and that the noble enthusiasm of liberty is apt to be infected with a spirit of narrow and illiberal distrust. 
On the other hand, it will be equally forgotten that the vigor of government is essential to the security of liberty; that, in the contemplation of a sound and well-informed judgment, their interest can never be separated; and that a dangerous ambition more often lurks behind the specious mask of zeal for the rights of the people than under the forbidden appearance of zeal for the firmness and efficiency of government. History will teach us that the former has been found a much more certain road to the introduction of despotism than the latter, and that of those men who have overturned the liberties of republics, the greatest number have begun their career by paying an obsequious court to the people; commencing demagogues, and ending tyrants. In the course of the preceding observations, I have had an eye, my fellow-citizens, to putting you upon your guard against all attempts, from whatever quarter, to influence your decision in a matter of the utmost moment to your welfare, by any impressions other than those which may result from the evidence of truth. You will, no doubt, at the same time, have collected from the general scope of them, that they proceed from a source not unfriendly to the new Constitution. Yes, my countrymen, I own to you that, after having given it an attentive consideration, I am clearly of opinion it is your interest to adopt it. I am convinced that this is the safest course for your liberty, your dignity, and your happiness. I affect not reserves which I do not feel. I will not amuse you with an appearance of deliberation when I have decided. I frankly acknowledge to you my convictions, and I will freely lay before you the reasons on which they are founded. The consciousness of good intentions disdains ambiguity. I shall not, however, multiply professions on this head. My motives must remain in the depository of my own breast. My arguments will be open to all, and may be judged of by all. 
They shall at least be offered in a spirit which will not disgrace the cause of truth. I propose, in a series of papers, to discuss the following interesting particulars: THE UTILITY OF THE UNION TO YOUR POLITICAL PROSPERITY THE INSUFFICIENCY OF THE PRESENT CONFEDERATION TO PRESERVE THAT UNION THE NECESSITY OF A GOVERNMENT AT LEAST EQUALLY ENERGETIC WITH THE ONE PROPOSED, TO THE ATTAINMENT OF THIS OBJECT THE CONFORMITY OF THE PROPOSED CONSTITUTION TO THE TRUE PRINCIPLES OF REPUBLICAN GOVERNMENT ITS ANALOGY TO YOUR OWN STATE CONSTITUTION and lastly, THE ADDITIONAL SECURITY WHICH ITS ADOPTION WILL AFFORD TO THE PRESERVATION OF THAT SPECIES OF GOVERNMENT, TO LIBERTY, AND TO PROPERTY. In the progress of this discussion I shall endeavor to give a satisfactory answer to all the objections which shall have made their appearance, that may seem to have any claim to your attention. It may perhaps be thought superfluous to offer arguments to prove the utility of the UNION, a point, no doubt, deeply engraved on the hearts of the great body of the people in every State, and one, which it may be imagined, has no adversaries. But the fact is, that we already hear it whispered in the private circles of those who oppose the new Constitution, that the thirteen States are of too great extent for any general system, and that we must of necessity resort to separate confederacies of distinct portions of the whole.1 This doctrine will, in all probability, be gradually propagated, till it has votaries enough to countenance an open avowal of it. For nothing can be more evident, to those who are able to take an enlarged view of the subject, than the alternative of an adoption of the new Constitution or a dismemberment of the Union. It will therefore be of use to begin by examining the advantages of that Union, the certain evils, and the probable dangers, to which every State will be exposed from its dissolution. This shall accordingly constitute the subject of my next address. PUBLIUS. 
1 The same idea, tracing the arguments to their consequences, is held out in several of the late publications against the new Constitution.
Now tokenize the text. This is the same “script” used in other examples. Be careful with the order of the operations; you cannot replace “FEDERALIST” after moving to lower case, for example. Notice that the stopwords remain since the use of these may indicate the style of an author.
replace <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
toLower <- content_transformer(function(text) tolower(text))
FederalistCorpus <- tm_map(FederalistCorpus, removeWords, c('FEDERALIST', 'No', 'No.'))
FederalistCorpus <- tm_map(FederalistCorpus, toLower)
FederalistCorpus <- tm_map(FederalistCorpus, toSpace, '-|/|,|\\.')
FederalistCorpus <- tm_map(FederalistCorpus, removePunctuation)
FederalistCorpus <- tm_map(FederalistCorpus, removeNumbers)
FederalistCorpus <- tm_map(FederalistCorpus, stripWhitespace)
inspect(FederalistCorpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 9355
general introduction for the independent journal to the people of the state of new york after an unequivocal experience of the inefficacy of the subsisting federal government you are called upon to deliberate on a new constitution for the united states of america the subject speaks its own importance comprehending in its consequences nothing less than the existence of the union the safety and welfare of the parts of which it is composed the fate of an empire in many respects the most interesting in the world it has been frequently remarked that it seems to have been reserved to the people of this country by their conduct and example to decide the important question whether societies of men are really capable or not of establishing good government from reflection and choice or whether they are forever destined to depend for their political constitutions on accident and force if there be any truth in the remark the crisis at which we are arrived may with propriety be regarded as the era in which that decision is to be made and a wrong election of the part we shall act may in this view deserve to be considered as the general misfortune of mankind this idea will add the inducements of philanthropy to those of patriotism to heighten the solicitude which all considerate and good men must feel for the event happy will it be if our choice should be directed by a judicious estimate of our true interests unperplexed and unbiased by considerations not connected with the public good but this is a thing more ardently to be wished than seriously to be expected the plan offered to our deliberations affects too many particular interests innovates upon too many local institutions not to involve in its discussion a variety of objects foreign to its merits and of views passions and prejudices little favorable to the discovery of truth among the most formidable of the obstacles which the new constitution will have to encounter may readily be distinguished the obvious interest of a 
certain class of men in every state to resist all changes which may hazard a diminution of the power emolument and consequence of the offices they hold under the state establishments and the perverted ambition of another class of men who will either hope to aggrandize themselves by the confusions of their country or will flatter themselves with fairer prospects of elevation from the subdivision of the empire into several partial confederacies than from its union under one government it is not however my design to dwell upon observations of this nature i am well aware that it would be disingenuous to resolve indiscriminately the opposition of any set of men merely because their situations might subject them to suspicion into interested or ambitious views candor will oblige us to admit that even such men may be actuated by upright intentions and it cannot be doubted that much of the opposition which has made its appearance or may hereafter make its appearance will spring from sources blameless at least if not respectable the honest errors of minds led astray by preconceived jealousies and fears so numerous indeed and so powerful are the causes which serve to give a false bias to the judgment that we upon many occasions see wise and good men on the wrong as well as on the right side of questions of the first magnitude to society this circumstance if duly attended to would furnish a lesson of moderation to those who are ever so much persuaded of their being in the right in any controversy and a further reason for caution in this respect might be drawn from the reflection that we are not always sure that those who advocate the truth are influenced by purer principles than their antagonists ambition avarice personal animosity party opposition and many other motives not more laudable than these are apt to operate as well upon those who support as those who oppose the right side of a question were there not even these inducements to moderation nothing could be more ill 
judged than that intolerant spirit which has at all times characterized political parties for in politics as in religion it is equally absurd to aim at making proselytes by fire and sword heresies in either can rarely be cured by persecution and yet however just these sentiments will be allowed to be we have already sufficient indications that it will happen in this as in all former cases of great national discussion a torrent of angry and malignant passions will be let loose to judge from the conduct of the opposite parties we shall be led to conclude that they will mutually hope to evince the justness of their opinions and to increase the number of their converts by the loudness of their declamations and the bitterness of their invectives an enlightened zeal for the energy and efficiency of government will be stigmatized as the offspring of a temper fond of despotic power and hostile to the principles of liberty an over scrupulous jealousy of danger to the rights of the people which is more commonly the fault of the head than of the heart will be represented as mere pretense and artifice the stale bait for popularity at the expense of the public good it will be forgotten on the one hand that jealousy is the usual concomitant of love and that the noble enthusiasm of liberty is apt to be infected with a spirit of narrow and illiberal distrust on the other hand it will be equally forgotten that the vigor of government is essential to the security of liberty that in the contemplation of a sound and well informed judgment their interest can never be separated and that a dangerous ambition more often lurks behind the specious mask of zeal for the rights of the people than under the forbidden appearance of zeal for the firmness and efficiency of government history will teach us that the former has been found a much more certain road to the introduction of despotism than the latter and that of those men who have overturned the liberties of republics the greatest number 
have begun their career by paying an obsequious court to the people commencing demagogues and ending tyrants in the course of the preceding observations i have had an eye my fellow citizens to putting you upon your guard against all attempts from whatever quarter to influence your decision in a matter of the utmost moment to your welfare by any impressions other than those which may result from the evidence of truth you will no doubt at the same time have collected from the general scope of them that they proceed from a source not unfriendly to the new constitution yes my countrymen i own to you that after having given it an attentive consideration i am clearly of opinion it is your interest to adopt it i am convinced that this is the safest course for your liberty your dignity and your happiness i affect not reserves which i do not feel i will not amuse you with an appearance of deliberation when i have decided i frankly acknowledge to you my convictions and i will freely lay before you the reasons on which they are founded the consciousness of good intentions disdains ambiguity i shall not however multiply professions on this head my motives must remain in the depository of my own breast my arguments will be open to all and may be judged of by all they shall at least be offered in a spirit which will not disgrace the cause of truth i propose in a series of papers to discuss the following interesting particulars the utility of the union to your political prosperity the insufficiency of the present confederation to preserve that union the necessity of a government at least equally energetic with the one proposed to the attainment of this object the conformity of the proposed constitution to the true principles of republican government its analogy to your own state constitution and lastly the additional security which its adoption will afford to the preservation of that species of government to liberty and to property in the progress of this discussion i shall 
endeavor to give a satisfactory answer to all the objections which shall have made their appearance that may seem to have any claim to your attention it may perhaps be thought superfluous to offer arguments to prove the utility of the union a point no doubt deeply engraved on the hearts of the great body of the people in every state and one which it may be imagined has no adversaries but the fact is that we already hear it whispered in the private circles of those who oppose the new constitution that the thirteen states are of too great extent for any general system and that we must of necessity resort to separate confederacies of distinct portions of the whole this doctrine will in all probability be gradually propagated till it has votaries enough to countenance an open avowal of it for nothing can be more evident to those who are able to take an enlarged view of the subject than the alternative of an adoption of the new constitution or a dismemberment of the union it will therefore be of use to begin by examining the advantages of that union the certain evils and the probable dangers to which every state will be exposed from its dissolution this shall accordingly constitute the subject of my next address publius the same idea tracing the arguments to their consequences is held out in several of the late publications against the new constitution
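The order of the cleaning steps matters because the removal of “FEDERALIST” is case-sensitive. The sketch below shows this on a single made-up string. It uses plain gsub as a simplified stand-in for removeWords (which matches whole words), so treat it as an illustration of the ordering issue rather than a copy of what tm does.

```r
x <- "FEDERALIST. No. 1 General Introduction"

# Case-sensitive removal first, then lower-casing
a <- tolower(gsub("FEDERALIST", "", x, fixed = TRUE))

# Lower-casing first: the upper-case pattern no longer matches anything
b <- gsub("FEDERALIST", "", tolower(x), fixed = TRUE)

a   # the header word is gone
b   # "federalist" survives
```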
We can start to do statistics now. The document term matrix contains the counts of every word type in each document. Each row represents a Federalist Paper, and each column is a word type. Because most of the matrix entries are zeros, it is held in “sparse” format. Only 8% of the elements in the document term matrix are not zero. (Notice that you cannot recover the source corpus from the document term matrix. This matrix represents each document as a “bag of words”.)
dtm <- DocumentTermMatrix(FederalistCorpus)
dtm
<<DocumentTermMatrix (documents: 85, terms: 8590)>>
Non-/sparse entries: 58211/671939
Sparsity : 92%
Maximal term length: 19
Weighting : term frequency (tf)
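To see what the document term matrix holds, here is a toy version built by hand in base R from three made-up documents. Splitting on spaces is my stand-in for the tokenizer in tm, so this is only a sketch of the idea, not of how DocumentTermMatrix works internally.

```r
# Three made-up "documents", already cleaned
docs <- c("the union of the states", "the people of the union", "liberty")

tokens <- strsplit(docs, " ")             # word tokens in each document
types  <- sort(unique(unlist(tokens)))    # vocabulary of word types

# One row per document, one column per type; entries are counts
m <- t(sapply(tokens, function(tk) table(factor(tk, levels = types))))
m

mean(m != 0)       # share of non-zero cells
1 - mean(m != 0)   # sparsity
```

The word order within each document is lost once the counts are tabulated, which is the sense in which each row is a “bag of words”.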
It is now simple to use matrix functions from R to find the number of words in each document and the number of times each type appears (albeit at the cost of converting the sparse matrix into a dense matrix).
ni <- rowSums(as.matrix(dtm)) # tokens in each document
mj <- colSums(as.matrix(dtm)) # columns are named by the word types; frequency of each
There are many more words in these papers than in a wine review. (qplot is the ggplot version of plot.)
qplot(1:85, ni, xlab="Federalist Paper", ylab="Word Count")
The frequency counts in mj are named and in alphabetical order. We can use these names to produce a nice bar graph with ggplot.
Freq <- tibble(type = names(mj), count = mj) # ggplot and dplyr want data frames
Freq %>%
top_n(25, count) %>%
mutate(type=reorder(type,count)) %>% # rather than alphabetical
ggplot(aes(type,count)) + geom_col() + coord_flip()
Let’s see what happens without the stop words.
Freq %>%
filter (!type %in% stopwords('english')) %>% # syntax of %in% resembles %>%
top_n(25, count) %>%
mutate(type=reorder(type,count)) %>% # rather than alphabetical
ggplot(aes(type,count)) + geom_col() + coord_flip()
This is a good chance to check out whether this text matches a Zipf distribution.
Recall that a Zipf distribution is characterized by a power law: the frequency of the \(k\)th most common word is inversely proportional to its rank, \(p_k \propto 1/k\). Said differently, the frequency of the second most common word is half that of the most common, the frequency of the third is one-third that of the most common, etc. A little algebra shows that for this to occur, \(\log p_k \approx b_0 - \log k\). That is, a plot of the log of the frequencies should be linear in the log of the rank \(k\), with slope -1. In this example, the fitted slope (about -0.87) is larger than -1, that is, shallower, but the plot is quite linear.
zipf_plot(mj)
Call:
lm(formula = ly ~ lx, data = df[1:min(n.fit, nrow(df)), ])
Coefficients:
(Intercept) lx
9.1633 -0.8661
The least squares slope for all of the data would be steeper, being dominated by the many less common words.
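As a check on the algebra, the sketch below builds an ideal Zipf distribution and fits the same kind of log-log regression. If \(p_k = c/k\), then \(\log p_k = \log c - \log k\), so the slope should be -1. This is not the zipf_plot function from text_utils.R, just a base R illustration with a vocabulary size (1000) that I picked arbitrarily.

```r
k <- 1:1000
p <- (1 / k) / sum(1 / k)     # ideal Zipf probabilities, p_k proportional to 1/k

fit <- lm(log(p) ~ log(k))    # regress log frequency on log rank
coef(fit)                     # slope is -1 up to rounding error

p[2] / p[1]                   # the second most common is half as frequent as the first
```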
We can compare the vocabulary used by Hamilton to that used by Madison by picking out the papers known to have been written by one or the other. These counts form the basis of a naive Bayes model. It is of some concern that we have so many fewer papers written by Madison.
dtm.madison <- dtm[FederalistPapers$author=='MADISON',]
dim(dtm.madison)
[1] 15 8590
dtm.hamilton <- dtm[FederalistPapers$author=='HAMILTON',]
dim(dtm.hamilton)
[1] 51 8590
mj.madison <- colSums(as.matrix(dtm.madison)) # type frequencies for each
mj.hamilton <- colSums(as.matrix(dtm.hamilton))
Counts <- bind_rows( # dplyr style stacks these
tibble(author="Madison", word=names(mj.madison), count=mj.madison), # tibble() replaces the deprecated data_frame()
tibble(author="Hamilton", word=names(mj.hamilton), count=mj.hamilton))
Counts
Counts of words by author are confounded with the frequency of authorship: Hamilton wrote more of the Federalist Papers. (We have too few for John Jay. Plus he is not considered in the running for writing the papers of unknown authorship.) The Tidy Text book has many examples of this style of plotting produced by ggplot.
Counts %>%
filter(300 < count) %>%
ggplot(aes(word,count)) + geom_col() + coord_flip() +
facet_wrap(~author)
Proportions make more sense.
Counts %>%
group_by(author) %>%
mutate(proportion = count / sum(count)) %>%
filter(0.005 < proportion) %>%
ggplot(aes(word,proportion)) + geom_col() + coord_flip() +
facet_wrap(~author)
A scatterplot offers yet another way to view these data. Rather than look at bar charts, plot the proportions for Hamilton versus those for Madison. A scatterplot requires two variables: one for the Hamilton proportions and another for the Madison proportions. Our data so far has just one column of proportions, a layout that facilitates using ggplot. The function spread in tidyr splits that column into two, one per author.
Proportions <- Counts %>%
group_by(author) %>%
mutate(proportion = count / sum(count)) %>%
select(-count) %>% # drop count, else each word stays on separate rows
spread(key=author, value=proportion)
Proportions
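As a small illustration of what spread does here, the following sketch uses a made-up table (the words and counts are invented for the example). It also shows why the count column is dropped first: if it stays, spread treats it as an identifying column and the two authors' rows for a word do not merge.

```r
library(dplyr)
library(tidyr)

# Toy long-format table: one row per (author, word)
Toy <- tibble(
  author = c("Madison", "Madison", "Hamilton", "Hamilton"),
  word   = c("upon", "whilst", "upon", "whilst"),
  count  = c(1, 3, 8, 2)
)

ToyProps <- Toy %>%
  group_by(author) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup()

# Dropping count first gives one row per word, one column per author
Wide <- ToyProps %>% select(-count) %>% spread(key = author, value = proportion)
Wide

# Keeping count leaves a separate row for each (word, count) pair, padded with NA
Messy <- ToyProps %>% spread(key = author, value = proportion)
Messy
```

In current versions of tidyr, pivot_wider(names_from = author, values_from = proportion) is the recommended replacement for spread.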
A quick check that these are indeed probability distributions.
colSums(Proportions[,2:3], na.rm=TRUE)
Hamilton Madison
1 1
Proportions %>%
filter(0.001 < Hamilton & 0.001 < Madison) %>%
ggplot(aes(x=Hamilton, y=Madison)) +
geom_abline(color = "gray40", lty = 2) +
geom_point(color='lightgray') +
geom_text(aes(label = word), check_overlap=TRUE, vjust=1.5) +
scale_y_log10() + scale_x_log10()
We can use these distributions of word types to classify the documents of disputed authorship.
Naive Bayes allows us to convert these distributions over the word types into a classifier. The idea is intuitive and works like this. Ignoring issues of sampling variation (as if we had computed the word frequencies from a very large corpus), for every word type \(W_j\) we “know” the probability \(P_{author}(W_j)\) for \(author \in \{Hamilton, Madison\}\). Here come two big assumptions: we’re going to ignore the order of the words and then pretend that they occur independently, conditional on knowing the author.
There’s a rationale that supports this approach, and this rationale uses Bayes’ Theorem (hence the name). Given a document \(D = \{w_1, w_2, \ldots, w_i, \ldots, w_n\}\) – a sequence of word tokens – we want to assign the document to a class, namely identify the author as either Hamilton or Madison. The optimal solution is to assign based on the maximal posterior probability, \(P(author|D)\). But how can we find that conditional probability? Bayes’ Rule: \(P(author|D) = P(D|author)P(author)/P(D)\). The normalizing factor \(P(D)\) is constant (does not depend on the author), so we need to find the author that maximizes \(P(D|author)P(author)\). The prior probability \(P(author)\) is something we can defer to historians (or just set to 1/2), but the likelihood \(P(D|author)\) is harder.
What should we use for \(P(D|author) = P(\{w_1, w_2, \ldots, w_i, \ldots, w_n\}|author)\)? That’s easy if we’re willing to assume the word choices are independent given the author: \[ P(\{w_1, w_2, \ldots, w_n\}|author) = \prod_{i=1}^n P(w_i|author)\]
This expression explains why this is called “naive” Bayes! Do you really think the choice of the next word is independent of the others, given that you know the author? That said, this assumption makes the probability easy to compute because we have both the proportions and the counts for the various documents.
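Taking logs turns the product into a sum, and grouping repeated tokens by type gives \(\log P(D|author) = \sum_j c_j \log P(W_j|author)\), where \(c_j\) is the number of times type \(W_j\) appears in \(D\) (the symbol \(c_j\) is introduced here for illustration). A minimal sketch with invented probabilities and a three-word vocabulary:

```r
# Invented word-type probabilities for two hypothetical authors
p_A <- c(upon = 0.6, whilst = 0.1, the = 0.3)
p_B <- c(upon = 0.1, whilst = 0.5, the = 0.4)

# A toy document as a sequence of tokens
doc <- c("the", "upon", "upon", "the", "whilst", "upon")

# Token by token: sum of log P(w_i | author)
ll_tokens_A <- sum(log(p_A[doc]))

# Type by type: counts times log probabilities
counts <- table(factor(doc, levels = names(p_A)))
ll_types_A <- sum(as.vector(counts) * log(p_A))
ll_types_B <- sum(as.vector(counts) * log(p_B))

all.equal(ll_tokens_A, ll_types_A)   # the two forms agree
ll_types_A > ll_types_B              # author A is favored for this document
```

The type-by-type form is the one that matches a row of a document-term matrix, since each row already holds the counts \(c_j\).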
The only catch is what to do if, say, Paper #49 has a word that, say, Hamilton never used in the papers he is known to have written. Should this make the probability of Hamilton being the author zero? There is an elaborate collection of ways to handle such out-of-vocabulary words. Good-Turing smoothing replaces the zero (and shifts other small probabilities as well). (Yes, this is the same Alan Turing as in the recent movie.) The function good_turing_probabilities (from the text_utils.R collection) does the needed adjustment.
prob.madison <- good_turing_probabilities(mj.madison )
prob.hamilton <- good_turing_probabilities(mj.hamilton)
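The source of good_turing_probabilities is not shown in these notes, so the following is only a rough sketch of the underlying idea, not that function. It uses the simplest Good-Turing estimate: the total probability of unseen types is taken to be the share of tokens whose type occurred exactly once, and that mass is split evenly over the zero-count types. The helper simple_gt and the counts are hypothetical.

```r
# Hypothetical helper; NOT the good_turing_probabilities() from text_utils.R
simple_gt <- function(counts) {
  N  <- sum(counts)                   # total tokens
  n0 <- sum(counts == 0)              # types never seen for this author
  n1 <- sum(counts == 1)              # types seen exactly once
  p0 <- if (n0 > 0) n1 / N else 0     # estimated total mass of unseen types
  p  <- (1 - p0) * counts / N         # shrink the observed proportions
  p[counts == 0] <- p0 / max(n0, 1)   # share the unseen mass equally
  p
}

# Invented counts: five observed types and four unseen ones
counts <- c(the = 10, of = 5, union = 2, faction = 1, liberty = 1,
            senate = 0, treaty = 0, navy = 0, militia = 0)
p <- simple_gt(counts)
round(p, 4)
sum(p)          # still a probability distribution
all(p > 0)      # no zeros, so log(p) is finite
```

A fuller implementation would also smooth the counts of counts and adjust the other small counts, which this sketch does not attempt.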
The log-probability is now easy to compute. (Be careful: we want larger values, but log probabilities are negative, so larger means closer to zero.) Let’s start with papers of known authorship. Paper #1 is by Hamilton, and naive Bayes agrees.
C <- as.matrix(dtm)
paper <- 1
sum(log(prob.hamilton)*C[paper,])
[1] -8243.313
sum(log(prob.madison) *C[paper,])
[1] -8426.641
Madison wrote Paper #10, and again naive Bayes agrees.
paper <- 10
sum(log(prob.hamilton)*C[paper,])
[1] -15701.96
sum(log(prob.madison) *C[paper,])
[1] -15277.86
For Paper #49 (of debated authorship), naive Bayes gives the authorship nod to Madison, albeit by a much smaller margin than for the papers of known authorship (which were used to build the probabilities used by naive Bayes).
paper <- 49
sum(log(prob.hamilton)*C[paper,])
[1] -8284.31
sum(log(prob.madison) *C[paper,])
[1] -8264.077
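With only two candidate authors, Bayes’ rule turns the two log-likelihoods into a posterior probability through the logistic function of their difference plus the log prior odds. A sketch using the values printed above for Paper #49 and an assumed prior of 1/2 for each author (the variable names are illustrative):

```r
# Log-likelihoods for Paper #49, copied from the output above
ll_hamilton <- -8284.31
ll_madison  <- -8264.077

prior_hamilton <- 0.5    # assumed prior; a historian might choose otherwise

# Posterior log odds for Hamilton = log-likelihood ratio + prior log odds
log_odds <- (ll_hamilton - ll_madison) +
  log(prior_hamilton / (1 - prior_hamilton))

# plogis(x) = 1 / (1 + exp(-x)) converts log odds to a probability
post_hamilton <- plogis(log_odds)
post_hamilton
```

Because the independence assumption treats every token as separate evidence, posteriors computed this way tend to be far more extreme than is believable. The sign and relative size of the difference are more informative than the probability itself.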
We can make a nice plot that summarizes these results for all of the papers. Matrix multiplication avoids looping over the papers.
dim(dtm)
[1] 85 8590
length(prob.hamilton)
[1] 8590
lp.hamilton <- C %*% log(prob.hamilton)
lp.madison <- C %*% log(prob.madison)
diff <- lp.hamilton - lp.madison
diff[c(1,10,49)]
[1] 183.32795 -424.09451 -20.23272
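The matrix product works because row \(i\) of C times the vector of log probabilities is exactly the per-paper sum computed earlier. A small check of that equivalence on a made-up count matrix (C_toy and p_toy are invented for the example):

```r
# Toy document-term matrix: 3 documents, 4 word types
C_toy <- matrix(c(2, 0, 1, 3,
                  0, 4, 1, 0,
                  1, 1, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("a", "b", "c", "d")))
p_toy <- c(a = 0.4, b = 0.3, c = 0.2, d = 0.1)

# One matrix product gives every document's log-likelihood at once
lp_matrix <- as.vector(C_toy %*% log(p_toy))

# The same numbers, one document at a time
lp_loop <- sapply(seq_len(nrow(C_toy)),
                  function(i) sum(log(p_toy) * C_toy[i, ]))

all.equal(lp_matrix, lp_loop)
```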
Naive Bayes assigns most – but not all – of the disputed papers to Madison. The Wiki would differ! What would LSA do?
tibble(paper=1:85, author=FederalistPapers$author, diff=as.vector(diff)) %>%
ggplot(aes(paper,diff,color=author)) +
geom_point() + labs(y="Log Likelihood Ratio")