This notebook describes some basics of text modeling in R. In particular, the example highlights building a document-term matrix using the package tm. The example concludes by finding words associated with low and high values of a numerical variable, in this case, prices of wine.
I use these packages so frequently that I load them into R up front. Others that are occasionally useful will be loaded as needed. (Unlike library, require loads a package only if it was not already present in the working environment.)
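Because require returns TRUE or FALSE, it also enables the common install-if-missing idiom; a minimal sketch (using tm just for illustration):
# require() returns TRUE/FALSE, so it can guard a conditional install
if (!require(tm)) {
  install.packages("tm")
  library(tm)
}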
require(tm)
Loading required package: tm
Loading required package: NLP
require(tidyverse)
Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -------------------------------------------------------------------------
annotate(): ggplot2, NLP
filter(): dplyr, stats
lag(): dplyr, stats
require(stringr)
Loading required package: stringr
Start by reading the wine data. It is in a CSV file. The function read_csv creates a “tidy” data frame that, for example, does not convert text to factors by default. It comes from the readr package, part of the tidyverse. (You will need to download these data to your computer and use the appropriate path.)
Wine <- read_csv("../data/Wine.csv") # I capitalize the names of data frames
Parsed with column specification:
cols(
review = col_integer(),
id = col_integer(),
label = col_character(),
description = col_character(),
type = col_character(),
alcohol = col_integer(),
location = col_character(),
date = col_character(),
rating = col_character(),
variety = col_character(),
vintage = col_integer(),
color = col_character(),
points = col_integer(),
price = col_double()
)
Warning: 4945 parsing failures.
 row     col               expected actual               file
3786 alcohol no trailing characters     .4 '../data/Wine.csv'
3787 alcohol no trailing characters     .2 '../data/Wine.csv'
3788 alcohol no trailing characters     .5 '../data/Wine.csv'
3857 alcohol no trailing characters     .2 '../data/Wine.csv'
3858 alcohol no trailing characters     .4 '../data/Wine.csv'
See problems(...) for more details.
dim(Wine)
[1] 20508 14
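The failures remain available after reading; problems() from readr returns them as a data frame. A quick look (a sketch):
# each row records where the parser stumbled, what it expected, and what it saw
head(problems(Wine))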
The function read_csv gets upset because it expects the alcohol variable to be integer valued, but then it discovers some decimal points. We can get read_csv to ignore this problem by telling it that values of the alcohol variable are doubles (or, perhaps better, fix those data values to remove the decimals).
Wine <- read_csv("../data/Wine.csv", col_types=cols(alcohol='d'))
dim(Wine)
[1] 20508 14
summary(Wine)
review id label description type
Min. : 1 Min. :163522 Length:20508 Length:20508 Length:20508
1st Qu.: 5335 1st Qu.:171463 Class :character Class :character Class :character
Median :10572 Median :179542 Mode :character Mode :character Mode :character
Mean :10538 Mean :180035
3rd Qu.:15758 3rd Qu.:188588
Max. :20888 Max. :198253
alcohol location date rating variety
Min. : 1.00 Length:20508 Length:20508 Length:20508 Length:20508
1st Qu.:13.00 Class :character Class :character Class :character Class :character
Median :13.50 Mode :character Mode :character Mode :character Mode :character
Mean :13.39
3rd Qu.:14.00
Max. :41.00
NA's :537
vintage color points price
Min. :1969 Length:20508 Min. :79.0 Min. : 1.99
1st Qu.:2001 Class :character 1st Qu.:85.0 1st Qu.: 11.80
Median :2004 Mode :character Median :87.0 Median : 16.00
Mean :2004 Mean :86.8 Mean : 21.30
3rd Qu.:2007 3rd Qu.:89.0 3rd Qu.: 25.00
Max. :2011 Max. :99.0 Max. :2006.00
NA's :2075 NA's :179 NA's :1923
Now focus on the column description that holds the tasting notes.
Wine$description[1:4]
[1] "Lemon oil and grapefruit aromas follow through on a medium-bodied palate with impressive wieght and a dry, tart finish."
[2] "Bacon fat, black cherry, dill, oak aromas. A rich entry leads to a moderately full-bodied palate with forward fruit and a finish that offers sleek tannins and fine acidity. A more subtle style of California Syrah."
[3] "Earthy, herbal, slightly herbaceous aromas. A medium-bodied palate leads to a short finish that is earthy, tart and has limited fruit."
[4] "Cedar, cherry tomato, and herbal aromas. A rich entry leads to a moderately full-bodied palate with forward fruit and a big finish that offers ripe fruit, moderate tannins and acidity."
The prices are a bit skewed.
Wine %>%
ggplot(aes(x=price)) + geom_histogram() + scale_x_log10()
And they differ slightly between red and white wines. The comparison is easier to visualize with frequency polygons that are not filled in.
Wine %>%
filter(!is.na(color)) %>%
ggplot(aes(x=price, ..density.., color=color)) + geom_freqpoly() + scale_x_log10()
As usual with R, there are lots of alternative displays, such as side-by-side boxplots that tell a similar story. (ggplot2 generates “pretty” graphs, but you might find the syntax overwhelming at first.)
boxplot(price ~ color, data=Wine, log='y')
The next task seems obvious, but has a lasting impact: What’s a word? Is a number a word? What about punctuation? Are “subject” and “subjects” different words? tm has a collection of tools for taking care of these tasks, but you need to decide which to use.
To get started, put the text into a corpus object. When using tm, a corpus is usually created as a collection of documents is read into R. tm nicely handles many different types of documents, including PDF and Word files.
WineCorpus <- Corpus(VectorSource(Wine$description))
WineCorpus
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 20508
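The same constructor handles files on disk. For instance, a corpus of plain-text files could be built from a directory (a sketch; the path "../data/notes" is hypothetical):
# DirSource treats each matching file in the directory as a separate document
NotesCorpus <- Corpus(DirSource("../data/notes", pattern = "\\.txt$"))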
A corpus object in tm is a decorated list, a list that has been adorned with special attributes when created. That means we can peek into the corpus as if it were a list by referring to the numbered elements of the list (which are the documents).
is.list(WineCorpus)
[1] TRUE
WineCorpus[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 119
To see the text itself, use the inspect command.
inspect(WineCorpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 119
Lemon oil and grapefruit aromas follow through on a medium-bodied palate with impressive wieght and a dry, tart finish.
Now comes the fun: tokenizing the text of the corpus. How should this text be represented as words? There are many choices. (BTW, removeWords only removes selected words, not all of them!)
getTransformations() # defined in tm
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument"
[5] "stripWhitespace"
If something you want to do is not among these – or if you want finer control – define your own content transformation by mimicking the following style.
# to convert a misspelled word
toCorrect <- content_transformer(function(text, from, to) str_replace_all(text, from, to))
# to convert some pattern of text to a space
toSpace <- content_transformer(function(text, pattern) str_replace_all(text, pattern, " "))
# to convert text to lower case
toLower <- content_transformer(function(text) tolower(text))
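Before applying a transformation to all 20,508 documents, it can be worth trying it on a tiny throwaway corpus (a sketch):
# check that the hyphen becomes a space and that case conversion works
tmp <- Corpus(VectorSource("Lemon-oil and GRAPEFRUIT aromas"))
inspect(tm_map(tmp, toSpace, '-')[[1]])
inspect(tm_map(tmp, toLower)[[1]])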
Removing common stop words (such as “the”, “a”, “an”, …) is often done as well, using the function removeWords. Here’s a sample of the stopwords that are defined in tm. (tm has collections of stopwords for other languages, so you have to pick “english” in this case.)
length(stopwords('english'))
[1] 174
stopwords('english') # the full list
[1] "i" "me" "my" "myself" "we" "our" "ours"
[8] "ourselves" "you" "your" "yours" "yourself" "yourselves" "he"
[15] "him" "his" "himself" "she" "her" "hers" "herself"
[22] "it" "its" "itself" "they" "them" "their" "theirs"
[29] "themselves" "what" "which" "who" "whom" "this" "that"
[36] "these" "those" "am" "is" "are" "was" "were"
[43] "be" "been" "being" "have" "has" "had" "having"
[50] "do" "does" "did" "doing" "would" "should" "could"
[57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're"
[64] "they're" "i've" "you've" "we've" "they've" "i'd" "you'd"
[71] "he'd" "she'd" "we'd" "they'd" "i'll" "you'll" "he'll"
[78] "she'll" "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
[85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" "won't"
[92] "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't"
[99] "let's" "that's" "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an" "the" "and"
[113] "but" "if" "or" "because" "as" "until" "while"
[120] "of" "at" "by" "for" "with" "about" "against"
[127] "between" "into" "through" "during" "before" "after" "above"
[134] "below" "to" "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again" "further" "then"
[148] "once" "here" "there" "when" "where" "why" "how"
[155] "all" "any" "both" "each" "few" "more" "most"
[162] "other" "some" "such" "no" "nor" "not" "only"
[169] "own" "same" "so" "than" "too" "very"
Stemming converts words to a base form, removing variations produced by plurals or by adding tense to a verb (trimming off a trailing ‘s’ or ‘es’), resulting in a smaller vocabulary.
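Stemming is not applied below, but here is what it would look like (a sketch; stemDocument relies on the SnowballC package):
# stemDocument maps inflected forms to their stems, e.g. "aromas" to "aroma"
stemDocument(c("aromas", "finishes", "tannins"))
# to stem every document in the corpus:
# WineCorpus <- tm_map(WineCorpus, stemDocument)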
Typically many of these transformations are applied, often with certain words in mind.
WineCorpus <- tm_map(WineCorpus, toLower)
WineCorpus <- tm_map(WineCorpus, toCorrect, "wieght", "weight")
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/') # otherwise runs together
WineCorpus <- tm_map(WineCorpus, removePunctuation) # might not be right (!)
WineCorpus <- tm_map(WineCorpus, stripWhitespace)
# WineCorpus <- tm_map(WineCorpus, removeWords, stopwords("english")) # leave for now
# WineCorpus <- tm_map(WineCorpus, removeNumbers) # not many around
# WineCorpus <- tm_map(WineCorpus, removeWords, c('yuck')) # specific word(s)
inspect(WineCorpus[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 117
lemon oil and grapefruit aromas follow through on a medium bodied palate with impressive weight and a dry tart finish
The key object for our analysis is known as a document term matrix (or, when transposed, a term document matrix). It contains the counts of every word type in each document. Each of the 20,508 rows represents a document, and each of the 5,641 columns identifies a word type. Because most of the matrix entries are zeros, it is held in “sparse” format. (Notice that you cannot recover the source corpus from the document term matrix. This matrix represents each document as a “bag of words”.)
dtm <- DocumentTermMatrix(WineCorpus)
dim(dtm)
[1] 20508 5641
We can start to do statistics now – counting. For example, all but 545,707 elements of the \(20,508 \times 5,641 = 115,685,628 \approx 116\) million counts in the document term matrix are zero. (This is the count if you have not removed the stopwords; the number of types is smaller with the stopwords removed.)
dtm
<<DocumentTermMatrix (documents: 20508, terms: 5641)>>
Non-/sparse entries: 545707/115139921
Sparsity : 100%
Maximal term length: 20
Weighting : term frequency (tf)
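You can peek at a corner of the sparse matrix with inspect (a sketch; the term columns chosen here are arbitrary):
inspect(dtm[1:3, 10:15]) # three documents by six terms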
It is now simple to use matrix functions from R to find the number of words in each document and the number of times each type appears (albeit at the cost of converting the sparse matrix into a dense matrix in order to use rowSums and colSums).
ni <- rowSums(as.matrix(dtm)) # tokens in each document
mj <- colSums(as.matrix(dtm)) # columns are named by the word types; frequency of each
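If that dense conversion strains memory, the slam package (which tm uses internally to store the sparse matrix) computes the same sums without densifying; a sketch:
# row_sums and col_sums operate directly on the sparse representation
ni <- slam::row_sums(dtm)
mj <- slam::col_sums(dtm)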
Check a few of the terms to make sure that the data appear okay. If you don’t spend time getting the data ready, you will find lots of issues.
j <- which.max(str_length(names(mj)))
j
[1] 4149
names(mj)[j]
[1] "blackberrypeppercorn"
To see whether this is real text, you have to find the document that contains it. The count for this type shows that it appears in just one document.
mj[j]
blackberrypeppercorn
1
which(0 != as.vector(dtm[,4149]))
[1] 12333
Here’s the relevant portion of the original source:
“Creme brulee, blackberry,peppercorn, and mocha aromas. A soft, silky…”
There’s no space around that comma between “blackberry” and “peppercorn”, and tm has collapsed the two words together. We could fix this by adding the comma to the list of punctuation characters that are turned into spaces rather than simply removed.
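That fix might look like the following (a sketch; it belongs before removePunctuation when rebuilding the corpus from the raw descriptions):
# convert commas as well as hyphens and slashes into spaces so that
# adjoining words remain separate tokens
WineCorpus <- tm_map(WineCorpus, toSpace, '-|/|,')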
It is hard to imagine a distribution of counts that is more skewed than the counts of the word types (left).
par(mfrow=c(1,2))
hist(mj, breaks=50, main="Counts of Word Types")
hist(ni, breaks=50, main="Words per Document")
Even after taking logs, the counts remain skewed! This is common in text. “Tokens are common, but types are rare.”
hist(log(mj), breaks=50, main="Counts of Word Types")
The frequency counts in mj are named and in alphabetical order. We can use these names to produce a bar graph of the most common words with ggplot. Stopwords such as “and” and “this” are common.
Freq <- data_frame(type = names(mj), count = mj) # ggplot and dplyr want data frames
Freq %>%
top_n(25, count) %>%
mutate(type=reorder(type,count)) %>% # rather than alphabetical order
ggplot(aes(type,count)) + geom_col() + coord_flip()
Let’s see what happens without the stop words. (The following code should run, but dies a horrible death! I suspect R ran out of memory along the way. Instead, to remove the stopwords, use tm_map as shown above.)
Freq %>%
filter(!type %in% stopwords('english')) %>% # be careful about syntax here or it will crash!
top_n(25, count) %>%
mutate(type=reorder(type,count)) %>% # rather than alphabetical order
ggplot(aes(type,count)) + geom_col() + coord_flip()
This is a good chance to check whether the frequencies of the word types match a Zipf distribution, which is commonly associated with text.
A Zipf distribution is characterized by a power law: the frequency of word types is inversely proportional to rank, \(f_k \propto 1/k\). Said differently, the frequency of the second most common word is half that of the most common, the frequency of the third is one-third that of the most common, etc. A little algebra shows why: if \(f_k = c/k\), then \(\log f_k = \log c - \log k\). That is, a plot of the log of the frequencies should be linear in the log of the rank \(k\), with slope near \(-1\).
Freq %>%
arrange(desc(count)) %>% # decreasing by count
mutate(rank=row_number()) %>% # add row number
ggplot(aes(x=log(rank), y=log(count))) +
geom_point() +
# geom_smooth(method='lm', se=FALSE) +
geom_abline(slope=-1, intercept=11, color='red')
The least squares slope (from the geom_smooth that is commented out above; if drawn, it appears in blue) is steeper, being dominated by the many less common words. You can mitigate that effect by weighting the regression by the counts.
Temp <- Freq %>% mutate(rank=row_number())
lm(log(count) ~ log(rank), data=Temp, weights=sqrt(count))
Call:
lm(formula = log(count) ~ log(rank), data = Temp, weights = sqrt(count))
Coefficients:
(Intercept) log(rank)
12.104 -1.166
Before concluding this short introduction, let’s relate the word types to prices. Which word types are associated with pricey wines, and which with cheaper wines?
To find out, combine the information in the document term matrix with prices from the Wine data frame. First fix that weirdo price found using JMP earlier.
max(Wine$price, na.rm=TRUE) # missing values are contagious in R
[1] 2006
i <- which.max(Wine$price) # handles the NA by default
i
[1] 12508
Wine$price[i] <- NA
max(Wine$price, na.rm=TRUE)
[1] 571
To keep things manageable, consider words that appear, say, at least 10 times in the corpus. That’s still 1,742 types. The matrix counts has the counts for each word type. We can find the word with the highest average price by turning these into 0/1 indicators and using a matrix product.
counts <- as.matrix(dtm[,9<mj])
dim(counts)
[1] 20508 1742
count.names <- colnames(counts) # save for later
count.names[1:10]
[1] "and" "aromas" "bodied" "dry" "finish" "follow" "grapefruit"
[8] "impressive" "lemon" "medium"
The word “and” appears many times, so turn these integers into indicators (1 if the word type appears in a note at least once and 0 if not present).
counts[1:10]
[1] 2 2 1 3 2 3 1 1 2 4
pmin does the element-by-element comparison, but it may return the result as a plain vector. So, make sure the 0/1s end up back in a matrix.
counts <- as.matrix(pmin(counts,1),nrow=20508)
counts[1:10]
[1] 1 1 1 1 1 1 1 1 1 1
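An equivalent one-liner keeps the matrix shape by using a logical comparison instead of the pmin/as.matrix pair (a sketch):
# TRUE/FALSE become 1/0 under multiplication; the dim attributes survive
counts <- 1 * (counts > 0)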
A quick check of the prices, identifying those not missing.
not.missing <- ! is.na(Wine$price)
min(Wine$price[not.missing]) # Two-buck chuck?
[1] 1.99
Finally, find the average price of the wines whose descriptions mention each word type.
avg.price <- (Wine$price[not.missing] %*% counts[not.missing,])/colSums(counts[not.missing,])
Note that some averages are ‘NaN’ because a word type does not appear in any wine with a known price. For example, the word promising is common, but it is not found in the descriptions of wines with known prices.
min(avg.price)
[1] NaN
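sort drops these undefined values automatically (its na.last argument defaults to NA), which is why the rankings below work. To count them first (a sketch):
# number of word types never seen in a wine with a known price
sum(is.nan(avg.price))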
Which words come with the high prices?
names(avg.price) <- count.names
sort(avg.price, decreasing=TRUE)[1:20]
incredibly potential finest measures decade requires mignon
82.12375 79.06857 70.57333 69.14000 69.08652 67.20000 67.12108
exquisite filet class spinach champagne barrique underlying
66.09000 62.21878 61.85643 61.73389 60.68243 60.47250 59.61385
rancio unctuous underneath proportioned stream outstanding
59.20750 58.39800 55.92455 55.17400 54.81385 54.52350
And which with low prices?
sort(avg.price)[1:20]
money aromatically dishwater reductive burger chemical carefree
9.035909 9.450000 9.796000 10.323333 10.543548 10.807778 10.902105
quaffer picnic everyday summer value burst meager
11.030993 11.120179 11.190667 11.271585 11.516320 11.656528 11.661429
sherbet price quaffing moscato shirt leaning
11.666667 11.893312 11.988636 12.069231 12.171818 12.328333
These notes continue with this analysis, considering the singular value decomposition of the counts in the document term matrix.