Preprocessing

First we preprocess the corpus using example data, a tiny corpus of 9 documents. Reproducing the tutorial on corpora and vector spaces.

library(gensimr)

set.seed(42) # rerproducability

# sample data
data(corpus, package = "gensimr")

# preprocess corpus
docs <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after perprocessing

Phrases

Automatically detect common phrases – multi-word expressions / word n-grams – from a stream of sentences.

Here we use and example dataset. The idea is that it is saved to a file (on disk) thereby allowing gensim to stream its content which is much more efficient than loading everything in memory before runnig the model.

Let’s look at the content of the example file.

file <- datapath('testcorpus.txt') # example dataset
readLines(file) # just to show you what it looks like
#> [1] "computer human interface"                 
#> [2] "computer response survey system time user"
#> [3] "interface system user eps"                
#> [4] "human system system eps"                  
#> [5] "response time user"                       
#> [6] "trees"                                    
#> [7] "trees graph"                              
#> [8] "trees graph minors"                       
#> [9] "survey graph minors"

We observe that it is very similar to the output of prepare_documents(corpus) (the docs) object in this document. We can now scan the file to build a corpus with text8corpus

sentences <- text8corpus(file)
phrases <- phrases(docs, min_count = 1L, threshold = 1L)

That simple, now we can apply the model to new sentences.

sentence <- list('trees', 'graph', 'minors')
wrap(phrases, sentence)
#> ['trees', 'graph_minors']

We can add vocabulary to an already trained model with.

phrases$add_vocab(list(list("hello", "world"), list("meow")))
#> None

We can create a faster model with.

bigram <- phraser(phrases)
wrap(bigram, sentence)
#> ['trees', 'graph_minors']