First we preprocess the corpus using example data, a tiny corpus of 9 documents, reproducing the tutorial on corpora and vector spaces.
library(gensimr)

set.seed(42) # reproducibility

# sample data
data(corpus, package = "gensimr")

# preprocess corpus
docs <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after perprocessing
Word2vec works somewhat differently. The example below is a reproduction of the Kaggle Gensim Word2Vec Tutorial.
# initialise
word2vec <- model_word2vec(size = 100L, window = 5L, min_count = 1L)
word2vec$build_vocab(docs)
#> None
word2vec$train(docs, total_examples = word2vec$corpus_count, epochs = 20L)
#> (76, 580)
word2vec$init_sims(replace = TRUE)
#> None
Now we can explore the model.
word2vec$wv$most_similar(positive = c("interface"))
#> [('computer', 0.23181433975696564), ('graph', 0.11893773078918457), ('minors', 0.09199836105108261), ('eps', 0.06503799557685852), ('user', 0.04753843694925308), ('time', -0.008810970932245255), ('system', -0.011411845684051514), ('response', -0.01997048407793045), ('human', -0.029993511736392975), ('survey', -0.052159011363983154)]
We expect “trees” to be the odd one out: it belongs to a different topic (#2), whereas the other terms belong to topic #1. With a corpus this small, however, the model does not necessarily agree.
word2vec$wv$doesnt_match(c("human", "interface", "trees"))
#> interface
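To make the odd-one-out idea concrete, here is a minimal base R sketch of one common way to pick the outlier: normalise each vector, average them, and return the word least similar to that average. The `unit` and `odd_one_out` helpers and the two-dimensional `toy` vectors are made up for illustration; this is not gensimr's code, and real word vectors would come from the trained model.

```r
# Scale a vector to unit length.
unit <- function(v) v / sqrt(sum(v^2))

# Return the row name of the vector least similar to the mean direction.
odd_one_out <- function(vectors) {
  normed <- t(apply(vectors, 1, unit))    # one unit vector per row
  centre <- unit(colMeans(normed))        # mean direction of all words
  scores <- as.vector(normed %*% centre)  # cosine of each word with the mean
  rownames(vectors)[which.min(scores)]
}

# Hypothetical 2-dimensional "word vectors", invented for this example.
toy <- rbind(
  human     = c(1.0, 0.1),
  interface = c(0.9, 0.2),
  trees     = c(0.1, 1.0)
)

odd_one_out(toy)
```

With these hand-picked vectors the function returns "trees", because its vector points away from the other two. Vectors trained on only nine documents are much noisier, which is a plausible reason for the less intuitive result above.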
Test similarity between words.
word2vec$wv$similarity("human", "trees")
#> 0.024661217
word2vec$wv$similarity("eps", "survey")
#> -0.10218239
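Similarity scores of this kind are cosine similarities between word vectors, which is why they fall between -1 and 1. The base R sketch below shows the computation on hypothetical three-dimensional vectors; the `cosine_similarity` helper and the `v_*` vectors are assumptions made for illustration, not part of gensimr.

```r
# Cosine similarity between two numeric vectors of equal length.
cosine_similarity <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical "word vectors", invented for this example.
v_human    <- c(1, 0, 1)
v_trees    <- c(0, 1, 1)
v_opposite <- c(-1, 0, -1)

cosine_similarity(v_human, v_trees)     # partially aligned: about 0.5
cosine_similarity(v_human, v_human)     # identical direction: about 1
cosine_similarity(v_human, v_opposite)  # opposite direction: about -1
```

Values near zero, like those reported above, indicate that the two vectors are close to orthogonal, that is, the model found little relationship between the words.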
Automatically detect common phrases – multi-word expressions / word n-grams – from a stream of sentences.
Here we use an example dataset. The idea is that it is saved to a file on disk, which allows gensim to stream its content; this is much more efficient than loading everything into memory before running the model.
Let’s look at the content of the example file.
file <- datapath('testcorpus.txt') # example dataset
readLines(file) # just to show you what it looks like
#> "computer human interface"
#> "computer response survey system time user"
#> "interface system user eps"
#> "human system system eps"
#> "response time user"
#> "trees"
#> "trees graph"
#> "trees graph minors"
#> "survey graph minors"
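The benefit of streaming from disk can be illustrated in base R: a connection lets you process one line at a time, so only the current line is held in memory. This sketch is independent of gensimr; the temporary file and the `count_tokens_streaming` helper are made up for the example.

```r
# Write a small temporary file so the example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(
  c("computer human interface", "trees graph minors", "survey graph minors"),
  path
)

# Count whitespace-separated tokens while reading one line at a time.
count_tokens_streaming <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))                  # always release the connection
  total <- 0L
  repeat {
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) break      # end of file reached
    tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
    total <- total + length(tokens)
  }
  total
}

count_tokens_streaming(path)
```

Calling `readLines(path)` without `n` would instead load the whole file at once, which is fine for nine lines but not for a large corpus.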
We observe that it is very similar to the output of prepare_documents (the docs object in this document). We can now scan the file to build a corpus with text8corpus.
sentences <- text8corpus(file)
phrases <- phrases(docs, min_count = 1L, threshold = 1L)
It is that simple. Now we can apply the model to new sentences.
sentence <- list('trees', 'graph', 'minors')
wrap(phrases, sentence)
#> ['trees', 'graph_minors']
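Phrase detection starts from how often words appear next to each other. As a rough illustration, the base R sketch below counts adjacent word pairs in the nine example sentences shown earlier. It is not a reimplementation of the model: gensim's scoring also takes individual word frequencies and the min_count and threshold arguments into account, and the variable names here are assumptions made for the example.

```r
# The nine example sentences, copied from the file shown earlier.
lines <- c(
  "computer human interface",
  "computer response survey system time user",
  "interface system user eps",
  "human system system eps",
  "response time user",
  "trees",
  "trees graph",
  "trees graph minors",
  "survey graph minors"
)
tokens <- strsplit(lines, " ", fixed = TRUE)

# Join each word with its right-hand neighbour; one-word sentences yield none.
bigrams <- unlist(lapply(tokens, function(s) {
  if (length(s) < 2L) return(character(0))
  paste(head(s, -1L), tail(s, -1L), sep = "_")
}))

# Tabulate the pairs, most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
head(bigram_counts, 3)
```

Pairs such as "graph_minors" occur more than once in this corpus, which makes them candidates for being merged into a single token.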
We can add vocabulary to an already trained model with add_vocab.
phrases$add_vocab(list(list("hello", "world"), list("meow")))
#> None
We can create a faster model with phraser.
bigram <- phraser(phrases)
wrap(bigram, sentence)
#> ['trees', 'graph_minors']