Preprocessing

First we preprocess the corpus, using some example data: a tiny corpus of 9 documents. This reproduces the tutorial on corpora and vector spaces.

library(gensimr)

set.seed(42) # reproducibility

# sample data
data(corpus, package = "gensimr")
print(corpus)
#> [1] "Human machine interface for lab abc computer applications"    
#> [2] "A survey of user opinion of computer system response time"    
#> [3] "The EPS user interface management system"                     
#> [4] "System and human system engineering testing of EPS"           
#> [5] "Relation of user perceived response time to error measurement"
#> [6] "The generation of random binary unordered trees"              
#> [7] "The intersection graph of paths in trees"                     
#> [8] "Graph minors IV Widths of trees and well quasi ordering"      
#> [9] "Graph minors A survey"

# preprocess corpus
docs <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after preprocessing

docs[[1]] # print first preprocessed document 
#> [[1]]
#> [1] "human"
#> 
#> [[2]]
#> [1] "interface"
#> 
#> [[3]]
#> [1] "computer"

Once preprocessed we can build a dictionary.

dictionary <- corpora_dictionary(docs)

A dictionary essentially assigns an integer to each term.
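For instance, the word-to-id mapping can be inspected directly; this sketch assumes the dictionary exposes the underlying gensim token2id attribute, as the Python object does.

# inspect the word -> id mapping (token2id is the underlying gensim attribute)
dictionary$token2id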

doc2bow simply applies the method of the same name to every document (see the example below): it counts the number of occurrences of each distinct word, converts the word to its integer word id, and returns the result as a sparse vector of (word id, count) pairs.

# native method to a single document
dictionary$doc2bow(docs[[1]])
#> [(0, 1), (1, 1), (2, 1)]

# apply to all documents
corpus_bow <- doc2bow(dictionary, docs)

We then serialise the corpus to Matrix Market format. The function returns the path to the file (the corpus is saved on disk for efficiency); if no path is passed, a temp file is created. Here we set auto_delete to FALSE, otherwise the corpus would be deleted after first use. Note that this means you should manually delete it with delete_mmcorpus.

(corpus_mm <- serialize_mmcorpus(corpus_bow, auto_delete = FALSE))
#> ℹ Path: /var/folders/n9/ys9t1h091jq80g4hww24v8g0n7v578/T//RtmplbyYz9/file250f1dd646b9.mm 
#>  ✔ Temp file
#>  ✖ Delete after use
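
The serialised corpus can be read back from disk, and cleaned up with delete_mmcorpus once we no longer need it. A minimal sketch, assuming read_serialized_mmcorpus is the gensimr reader for the .mm file:

# read the serialised corpus back from disk
corpus_read <- read_serialized_mmcorpus(corpus_mm)

# delete the file on disk when done with it
delete_mmcorpus(corpus_mm)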

Then we initialise a model; we are going to use a Latent Semantic Indexing method later on (model_lsi), which requires tf-idf.

tfidf <- model_tfidf(corpus_mm)

We can then use the model to transform our original corpus.

corpus_transformed <- wrap(tfidf, corpus_bow)
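
With the tf-idf weighted corpus in hand we could already fit the gensim LSI model mentioned above. A minimal sketch, assuming model_lsi accepts the transformed corpus, an id2word dictionary, and a num_topics argument like its gensim counterpart (the number of topics here is arbitrary):

# fit a gensim LSI model on the tf-idf weighted corpus (sketch)
lsi_gensim <- model_lsi(corpus_transformed, id2word = dictionary, num_topics = 2L)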

Hierarchical Dirichlet Process

hdp <- sklearn_hdp(id2word = dictionary)
vectors <- hdp$fit_transform(corpus_bow)
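
Each document is mapped to a vector of topic weights; a quick sanity check on the shape of the result, assuming fit_transform returns a matrix-like object converted to R:

# one row per document, one column per topic (sketch)
dim(vectors)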

Latent Semantic Indexing

Create stages for our pipeline (including gensim and sklearn models alike).

lsi <- sklearn_lsi(id2word = dictionary, num_topics = 15L)

# L2 reg classifier
clf <- sklearn_logistic(penalty = "l2", C = 0.1, solver = "lbfgs")

# sklearn pipeline
pipe <- sklearn_pipeline(lsi, clf)

# Create some random binary labels for our documents.
labels <- sample(c(0L, 1L), 9, replace = TRUE)

# How well does our pipeline perform on the training set?
pipe$fit(corpus_bow, labels)$score(corpus_bow, labels)
#> 0.7777777777777778
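
Beyond the accuracy score, the fitted pipeline can also return predicted labels; predict is the standard scikit-learn method, assumed here to be reachable on the wrapped pipeline object:

# predicted labels for the training documents
pipe$predict(corpus_bow)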

Word ID Mapping

doc2bow with scikit-learn. Note that in the example below we do not clean the text (no preprocessing).

# initialise
skbow_model <- sklearn_doc2bow()

# fit
corpus_skbow <- skbow_model$fit_transform(corpus)
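
As with the native method, each document is returned as a sparse bag-of-words vector; we can peek at the first one (a sketch, assuming the result subsets like an R list):

# bag-of-words representation of the first (unpreprocessed) document
corpus_skbow[[1]]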