Reproducing the gensim tutorial on similarity queries.
First preprocess the corpus.
```r
library(gensimr)

set.seed(42) # reproducibility

# sample data
data(corpus, package = "gensimr")
print(corpus)
#> [1] "Human machine interface for lab abc computer applications"
#> [2] "A survey of user opinion of computer system response time"
#> [3] "The EPS user interface management system"
#> [4] "System and human system engineering testing of EPS"
#> [5] "Relation of user perceived response time to error measurement"
#> [6] "The generation of random binary unordered trees"
#> [7] "The intersection graph of paths in trees"
#> [8] "Graph minors IV Widths of trees and well quasi ordering"
#> [9] "Graph minors A survey"

# preprocess corpus
docs <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after preprocessing

docs[[1]] # print first preprocessed document
#> [[1]]
#> [1] "human"
#>
#> [[2]]
#> [1] "interface"
#>
#> [[3]]
#> [1] "computer"
```
Once preprocessed, we can build a dictionary.
A dictionary essentially assigns an integer to each term.
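A minimal sketch of that step, assuming the dictionary is built from the preprocessed documents with `corpora_dictionary()` (gensimr's wrapper around gensim's `corpora.Dictionary`):

```r
# build a dictionary mapping each unique token to an integer id
dictionary <- corpora_dictionary(docs)
```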
`doc2bow` simply applies the method of the same name to every document (see the example below): it counts the number of occurrences of each distinct word, converts the word to its integer word id, and returns the result as a sparse vector.
```r
# native method on a single document
dictionary$doc2bow(docs[[1]])
#> [(0, 1), (1, 1), (2, 1)]

# apply to all documents
corpus_bow <- doc2bow(dictionary, docs)
```
Then serialise to the matrix market format. The function returns the path to the file (the corpus is saved on disk for efficiency); if no path is passed, a temp file is created. Here we set `auto_delete` to `FALSE`, otherwise the corpus is deleted after its first use. Note that this means you should manually delete it with `delete_mmcorpus` once you are done (see the clean-up step at the end).
```r
(corpus_mm <- serialize_mmcorpus(corpus_bow, auto_delete = FALSE))
#> ℹ Path: /var/folders/n9/ys9t1h091jq80g4hww24v8g0n7v578/T//Rtmpl238UF/file440237cac478.mm
#> ✔ Temp file
#> ✖ Delete after use
```
Then initialise a model. We're going to use a Latent Semantic Indexing method later on (`model_lsi`), which requires tf-idf.
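A minimal sketch of that initialisation, assuming the tf-idf model is built from the serialised corpus with `model_tfidf()`:

```r
# initialise a tf-idf model from the serialised corpus
tfidf <- model_tfidf(corpus_mm)
```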
We can then use the model to transform our original corpus.
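Assuming gensimr's `wrap()` applies a model to a corpus, the transformation would look like this; the result is the `corpus_transformed` object used below:

```r
# apply the tf-idf model to the bag-of-words corpus
corpus_transformed <- wrap(tfidf, corpus_bow)
```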
Now we can compute the similarity between our preprocessed corpus and a new document.
```r
lsi <- model_lsi(corpus_transformed, id2word = dictionary)

mm <- read_serialized_mmcorpus(corpus_mm)

# preprocess the new document and project it into the LSI space
new_document <- "A human and computer interaction"
preprocessed_new_document <- preprocess(new_document, min_freq = 0)
vec_bow <- doc2bow(dictionary, preprocessed_new_document)
vec_lsi <- wrap(lsi, vec_bow)

# index the corpus and query the new document against it
wrapped_lsi <- wrap(lsi, mm)
index <- similarity_matrix(wrapped_lsi)
sims <- wrap(index, vec_lsi)

get_similarity(sims)
#> # A tibble: 9 x 2
#>     doc    cosine
#>   <dbl>     <dbl>
#> 1     0  6.02e- 1
#> 2     3  4.32e- 1
#> 3     8  1.12e- 8
#> 4     5  0.
#> 5     6 -9.31e-10
#> 6     7 -7.45e- 9
#> 7     1 -2.39e- 2
#> 8     2 -2.93e- 2
#> 9     4 -3.62e- 2
```
You can also compare the documents in the corpus with one another. The method is slightly different in order to improve computational efficiency. Note that we set the number of features to the number of words in the dictionary.
The visualisation (a matrix of cosine similarities) reveals the two clusters of documents again: as stated in table 2 of this paper, the example corpus (`data(corpus)`) essentially contains two classes of documents. The first five are about human-computer interaction and the other four are about graphs.
```r
# build the similarity index; the number of features is
# the number of words in the dictionary
index2 <- similarity(corpus_mm, num_features = reticulate::py_len(dictionary))

# query all pairwise similarities
sims <- wrap(index2, corpus_bow, to_r = TRUE)
sims_long <- reshape2::melt(sims)

library(ggplot2)
library(magrittr) # for the pipe

sims_long %>%
  dplyr::mutate_at(dplyr::vars(c("Var1", "Var2")), as.factor) %>%
  ggplot(aes(Var1, Var2)) +
  geom_tile(aes(fill = value)) +
  theme(
    panel.background = element_rect(fill = "#f4f1e6"),
    plot.background = element_rect(fill = "#f4f1e6"),
    legend.background = element_rect(fill = "#f4f1e6")
  ) +
  xlab("Document") +
  ylab("Document")
```
Finally, clean up: delete the serialised corpus from disk.
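A minimal sketch of that clean-up, assuming `delete_mmcorpus()` removes the file created by `serialize_mmcorpus()`:

```r
# remove the serialised corpus file from disk
delete_mmcorpus(corpus_mm)
```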