Topic Modeling for Humans with gensim.
Large-scale, efficient topic modeling in R and Python.
Below we build a basic Latent Dirichlet Allocation (LDA) model with 2 topics using the example data shipped with the package.
library(gensimr)
# example corpus
data("corpus", package = "gensimr")
# preprocess documents
texts <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after preprocessing
dictionary <- corpora_dictionary(texts)
corpus <- doc2bow(dictionary, texts)
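To make the bag-of-words step concrete, here is a minimal sketch in plain Python (the language gensim itself is implemented in) of what `doc2bow` produces: each document becomes a list of `(token_id, count)` pairs keyed by a shared dictionary. The helper names below are illustrative, not part of the gensimr API.

```python
def build_dictionary(texts):
    """Map each unique token to an integer id (in order of first appearance)."""
    token2id = {}
    for doc in texts:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def doc2bow(token2id, doc):
    """Count tokens in one document, returning sorted (token_id, count) pairs."""
    counts = {}
    for tok in doc:
        counts[token2id[tok]] = counts.get(token2id[tok], 0) + 1
    return sorted(counts.items())

texts = [["human", "interface", "computer"],
         ["graph", "trees", "graph"]]
token2id = build_dictionary(texts)
print(doc2bow(token2id, texts[1]))  # [(3, 2), (4, 1)]
```

This sparse `(id, count)` representation is what the tf-idf and LDA models downstream consume.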
# create tf-idf model
tfidf <- model_tfidf(corpus)
tfidf_corpus <- wrap(tfidf, corpus)
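As a rough sketch of the weighting `model_tfidf` applies: gensim's default scheme multiplies each term count by a base-2 inverse document frequency and L2-normalises each document vector. The plain-Python version below approximates that behaviour and is not the exact library implementation.

```python
import math

def tfidf(bow_corpus):
    """Weight each (token_id, count) corpus by tf * log2(N / df), L2-normalised."""
    n_docs = len(bow_corpus)
    df = {}  # document frequency per token id
    for doc in bow_corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    weighted = []
    for doc in bow_corpus:
        w = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in doc]
        w = [(tid, x) for tid, x in w if x != 0]  # drop zero weights
        norm = math.sqrt(sum(x * x for _, x in w)) or 1.0
        weighted.append([(tid, x / norm) for tid, x in w])
    return weighted

corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 1)]]
print(tfidf(corpus))  # [[(0, 1.0)], [(2, 1.0)]]
```

Note how token 1, which occurs in every document, gets an idf of zero and vanishes: terms shared by all documents carry no discriminative weight.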
# latent dirichlet allocation
lda <- model_lda(tfidf_corpus, id2word = dictionary, num_topics = 2L)
topics <- lda$print_topics() # get topics
Objects returned by the package are not automatically converted to R data structures; use reticulate::py_to_r to convert them, as shown below.
reticulate::py_to_r(topics) # convert to R format
#> [[1]]
#> [[1]][[1]]
#> [1] 0
#>
#> [[1]][[2]]
#> [1] "0.145*\"trees\" + 0.112*\"graph\" + 0.086*\"minors\" + 0.085*\"response\" + 0.082*\"computer\" + 0.081*\"time\" + 0.081*\"survey\" + 0.081*\"system\" + 0.076*\"user\" + 0.062*\"human\""
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 1
#>
#> [[2]][[2]]
#> [1] "0.113*\"interface\" + 0.102*\"system\" + 0.099*\"eps\" + 0.094*\"human\" + 0.085*\"user\" + 0.083*\"minors\" + 0.081*\"trees\" + 0.079*\"graph\" + 0.069*\"survey\" + 0.068*\"time\""
We can then use our model to transform the corpus and extract the document-topic matrix.
corpus_wrapped <- wrap(lda, corpus)
doc_topics <- get_docs_topics(corpus_wrapped)
plot(doc_topics$dimension_1_y, doc_topics$dimension_2_y)
The plot correctly identifies two topics/clusters. As stated in table 2 of this paper, the example corpus (data(corpus)) essentially contains two classes of documents: the first five are about human-computer interaction and the remaining four about graphs.
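To turn these per-document topic probabilities into hard cluster labels, one can simply take the dominant topic for each document. A minimal sketch in Python; the two probability columns mirror those plotted above, but the values and names are illustrative:

```python
def dominant_topic(doc_topics):
    """For each row of per-topic probabilities, return the index of the
    most probable topic."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in doc_topics]

# rows: (P(topic 0), P(topic 1)) per document, illustrative values
probs = [(0.91, 0.09), (0.12, 0.88), (0.55, 0.45)]
print(dominant_topic(probs))  # [0, 1, 0]
```

On a well-separated corpus such as this one, the resulting labels would split the documents into the two classes visible in the plot.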