Note that there is no universal way to assess the best number of topics (num_topics) to fit to a set of documents; see this post.
As stated in Table 2 of this paper, this corpus essentially has two classes of documents: the first five are about human-computer interaction and the other four are about graphs. Therefore, a process to assess the best number of topics for this corpus should return 2.
library(gensimr)
data("corpus", package = "gensimr")
texts <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after perprocessing
dictionary <- corpora_dictionary(texts)
corpus_bow <- doc2bow(dictionary, texts)
tfidf <- model_tfidf(corpus_bow, id2word = dictionary)
corpus_tfidf <- wrap(tfidf, corpus_bow)
We can fit multiple Latent Dirichlet Allocation models, each with a different number of topics, then assess which is best using the perplexity score.
models <- map_model(
  num_topics = c(2, 4, 8, 10, 12),
  corpus = corpus_tfidf,
  id2word = dictionary
)
plot(models)
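Once a perplexity score is available for each candidate, choosing the number of topics amounts to picking the candidate with the lowest score, assuming the usual convention that lower perplexity indicates a better fit. The base R sketch below illustrates that selection step only. `select_num_topics` is a hypothetical helper, not part of gensimr, and the scores are made-up placeholders rather than output from the models above.

```r
# Hypothetical helper (not part of gensimr): return the candidate number of
# topics whose perplexity score is lowest.
select_num_topics <- function(num_topics, perplexity) {
  stopifnot(length(num_topics) == length(perplexity))
  num_topics[which.min(perplexity)]
}

# Placeholder scores for illustration only; these are NOT results obtained
# from the models fitted above.
candidates <- c(2, 4, 8, 10, 12)
scores <- c(31.2, 35.8, 40.1, 44.6, 47.3)

best <- select_num_topics(candidates, scores)
print(best)
```

On a corpus of only nine documents the scores can be noisy, so it may be worth reading the selected value alongside the plot rather than treating it as definitive.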