hyperparameters.Rmd
Note that there is no universal way to assess the best number of topics (num_topics) to fit to a set of documents; see this post.
As stated in Table 2 of this paper, this corpus essentially has two classes of documents: the first five are about human-computer interaction and the other four are about graphs. A process for assessing the best number of topics for this corpus should therefore return 2.
library(gensimr)
data("corpus", package = "gensimr")
texts <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after perprocessing
dictionary <- corpora_dictionary(texts)
corpus_bow <- doc2bow(dictionary, texts)
tfidf <- model_tfidf(corpus_bow, id2word = dictionary)
corpus_tfidf <- wrap(tfidf, corpus_bow)
We can fit multiple Latent Dirichlet Allocation models, each with a different number of topics, then assess which is best using the perplexity score.
models <- map_model(
num_topics = c(2, 4, 8, 10, 12),
corpus = corpus_tfidf,
id2word = dictionary
)
plot(models)
get_perplexity_data(models)
#> # A tibble: 5 x 3
#>   num_topics perplexity model
#>        <int>      <dbl> <list>
#> 1          2      -3.51 <gns...LM>
#> 2          4      -4.25 <gns...LM>
#> 3          8      -5.18 <gns...LM>
#> 4         10      -5.44 <gns...LM>
#> 5         12      -5.71 <gns...LM>
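The scores above can be turned into a choice of num_topics with a few lines of base R. This is a minimal sketch, not part of gensimr: the `scores` data frame and the `log_perplexity` column name are made up for illustration, with values copied by hand from the output above. It assumes the reported score is a per-word log-likelihood bound of the kind gensim's `log_perplexity()` returns, so that values closer to zero are better and the perplexity itself is 2 raised to the negated bound. If the score is computed differently, the selection rule would need to change.

```r
# Scores copied by hand from the tibble above (model column omitted).
# `scores` and `log_perplexity` are illustrative names, not gensimr objects.
scores <- data.frame(
  num_topics     = c(2L, 4L, 8L, 10L, 12L),
  log_perplexity = c(-3.51, -4.25, -5.18, -5.44, -5.71)
)

# ASSUMPTION: the score is a per-word log-likelihood bound, so
# perplexity = 2^(-bound) and a lower perplexity means a better fit.
scores$perplexity <- 2^(-scores$log_perplexity)

# Keep the number of topics with the highest bound (lowest perplexity).
best <- scores$num_topics[which.max(scores$log_perplexity)]
best
```

Under that assumption the selected value matches the two classes of documents described at the start of this page.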