Initialise a model based on the document frequencies of all its features.

model_tfidf(mm, normalize = FALSE, smart = "nfc", pivot = NULL,
  slope = 0.25, ...)

# S3 method for mm_file
model_tfidf(mm, normalize = FALSE, smart = "nfc",
  pivot = NULL, slope = 0.25, ...)

# S3 method for mm
model_tfidf(mm, normalize = FALSE, smart = "nfc",
  pivot = NULL, slope = 0.25, ...)

# S3 method for python.builtin.list
model_tfidf(mm, normalize = FALSE,
  smart = "nfc", pivot = NULL, slope = 0.25, ...)

# S3 method for python.builtin.tuple
model_tfidf(mm, normalize = FALSE,
  smart = "nfc", pivot = NULL, slope = 0.25, ...)

load_tfidf(file)

Arguments

mm

A matrix market as returned by mmcorpus_serialize.

normalize

Normalize document vectors to unit Euclidean length? You can also inject your own function into normalize.

smart

SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example nfc, bpn and so on, where the letters represent, in order, the term frequency weighting, the document frequency weighting and the document normalization of the document vector. See the SMART section.

pivot

Pivot for pivoted document length normalization. You can either set the pivot by hand, or let Gensim determine it automatically in two steps: 1) set either the u or b document normalization in the smart parameter (smartirs in Gensim); 2) set either the corpus or dictionary parameter. The pivot is then determined from the properties of the corpus or dictionary. If pivot is NULL and you don’t follow steps 1 and 2, pivoted document length normalization is disabled.

slope

Setting the slope to 0.0 uses only the pivot as the norm, and setting the slope to 1.0 effectively disables pivoted document length normalization. Singhal [2] suggests setting the slope between 0.2 and 0.3 for best results.

...

Any other options accepted by the underlying Gensim model; see the official Gensim documentation.

file

Path to a saved model.
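
To make the pivot and slope arguments more concrete, here is a minimal base R sketch of pivoted document length normalization in the form usually attributed to Singhal et al. The helper pivoted_norm and the sample norms are hypothetical and not part of this package; the sketch only illustrates the two limiting cases described for slope, and Gensim's actual implementation may differ in detail.

```r
# Hypothetical helper, not part of the package: blend the pivot and the
# original document norm according to the slope.
pivoted_norm <- function(old_norm, pivot, slope = 0.25) {
  (1 - slope) * pivot + slope * old_norm
}

# Made-up document norms (e.g. Euclidean lengths), for illustration only
old_norm <- c(2, 5, 10)
pivot <- mean(old_norm)

# slope = 0: every document is normalized by the pivot alone
pivoted_norm(old_norm, pivot, slope = 0)

# slope = 1: the original norms are returned, so pivoting has no effect
pivoted_norm(old_norm, pivot, slope = 1)

# default slope: norms are pulled towards the pivot
pivoted_norm(old_norm, pivot, slope = 0.25)
```

With a slope between 0 and 1, long documents are penalized less and short documents more than under plain normalization.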

SMART

Term frequency weighting:

  • b - binary

  • t or n - raw

  • a - augmented

  • l - logarithm

  • d - double logarithm

  • L - log average

Document frequency weighting:

  • x or n - none

  • f - idf

  • t - zero-corrected idf

  • p - probabilistic idf

Document normalization:

  • x or n - none

  • c - cosine

  • u - pivoted unique

  • b - pivoted character length
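
As an illustration of how a mnemonic such as nfc combines the three lists above, the following base R sketch computes raw term frequency (n), idf (f) and cosine normalization (c) on a made-up count matrix. It does not call this package or Gensim, and the idf formula log2(N / df) is an assumption about the exact variant used; treat it as a sketch of the idea rather than a reproduction of the model's output.

```r
# Toy term counts: rows are documents, columns are terms (made-up data)
counts <- matrix(
  c(2, 1, 0,
    0, 1, 3,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("apple", "banana", "cherry"))
)

# n: raw term frequency, used as-is
tf <- counts

# f: idf, assumed here to be log2(N / df)
n_docs <- nrow(counts)
doc_freq <- colSums(counts > 0)
idf <- log2(n_docs / doc_freq)

# Weight each term column by its idf
weighted <- sweep(tf, 2, idf, `*`)

# c: cosine normalization, scaling each document to unit Euclidean length
row_norms <- sqrt(rowSums(weighted^2))
tfidf_nfc <- weighted / row_norms

round(tfidf_nfc, 3)
```

Swapping a letter changes only the corresponding step: for example, replacing n with b would binarize the counts before weighting.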

Examples

docs <- prepare_documents(corpus)
#> Preprocessing 9 documents
#> 9 documents after perprocessing
dictionary <- corpora_dictionary(docs)
corpora <- doc2bow(dictionary, docs)

# fit model
tfidf <- model_tfidf(corpora)