Simple text preprocessor, intended mainly for example purposes.

prepare_documents(data, ...)

# S3 method for data.frame
prepare_documents(data, text, doc_id = NULL,
  min_freq = 1, lexicon = c("SMART", "snowball", "onix"), ...,
  return_doc_id = FALSE)

# S3 method for character
prepare_documents(data, doc_id = NULL,
  min_freq = 1, lexicon = c("SMART", "snowball", "onix"), ...,
  return_doc_id = FALSE)

# S3 method for factor
prepare_documents(data, doc_id = NULL, min_freq = 1,
  lexicon = c("SMART", "snowball", "onix"), ..., return_doc_id = FALSE)

Arguments

data

A data.frame containing text and ids, where each row represents a document, or a character vector in which each element is a document.

...

Any other parameters.

text

A bare column name or a vector of documents.

doc_id

Ids of the documents. If omitted, they are created dynamically, assuming each element of text is a document.

min_freq

Minimum term frequency to keep terms in.

lexicon

Name of a lexicon of stopwords, borrowed from stop_words.

return_doc_id

Whether to return document id (named list).

Value

A named list of documents, where the names are the document ids.

Details

Tokenises each document, removes punctuation, stop words, and digits, then keeps only the terms that appear more than min_freq times across all documents.
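
The steps described above can be sketched in base R. This is a rough illustration, not the package's implementation: the function name sketch_prepare, its stopwords argument, and the short stop word list are hypothetical, and lowercasing the text is an assumption. The strict "more than min_freq" comparison follows the wording of this section.

```r
# Illustrative sketch only: `sketch_prepare` and `stopwords` are hypothetical
# names and do not belong to the package.
sketch_prepare <- function(docs, doc_id = NULL, min_freq = 1,
                           stopwords = c("a", "an", "and", "the", "of", "is", "on")) {
  # Create ids dynamically when none are supplied, one per document.
  if (is.null(doc_id)) doc_id <- seq_along(docs)

  # Lowercase (an assumption), then replace punctuation and digits with spaces.
  cleaned <- gsub("[[:punct:][:digit:]]+", " ", tolower(docs))

  # Tokenise on whitespace; drop empty tokens and stop words.
  tokens <- lapply(strsplit(cleaned, "[[:space:]]+"), function(x) {
    x[nzchar(x) & !x %in% stopwords]
  })

  # Count each term across all documents.
  freq <- table(unlist(tokens))

  # Keep only terms that appear more than `min_freq` times.
  keep <- names(freq)[as.vector(freq) > min_freq]
  tokens <- lapply(tokens, function(x) x[x %in% keep])

  # Return a list named by document id.
  stats::setNames(tokens, doc_id)
}

docs <- c(
  "The cat sat on the mat.",
  "A cat and a dog, 2 friends!",
  "The dog is on the mat"
)
result <- sketch_prepare(docs)
```

With the default min_freq = 1, terms that occur only once across the three documents ("sat" and "friends") are dropped, while "cat", "dog", and "mat" are kept.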