Prepare Documents

A simple text preprocessor, intended mainly for example purposes.

prepare_documents(data, ...)

# S3 method for data.frame
prepare_documents(data, text, doc_id = NULL,
  min_freq = 1, lexicon = c("SMART", "snowball", "onix"), ...,
  return_doc_id = FALSE)

# S3 method for character
prepare_documents(data, doc_id = NULL,
  min_freq = 1, lexicon = c("SMART", "snowball", "onix"), ...,
  return_doc_id = FALSE)

# S3 method for factor
prepare_documents(data, doc_id = NULL, min_freq = 1,
  lexicon = c("SMART", "snowball", "onix"), ..., return_doc_id = FALSE)

Arguments

data	A `data.frame` containing `text` and `id`, where each row represents a document, or a `character` vector in which each element is a document.
...	Any other parameters.
text	A bare column name or a vector of documents.
doc_id	Ids of the documents. If omitted, they are created dynamically, assuming each element of `text` is a document.
min_freq	Minimum term frequency required to keep a term.
lexicon	Name of a stop-word lexicon, borrowed from `stop_words`.
return_doc_id	Whether to return the document ids (named list).

Value

A named list of documents, where the names are the document ids.

Details

Tokenises each document, removes punctuation, stop words, and digits, and keeps only terms that appear more than `min_freq` times across documents.
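
The steps described here can be sketched in base R. This is an illustration only, not the package's implementation: `sketch_prepare()` and its short `stopwords` vector are hypothetical stand-ins (the real function draws on a named lexicon), lower-casing is an added assumption, and the strict "more than `min_freq`" comparison simply follows the wording above.

```r
# Hypothetical helper illustrating the described steps; not the package's code.
sketch_prepare <- function(docs, doc_id = NULL, min_freq = 1,
                           stopwords = c("the", "a", "an", "and", "of",
                                         "is", "on", "in")) {
  # Create ids dynamically when none are supplied, one per document.
  if (is.null(doc_id)) doc_id <- seq_along(docs)

  # Lower-case (an assumption), then replace punctuation and digits with spaces.
  clean <- gsub("[[:punct:][:digit:]]+", " ", tolower(docs))

  # Tokenise on whitespace, dropping empty tokens and stop words.
  tokens <- lapply(strsplit(clean, "[[:space:]]+"), function(x) {
    x[nzchar(x) & !(x %in% stopwords)]
  })

  # Count each term across all documents and keep those that appear
  # strictly more than min_freq times (the actual function may differ).
  freq <- table(unlist(tokens))
  keep <- names(freq)[freq > min_freq]
  tokens <- lapply(tokens, function(x) x[x %in% keep])

  # Return a named list whose names are the document ids.
  stats::setNames(tokens, doc_id)
}

docs <- c("The cat sat on the mat.",
          "A cat and a dog, 2 of them!",
          "The dog is on the mat.")
out <- sketch_prepare(docs, doc_id = c("d1", "d2", "d3"))
```

With these three toy documents and `min_freq = 1`, only terms occurring at least twice across the corpus survive, so `out$d1` holds `"cat"` and `"mat"`, while `"sat"` and `"them"` are dropped.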
