prepare_documents.Rd
Simple text preprocessor, intended mainly for example purposes.
prepare_documents(data, ...)

# S3 method for data.frame
prepare_documents(data, text, doc_id = NULL, min_freq = 1,
  lexicon = c("SMART", "snowball", "onix"), ..., return_doc_id = FALSE)

# S3 method for character
prepare_documents(data, doc_id = NULL, min_freq = 1,
  lexicon = c("SMART", "snowball", "onix"), ..., return_doc_id = FALSE)

# S3 method for factor
prepare_documents(data, doc_id = NULL, min_freq = 1,
  lexicon = c("SMART", "snowball", "onix"), ..., return_doc_id = FALSE)
Argument | Description |
---|---|
data | A data.frame, character vector, or factor containing the documents. |
... | Any other parameters. |
text | A bare column name or a vector of documents. |
doc_id | Ids of the documents. If omitted, they are created dynamically, assuming each element of the input is a document. |
min_freq | Minimum term frequency to keep terms in. |
lexicon | Name of a lexicon of stopwords, borrowed from stop_words. |
return_doc_id | Whether to return document id (named list). |
A named list of documents, where the names are the document ids.
Tokenises each document, removes punctuation, stop words, and digits,
and keeps only terms that appear more than min_freq times
across documents.
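
The steps described above can be approximated in base R. The following is a minimal sketch, not the package's actual implementation: the helper name `sketch_prepare`, the tiny inline stop word list, and the lowercasing step are assumptions made for illustration, and the filter follows the literal wording "more than min_freq".

```r
# Hypothetical helper illustrating the described pipeline in base R.
# This is NOT the implementation of prepare_documents().
sketch_prepare <- function(docs,
                           stopwords = c("the", "a", "is", "and", "on"),
                           min_freq = 1) {
  docs <- as.character(docs)

  # Lowercase (an assumption), then replace punctuation and digits with spaces
  cleaned <- gsub("[[:punct:][:digit:]]+", " ", tolower(docs))

  # Tokenise on whitespace
  tokens <- strsplit(trimws(cleaned), "\\s+")

  # Drop empty strings and stop words
  tokens <- lapply(tokens, function(x) x[nzchar(x) & !(x %in% stopwords)])

  # Count terms across all documents and keep those that appear
  # more than min_freq times (literal reading of the description)
  freq <- table(unlist(tokens))
  keep <- names(freq)[freq > min_freq]
  tokens <- lapply(tokens, function(x) x[x %in% keep])

  # Create document ids from positions, since none are supplied here
  names(tokens) <- as.character(seq_along(tokens))
  tokens
}

docs <- c(
  "The cat sat on the mat.",
  "A cat and a dog, 2 of them!",
  "The dog is on the mat"
)
out <- sketch_prepare(docs, min_freq = 1)
str(out)
```

In this sketch, only "cat", "dog", and "mat" survive, because every other remaining term appears just once across the three documents. A real stop word lexicon such as "SMART" is far larger than the five words used here.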