preprocessing.Rmd
First we preprocess the corpus using example data, a tiny corpus of 9 documents, reproducing the tutorial on corpora and vector spaces.
library(gensimr)
set.seed(42) # reproducibility
# sample data
data(corpus, package = "gensimr")
print(corpus)
#> [1] "Human machine interface for lab abc computer applications"
#> [2] "A survey of user opinion of computer system response time"
#> [3] "The EPS user interface management system"
#> [4] "System and human system engineering testing of EPS"
#> [5] "Relation of user perceived response time to error measurement"
#> [6] "The generation of random binary unordered trees"
#> [7] "The intersection graph of paths in trees"
#> [8] "Graph minors IV Widths of trees and well quasi ordering"
#> [9] "Graph minors A survey"
# preprocess corpus
docs <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after perprocessing
This produces the same output as the built-in prepared documents.
common_texts()
#> [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
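The prepared documents above can be approximated in base R. This is a minimal sketch, not gensimr's implementation: it assumes the steps of the original tutorial (lowercase, split on whitespace, drop a small stop list, then drop words that occur only once in the whole corpus). The names `texts`, `stoplist` and `docs_sketch` are illustrative only.

```r
# Illustrative base R sketch; it does not call gensimr or Python.
texts <- c(
  "Human machine interface for lab abc computer applications",
  "A survey of user opinion of computer system response time",
  "The EPS user interface management system",
  "System and human system engineering testing of EPS",
  "Relation of user perceived response time to error measurement",
  "The generation of random binary unordered trees",
  "The intersection graph of paths in trees",
  "Graph minors IV Widths of trees and well quasi ordering",
  "Graph minors A survey"
)

# Assumed stop list, taken from the tutorial this vignette reproduces
stoplist <- c("for", "a", "of", "the", "and", "to", "in")

# Lowercase, split on whitespace, drop stop words
tokens <- strsplit(tolower(texts), "\\s+")
tokens <- lapply(tokens, function(x) x[!x %in% stoplist])

# Keep only words that appear more than once across the whole corpus
freq <- table(unlist(tokens))
keep <- names(freq)[as.vector(freq) > 1]
docs_sketch <- lapply(tokens, function(x) x[x %in% keep])

docs_sketch[[1]]
#> [1] "human"     "interface" "computer"
```

The frequency filter is what reduces the sixth document to the single token "trees": every other word in it occurs only once in the corpus.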
The following are methods that work on lists, character vectors and data.frames.
preprocessed <- preprocess(corpus)
preprocessed[[1]]
#> [1] "human" "machin" "interfac" "lab" "abc" "applic"
By default, the function preprocess applies the following filters:
strip_tags
strip_punctuation
strip_multiple_spaces
strip_numeric
remove_stopwords
strip_short
stem_text
To apply only some of them, pass their names to the filters argument, for example leaving out strip_short and stem_text:
preprocessed <- preprocess(corpus, filters = c("strip_tags", "strip_punctuation", "strip_multiple_spaces", "strip_numeric",
"remove_stopwords"))
preprocessed[[1]]
#> [1] "human" "machine" "interface" "lab"
#> [5] "abc" "applications"
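Conceptually, a list of filters behaves like a chain of string functions applied one after another. The sketch below shows that idea in base R. The helpers `strip_punct_sketch`, `squeeze_spaces_sketch` and `apply_filters_sketch` are hypothetical stand-ins and are not part of gensimr.

```r
# Hypothetical stand-ins for two filters (not the gensimr functions)
strip_punct_sketch <- function(s) gsub("[[:punct:]]", " ", s)
squeeze_spaces_sketch <- function(s) gsub("\\s+", " ", s, perl = TRUE)

# Apply each filter in turn, feeding the output of one into the next
apply_filters_sketch <- function(s, filters) {
  Reduce(function(acc, f) f(acc), filters, s)
}

filters <- list(strip_punct_sketch, squeeze_spaces_sketch, tolower)
cleaned <- apply_filters_sketch("Hello,   World!", filters)

# Split the cleaned string into tokens
tokens_sketch <- strsplit(trimws(cleaned), " ")[[1]]
tokens_sketch
#> [1] "hello" "world"
```

Because each filter only sees the output of the previous one, dropping a filter from the list (as with stem_text above) leaves the remaining steps unchanged.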
Remove stopwords.
remove_stopwords(corpus[[1]])
#> [1] "Human machine interface lab abc applications"
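As a rough illustration of the idea, stop word removal can be sketched in base R as splitting on whitespace and dropping exact matches. The stop list `stops_sketch` below is a tiny hypothetical subset chosen for this example; the list gensimr actually uses is much longer, and exact (case-sensitive) matching is an assumption of this sketch.

```r
# Tiny hypothetical stop list, for illustration only
stops_sketch <- c("for", "computer", "a", "of", "the")

remove_stopwords_sketch <- function(s, stops = stops_sketch) {
  words <- strsplit(s, "\\s+")[[1]]
  # Exact matching: "for" is dropped, while "For" would be kept
  paste(words[!words %in% stops], collapse = " ")
}

remove_stopwords_sketch("Human machine interface for lab abc computer applications")
#> [1] "Human machine interface lab abc applications"
```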
Remove short words.
strip_short(corpus[[2]], min_len = 3)
#> [1] "survey user opinion computer system response time"
Add spaces between digits and letters.
split_alphanum("24.0hours7 days365 a1b2c3")
#> [1] "24.0 hours 7 days 365 a 1 b 2 c 3"
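The behaviour shown above can be reproduced with two regular expression substitutions. This is a sketch under the assumption that only lowercase ASCII letters and digits need to be separated; `split_alphanum_sketch` is an illustrative name, not a gensimr function.

```r
split_alphanum_sketch <- function(s) {
  # Insert a space where letters are followed by digits: "hours7" -> "hours 7"
  s <- gsub("([a-z]+)([0-9]+)", "\\1 \\2", s, perl = TRUE)
  # Insert a space where digits are followed by letters: "0hours" -> "0 hours"
  gsub("([0-9]+)([a-z]+)", "\\1 \\2", s, perl = TRUE)
}

split_alphanum_sketch("24.0hours7 days365 a1b2c3")
#> [1] "24.0 hours 7 days 365 a 1 b 2 c 3"
```

Two passes are needed because a single substitution consumes each match: after the first pass "a1b2c3" is "a 1b 2c 3", and the second pass separates the remaining digit-letter pairs.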
Replace punctuation with spaces.
strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
#> [1] "A semicolon is a stronger break than a comma but not as much as a full stop "
Replace non-alphanumeric characters with spaces.
strip_non_alphanum("if-you#can%read$this&then@this#method^works")
#> [1] "if you can read this then this method works"
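A base R approximation of this filter is a single substitution. The sketch assumes ASCII input and treats anything other than a letter or a digit as a separator; `strip_non_alphanum_sketch` is an illustrative name.

```r
strip_non_alphanum_sketch <- function(s) {
  # Replace every character that is not an ASCII letter or digit with a space
  gsub("[^A-Za-z0-9]", " ", s, perl = TRUE)
}

strip_non_alphanum_sketch("if-you#can%read$this&then@this#method^works")
#> [1] "if you can read this then this method works"
```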
Remove repeated whitespace characters (spaces, tabs, line breaks) from the string and turn tabs and line breaks into spaces.
strip_multiple_spaces(paste0("salut", '\r', " les", '\n', " loulous!"))
#> [1] "salut les loulous!"
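The same effect can be sketched in base R by collapsing every run of whitespace into one space. `squeeze_whitespace_sketch` is an illustrative name, not a gensimr function.

```r
squeeze_whitespace_sketch <- function(s) {
  # \s matches spaces, tabs, carriage returns and line feeds
  gsub("\\s+", " ", s, perl = TRUE)
}

squeeze_whitespace_sketch(paste0("salut", "\r", " les", "\n", " loulous!"))
#> [1] "salut les loulous!"
```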