Preprocessing

First we preprocess the example data, a tiny corpus of 9 documents, reproducing the tutorial on corpora and vector spaces.

library(gensimr)

set.seed(42) # reproducibility

# sample data
data(corpus, package = "gensimr")
print(corpus)
#> [1] "Human machine interface for lab abc computer applications"    
#> [2] "A survey of user opinion of computer system response time"    
#> [3] "The EPS user interface management system"                     
#> [4] "System and human system engineering testing of EPS"           
#> [5] "Relation of user perceived response time to error measurement"
#> [6] "The generation of random binary unordered trees"              
#> [7] "The intersection graph of paths in trees"                     
#> [8] "Graph minors IV Widths of trees and well quasi ordering"      
#> [9] "Graph minors A survey"

# preprocess corpus
docs <- prepare_documents(corpus)
#> → Preprocessing 9 documents
#> ← 9 documents after perprocessing

This produces the same output as the built-in prepared documents.

common_texts()
#> [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]

Preprocess

The following are methods that work on lists, character vectors and data.frames.

preprocessed <- preprocess(corpus)
preprocessed[[1]]
#> [1] "human"    "machin"   "interfac" "lab"      "abc"      "applic"

By default, preprocess applies the following filters:

strip_tags
strip_punctuation
strip_multiple_spaces
strip_numeric
remove_stopwords
strip_short
stem_text

To apply only some of them, pass their names via the filters argument. The call below leaves out strip_short and stem_text, so the words keep their full form.

preprocessed <- preprocess(corpus, filters = c("strip_tags", "strip_punctuation", "strip_multiple_spaces", "strip_numeric",
    "remove_stopwords"))
preprocessed[[1]]
#> [1] "human"        "machine"      "interface"    "lab"         
#> [5] "abc"          "applications"
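
To make the idea of a filter chain concrete, here is a minimal base-R sketch. It is not how gensimr implements preprocess; the helper names (punct_to_space, drop_digits, squeeze_spaces, apply_filters) are hypothetical, and the regular expressions only approximate three of the filters listed above.

```r
# Hypothetical stand-ins for three of the filters; not gensimr functions.
punct_to_space <- function(s) gsub("[[:punct:]]", " ", s)
drop_digits <- function(s) gsub("[0-9]+", "", s)
squeeze_spaces <- function(s) trimws(gsub("[[:space:]]+", " ", s))

# Lowercase the input, run each filter in order, then split into tokens.
apply_filters <- function(s, filters) {
  s <- Reduce(function(acc, f) f(acc), filters, tolower(s))
  strsplit(s, " ", fixed = TRUE)[[1]]
}

tokens <- apply_filters(
  "Human-machine interface, v2, for lab ABC",
  list(punct_to_space, drop_digits, squeeze_spaces)
)
print(tokens)
```

The tokens come out as human, machine, interface, v, for, lab, abc. Because this sketch has no stopword or length filter, "for" and the leftover "v" survive; adding or removing an entry in the list changes the result in the same way the filters argument does.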

Remove Stopwords

Remove stopwords.

remove_stopwords(corpus[[1]])
#> [1] "Human machine interface lab abc applications"
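
Note that the output above drops "computer" as well as "for", which suggests the stopword list used by the package is broader than a typical function-word list. The base-R sketch below shows the general mechanism only; drop_stopwords and its deliberately tiny stopword vector are hypothetical and are not the list the package uses.

```r
# Hypothetical, deliberately tiny stopword list for illustration only.
stopwords <- c("a", "and", "for", "in", "of", "the", "to")

# Split on single spaces, drop exact (case-sensitive) matches, and rejoin.
drop_stopwords <- function(s, stopwords) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  paste(words[!(words %in% stopwords)], collapse = " ")
}

drop_stopwords("Human machine interface for lab abc computer applications", stopwords)
#> [1] "Human machine interface lab abc computer applications"
```

With this short list "computer" is kept, so the result depends entirely on which words the list contains.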

Strip Short

Removes words shorter than min_len characters.

strip_short(corpus[[2]], min_len = 3)
#> [1] "survey user opinion computer system response time"
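
Filtering words by length can be sketched in a few lines of base R. The function keep_long_words is a hypothetical illustration, not part of gensimr, and it assumes that a word of exactly min_len characters is kept.

```r
# Keep only the words with at least min_len characters (hypothetical helper).
keep_long_words <- function(s, min_len = 3) {
  words <- strsplit(s, "[[:space:]]+")[[1]]
  paste(words[nchar(words) >= min_len], collapse = " ")
}

keep_long_words("The intersection graph of paths in trees", min_len = 4)
#> [1] "intersection graph paths trees"
```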

Split Alphanumerics

Adds spaces between digits and letters.

split_alphanum("24.0hours7 days365 a1b2c3")
#> [1] "24.0 hours 7 days 365 a 1 b 2 c 3"
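
The same result can be reproduced with two regular-expression passes in base R: one that separates letters followed by digits, and one that separates digits followed by letters. This is a sketch under the assumption that only ASCII letters matter; split_letters_digits is a hypothetical name, not a gensimr function.

```r
# Insert a space at every letter/digit boundary, in both directions.
split_letters_digits <- function(s) {
  s <- gsub("([A-Za-z]+)([0-9]+)", "\\1 \\2", s)  # "hours7" -> "hours 7"
  gsub("([0-9]+)([A-Za-z]+)", "\\1 \\2", s)       # "0hours" -> "0 hours"
}

split_letters_digits("24.0hours7 days365 a1b2c3")
#> [1] "24.0 hours 7 days 365 a 1 b 2 c 3"
```

Two passes are needed because in "a1b2c3" the first pass consumes each digit as the end of a match and leaves "1b" and "2c" joined; the second pass splits them.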

Strip Punctuation

Replaces each punctuation character with a space.

strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
#> [1] "A semicolon is a stronger break than a comma  but not as much as a full stop "

Strip Tags

Removes tags.

strip_tags("<i>Hello</i> <b>World</b>!")
#> [1] "Hello World!"

Strip Numerics

Removes digits.

strip_numeric("0text24gensim365test")
#> [1] "textgensimtest"

Strip Non-alphanumerics

Replaces each non-alphanumeric character with a space.

strip_non_alphanum("if-you#can%read$this&then@this#method^works")
#> [1] "if you can read this then this method works"
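
The tag, punctuation, digit and non-alphanumeric filters above each amount to a single pattern substitution. The base-R one-liners below approximate them for comparison. The helper names are hypothetical, and the character classes may not match the package's patterns in every edge case (for example, how an underscore is treated).

```r
# Base-R regex approximations; helper names are hypothetical, not gensimr functions.
tags_out <- function(s) gsub("<[^>]+>", "", s)
punct_to_space <- function(s) gsub("[[:punct:]]", " ", s)
digits_out <- function(s) gsub("[0-9]+", "", s)
non_alnum_to_space <- function(s) gsub("[^[:alnum:]]", " ", s)

tags_out("<i>Hello</i> <b>World</b>!")
#> [1] "Hello World!"
punct_to_space("A comma, then a stop!")
#> [1] "A comma  then a stop "
digits_out("0text24gensim365test")
#> [1] "textgensimtest"
non_alnum_to_space("if-you#can%read")
#> [1] "if you can read"
```

Tags and digits are deleted outright, while punctuation and other symbols are replaced by a space so that neighbouring words do not run together.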

Strip Multiple Spaces

Collapses each run of whitespace characters (spaces, tabs, line breaks) into a single space, so tabs and line breaks become spaces.

strip_multiple_spaces(paste0("salut", '\r', " les", '\n', "         loulous!"))
#> [1] "salut les loulous!"

Stem

Transform to lowercase and stem.

stem_text("It is useful to be able to search a large collection of documents almost instantly.")
#> [1] "it is us to be abl to search a larg collect of document almost instantly."

Porter Stemmer

stemmer <- porter_stemmer()
stemmer$stem_sentence("Cats and ponies have meeting")
#> cat and poni have meet
stemmer$stem_documents(c("Cats and ponies", "have meeting"))
#> ['cat and poni', 'have meet']