textTools (version 0.1.0)

Functions for Text Cleansing and Text Analysis

Description

A framework for text cleansing and analysis. Conveniently prepare and process large amounts of text for analysis. Includes various metrics for word counts/frequencies that scale efficiently. Quickly analyze large amounts of text data using a text.table (a data.table created with one word (or unit of text analysis) per row, similar to the tidytext format). Offers flexibility to efficiently work with text data stored in vectors as well as text data formatted as a text.table.
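
The one-word-per-row layout described above can be sketched in base R. This is only an illustration of the shape of the data, not the package's implementation: it uses a plain data.frame as a stand-in for a data.table, and the column names `id`, `text`, and `word` are assumptions made for this example.

```r
# Hypothetical input: two short documents keyed by an id column.
docs <- data.frame(
  id = c(1L, 2L),
  text = c("the quick brown fox", "the lazy dog"),
  stringsAsFactors = FALSE
)

# Split each string on runs of whitespace, giving one character vector per document.
words <- strsplit(docs$text, "\\s+")

# Repeat each id once per word so that every row holds exactly one word.
long <- data.frame(
  id = rep(docs$id, lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)

print(long)
```

In the package itself, as.text.table is the function listed for this conversion; its arguments are not shown on this page, so they are not reproduced here.
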

Install

install.packages('textTools')

Version

0.1.0

License

GPL (>= 2)

Maintainer

Timothy Conwell

Last Published

February 5th, 2021

Functions in textTools (0.1.0)

regex_sentence

Regular expression that might be used to split strings of text into component sentences.
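
As a rough illustration of how a sentence-splitting regular expression can be used, the sketch below passes a pattern to base R's strsplit. The pattern shown is an assumption written for this example and is not necessarily the value stored in regex_sentence.

```r
# Hypothetical pattern: split on whitespace that follows ., ! or ?
# (an assumption for illustration, not the package's regex_sentence).
sentence_pattern <- "(?<=[.!?])\\s+"

text <- "Text arrives. It gets cleaned! Is it ready?"

# perl = TRUE is needed for the lookbehind; strsplit returns a list, so take element 1.
sentences <- strsplit(text, sentence_pattern, perl = TRUE)[[1]]

print(sentences)
```

A simple pattern like this one will also split after abbreviations such as "Dr.", which is one reason a packaged expression may be more elaborate.
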
rm_infrequent_words

Delete rows in a text.table where the number of identical records within a group is less than a certain threshold
str_counts

Create a list of a vector of unique words found in x and a vector of the counts of each word in x.
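
The structure described here, a list holding the unique words and their counts, can be approximated in base R with table(). This is a sketch only: the element names `words` and `counts` are assumptions, and the package may order or name its output differently.

```r
# Hypothetical vector of words.
x <- c("a", "b", "a", "c", "a")

# table() tallies each distinct value; its names are sorted alphabetically.
tab <- table(x)

# Pair the unique words with their counts in a list (element names are assumed).
word_counts <- list(
  words = names(tab),
  counts = as.integer(tab)
)

print(word_counts)
```
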
rm_long_words

Delete rows in a text.table where the word has more than a maximum number of characters
str_extract_positional_match

Extract words from a vector that are found in the same position in another vector.
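
Positional matching compares two vectors index by index, rather than asking whether a word appears anywhere in the other vector. A minimal base R sketch of the idea, assuming both vectors have the same length (how the package handles unequal lengths is not stated here):

```r
# Two hypothetical word vectors of equal length.
x <- c("the", "cat", "sat", "down")
y <- c("the", "dog", "sat", "up")

# TRUE where the word at position i in x equals the word at position i in y.
same_position <- x == y

# Extract the positionally matching words, and count them.
matched <- x[same_position]
n_matched <- sum(same_position)

print(matched)
print(n_matched)
```

Negating `same_position` gives the corresponding "nomatch" variants of the same idea.
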
str_dt_col_combine

Combine columns of a data.table into a list in a new column; wraps list(unlist(c(...))).
flag_words

Flag rows in a text.table with specific words
str_count_positional_match

Count words from a vector that are found in the same position in another vector.
rm_overlap

Delete rows in a text.table where the records within a group are also found in other groups (overlapping records)
str_count_nomatch

Count the words in a vector that are not found in another vector.
rm_no_overlap

Delete rows in a text.table where the records within a group are not also found in other groups (non-overlapping records)
sampleStr

Generates (pseudo)random strings of the specified character length
stopwords

Vector of lowercase English stop words.
rm_regexp_match

Delete rows in a text.table where the record has a certain pattern indicated by a regular expression
str_count_jaccard_similarity

Calculates the Jaccard similarity of two vectors of words: the size of their intersection divided by the size of their union.
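
The ratio of intersection to union can be written directly in base R. The helper below, `jaccard`, is a hypothetical name used for illustration and is not the package function; returning NA for two empty inputs is likewise an assumption of this sketch.

```r
# Hypothetical helper: Jaccard similarity of two word vectors.
jaccard <- function(x, y) {
  n_union <- length(union(x, y))        # union() drops duplicates
  if (n_union == 0L) return(NA_real_)   # assumed behavior for two empty inputs
  length(intersect(x, y)) / n_union     # shared unique words / all unique words
}

a <- c("red", "green", "blue")
b <- c("green", "blue", "yellow")

# Two shared words out of four distinct words overall.
similarity <- jaccard(a, b)
print(similarity)
```
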
str_count_positional_nomatch

Count words from a vector that are not found in the same position in another vector.
rm_parts_of_speech

Delete rows in a text.table where the word has a certain part of speech
str_rm_blank_space

Remove and replace excess white space from strings.
rm_words

Remove rows from a text.table with specific words
str_rm_long_words

Remove words from a vector that have more than a maximum number of characters.
str_rm_non_alphanumeric

Remove and replace non-alphanumeric characters from strings.
str_weighted_count_match

Weighted count of the words in a vector that are found in another vector.
str_count_match

Count the words in a vector that are found in another vector.
str_rm_words

Remove words from a vector of words found in another vector of words.
str_any_match

Detect if there are any words in a vector also found in another vector.
str_rm_non_printable

Remove and replace non-printable characters from strings.
rm_short_words

Delete rows in a text.table where the word has fewer than a minimum number of characters
str_rm_numbers

Remove and replace numbers from strings.
str_rm_punctuation

Remove and replace punctuation from strings.
str_count_intersect

Count the intersecting words in a vector that are found in another vector (only counts unique words).
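
The note about unique words marks the difference between a set-based count and a count over every element. The base R sketch below illustrates that distinction under the assumption that a plain "match" count includes repeated words; it does not call the package functions.

```r
# Hypothetical vectors; "a" is repeated in x.
x <- c("a", "a", "b", "c")
y <- c("a", "c", "d")

# Count every element of x that appears in y (repeats included).
n_match <- sum(x %in% y)

# Count only the distinct words shared by x and y.
n_intersect <- length(intersect(x, y))

# Count the distinct words of x that are absent from y.
n_setdiff <- length(setdiff(x, y))

print(c(match = n_match, intersect = n_intersect, setdiff = n_setdiff))
```
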
str_rm_words_by_length

Remove words from a vector based on the number of characters in each word.
str_count_setdiff

Count the words in a vector that don't intersect with another vector (only counts unique words).
str_extract_match

Extract words from a vector that are found in another vector.
str_extract_nomatch

Extract words from a vector that are not found in another vector.
str_rm_short_words

Remove words from a vector that don't have a minimum number of characters.
str_rm_regexp_match

Remove words from a vector that match a regular expression.
str_extract_positional_nomatch

Extract words from a vector that are not found in the same position in another vector.
str_stopwords_by_part_of_speech

Create a vector of English words associated with particular parts of speech.
str_tolower

Calls base::tolower(), which converts letters to lowercase. Only included to point out that base::tolower exists and should be used directly.
rm_frequent_words

Delete rows in a text.table where the number of identical records within a group is more than a certain threshold
ngrams

Create n-grams
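
An n-gram is a run of n consecutive words. The following base R sketch shows one way to build them from a word vector; `make_ngrams` is a hypothetical helper written for this example, and the package's ngrams function may take different arguments or return a different structure.

```r
# Hypothetical helper: build space-separated n-grams from a vector of words.
make_ngrams <- function(words, n = 2L) {
  # Too few words to form even one n-gram.
  if (length(words) < n) return(character(0))
  # Each n-gram starts at positions 1 .. length(words) - n + 1.
  starts <- seq_len(length(words) - n + 1L)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1L)], collapse = " "),
    character(1)
  )
}

bigrams <- make_ngrams(c("the", "quick", "brown", "fox"), n = 2L)
print(bigrams)
```
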
regex_word

Regular expression that might be used to split strings of text into component words.
label_parts_of_speech

Add a column with the parts of speech for each word in a text.table
as.text.table

Convert a data.table column of character vectors into a column with one row per word, grouped by a grouping column. Optionally splits a column of strings into vectors of constituents first.
pos

Parts of speech for English words from the Moby Project.
regex_paragraph

Regular expression that might be used to split strings of text into component paragraphs.
l_pos

Parts of speech for English words from the Moby Project.