rm_frequent_words: Delete rows in a text.table where the number of identical records within a group is more than a certain threshold

Description

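Deletes rows in a text.table where the number of identical records within a group exceeds a threshold. The threshold is given by max_count, either as an absolute count or, if max_count_is_ratio is TRUE, as a ratio of a total count.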

Usage

rm_frequent_words(
  x,
  text,
  count_col_name = NULL,
  group_by = c(),
  max_count,
  max_count_is_ratio = FALSE,
  total_count_col = NULL
)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x used to determine deletion of rows based on the term frequency.

count_col_name

A string, the name to assign to the new column containing the count of each word. If NULL, does not return the counts.

group_by

A vector of column names to group by. List columns cannot be used as grouping columns.

max_count

A number, the maximum number of times a word can occur and still be kept. Rows for words that occur more often than this are deleted.

max_count_is_ratio

TRUE/FALSE. If TRUE, the value passed to max_count is treated as a ratio rather than an absolute count.

total_count_col

A string, the name of the column containing the denominator (likely the total count of records within a group) used to calculate the ratio of a word's count to the total. Only used if max_count_is_ratio is TRUE.
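
The ratio option can be easier to follow with a small worked case. The following is a minimal base R sketch of the idea, not the package's implementation: the data frame words, its columns doc, word, count and total, and the threshold max_ratio are all hypothetical names chosen for illustration. It assumes the denominator is the number of rows in each group, which is what the description of total_count_col suggests as the likely case.

```r
# Hypothetical long-format data: one row per word, grouped by `doc`.
words <- data.frame(
  doc  = c("a", "a", "a", "a", "b", "b"),
  word = c("the", "the", "the", "dog", "the", "cat"),
  stringsAsFactors = FALSE
)

# Count of each word within its group, repeated on every matching row.
words$count <- ave(seq_along(words$word), words$doc, words$word, FUN = length)

# Assumed denominator: total number of rows in each group
# (the role played by total_count_col).
words$total <- ave(seq_along(words$word), words$doc, FUN = length)

# Keep only rows whose within-group share does not exceed the threshold.
max_ratio <- 0.5
kept <- words[words$count / words$total <= max_ratio, ]
print(kept)
```

In group "a" the word "the" makes up 3 of 4 rows (0.75), so its rows are dropped, while in group "b" each word makes up exactly half of the rows and is kept.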

Value

A text.table with the rows whose count exceeds the threshold deleted.
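
To make the effect of an absolute threshold concrete, here is a minimal base R sketch of count-based filtering. It reuses the two sentences from the Examples section but does not call the package. The object names (sentences, tokens, words, kept) are hypothetical, and it assumes that with no grouping columns the counts are taken over the whole table. Whether rm_frequent_words returns exactly these rows is likewise an assumption.

```r
# The two lower-cased sentences from the Examples section.
sentences <- c(
  a = "the dog is nice because it picked up the newspaper.",
  b = "the dog is extremely nice because it does the dishes."
)

# Split on single spaces, giving one row per word (as split = " " suggests).
tokens <- strsplit(sentences, " ", fixed = TRUE)
words <- data.frame(
  col1 = rep(names(tokens), lengths(tokens)),
  col2 = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Count how often each word occurs across the whole table (no grouping).
words$count <- ave(seq_along(words$col2), words$col2, FUN = length)

# Delete rows whose word occurs more than max_count times.
max_count <- 1
kept <- words[words$count <= max_count, ]
print(kept)
```

Under these assumptions only the words that appear once survive: "picked", "up" and "newspaper." from the first sentence, and "extremely", "does" and "dishes." from the second.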

Examples

# NOT RUN {
rm_frequent_words(
  as.text.table(
    x = as.data.table(
      list(
        col1 = c("a", "b"),
        col2 = c(
          tolower("The dog is nice because it picked up the newspaper."),
          tolower("The dog is extremely nice because it does the dishes.")
        )
      )
    ),
    text = "col2",
    split = " "
  ),
  text = "col2",
  count_col_name = "count",
  max_count = 1
)
# }
