rm_infrequent_words: Delete rows in a text.table where the number of identical records within a group is less than a certain threshold

Description

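Deletes rows from a text.table when the number of identical records within a group is below a threshold. The threshold is given by min_count, either as an absolute count or, when min_count_is_ratio is TRUE, as a ratio of the word count to the value in total_count_col.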

Usage

rm_infrequent_words(
  x,
  text,
  count_col_name = NULL,
  group_by = c(),
  min_count,
  min_count_is_ratio = FALSE,
  total_count_col = NULL
)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x whose term frequency determines which rows are deleted.

count_col_name

A string, the name to assign to the new column containing the count of each word. If NULL, does not return the counts.

group_by

A vector of column names to group by. List columns are not supported as grouping columns.

min_count

A number, the minimum number of times a word must occur to be kept. If min_count_is_ratio is TRUE, this value is treated as a minimum ratio instead.

min_count_is_ratio

TRUE/FALSE. If TRUE, the value passed to min_count is treated as a ratio rather than a count.

total_count_col

A string, the name of the column containing the denominator (typically the total count of records within a group) used to calculate the ratio of a word's count to the total when min_count_is_ratio is TRUE.
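
The ratio mode described by min_count_is_ratio and total_count_col can be pictured with a small sketch in plain data.table. This is not the package's implementation: the table dt, its columns doc, word and total, and the variable min_ratio are all hypothetical, and the sketch assumes that a row is kept when its word count divided by the total is at least the threshold.

```r
library(data.table)

# Hypothetical long-format table: one row per word, with a precomputed
# per-group total playing the role of total_count_col.
dt <- data.table(
  doc   = c("a", "a", "a", "a", "b", "b"),
  word  = c("the", "the", "dog", "the", "cat", "cat"),
  total = c(4, 4, 4, 4, 2, 2)
)

# Count identical words within each group.
dt[, count := .N, by = c("doc", "word")]

# Assumed rule: keep rows whose share of the group total reaches the ratio.
min_ratio <- 0.5
kept <- dt[count / total >= min_ratio]
print(kept)
```

In this sketch "dog" makes up one of the four records in group "a", so its share falls below the threshold and its row is dropped, while "the" and "cat" are kept.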

Value

A text.table with the rows whose duplicate count is below the threshold removed.
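
To make the shape of the result concrete, the following sketch mimics the count-based filtering with plain data.table rather than a real text.table. The table words and its contents are hypothetical, and the grouping and threshold logic is an assumption based on the argument descriptions, not the function's actual source.

```r
library(data.table)

# Hypothetical word-level table: col1 is the grouping column, col2 the word.
words <- data.table(
  col1 = c("a", "a", "a", "b", "b", "b"),
  col2 = c("it", "it", "dog", "it", "dog", "dog")
)

# Count identical words within each col1 group, as count_col_name = "count"
# combined with group_by = "col1" is described as doing.
words[, count := .N, by = c("col1", "col2")]

# Drop rows whose count is below the threshold (analogous to min_count = 2).
result <- words[count >= 2]
print(result)
```

Only "it" in group "a" and "dog" in group "b" occur twice within their group, so those four rows remain and the count column records how often each word occurred.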

Examples

# NOT RUN {
library(data.table) # provides as.data.table()

# Keep only words that occur at least 4 times across the whole table.
rm_infrequent_words(
  as.text.table(
    x = as.data.table(
      list(
        col1 = c("a", "b"),
        col2 = c(
          tolower("The dog is nice because it picked up the newspaper."),
          tolower("The dog is extremely nice because it does the dishes.")
        )
      )
    ),
    text = "col2",
    split = " "
  ),
  text = "col2",
  count_col_name = "count",
  min_count = 4
)

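# Keep only words that occur at least twice within each col1 group.
# paste() joins the long sentences with a single space so that no newline
# or indentation ends up inside the text before it is split on " ".
rm_infrequent_words(
  as.text.table(
    x = as.data.table(
      list(
        col1 = c("a", "b"),
        col2 = c(
          tolower(paste(
            "The dog is nice because it picked up the",
            "newspaper and it is the nice kind of dog."
          )),
          tolower(paste(
            "The dog is extremely nice because it does the dishes",
            "and it is cool."
          ))
        )
      )
    ),
    text = "col2",
    split = " "
  ),
  text = "col2",
  count_col_name = "count",
  group_by = "col1",
  min_count = 2
)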
# }
