An automated cleaning function for spell-checking, de-pluralizing, removing duplicates, and binarizing text data
textcleaner(
  data,
  miss = 99,
  partBY = c("row", "col"),
  dictionary = NULL,
  tolerance = 1
)
data
Matrix or data frame. A dataset of text data. Participant IDs will be automatically identified if they are included. If no IDs are provided, then participants' order in the corresponding row (or column) is used. A message will notify the user how IDs were assigned.
miss
Numeric or character. Value for missing data. Defaults to 99.
partBY
Character. Are participants by row or column? Set to "row" for participants by row or to "col" for participants by column.
dictionary
Character vector. Can be a vector of a corpus or any text for comparison. Dictionary to be used for more efficient text cleaning. Defaults to NULL, which will use the general.dictionary. Use dictionaries() or find.dictionaries() for more options (see the SemNetDictionaries package for more details).
tolerance
Numeric. The distance tolerance set for automatic spell-correction purposes. This function uses stringdist to compute the Damerau-Levenshtein (DL) distance, which is used to determine potential best guesses. Unique words (i.e., n = 1) that are within the (distance) tolerance are automatically output as best.guess responses, which are then passed through word.check.wrapper. If more than one word is within or below the distance tolerance, then these words are provided as potential options. The recommended and default distance tolerance is tolerance = 1, which only spell-corrects a word if there is exactly one word with a DL distance of 1. A short illustration of the DL distance follows these argument descriptions.
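To build intuition for how tolerance = 1 interacts with the DL distance, the sketch below computes distances directly with stringdist(). The misspelled response and the candidate words are made up for illustration; textcleaner performs this comparison internally against the chosen dictionary.
# Illustration only: how a DL-distance tolerance of 1 identifies a unique best guess.
# The response and candidate words below are hypothetical, not from any dictionary.
library(stringdist)

response   <- "aligator"                               # hypothetical misspelling
candidates <- c("alligator", "albatross", "aardvark")  # hypothetical dictionary entries

# Damerau-Levenshtein (DL) distances between the response and each candidate
dl <- stringdist(response, candidates, method = "dl")
names(dl) <- candidates

# With tolerance = 1, only candidates at a DL distance of 1 or less are considered.
# "alligator" is one insertion away (DL distance of 1), so it is the unique best
# guess and would be auto-corrected; the other candidates are well beyond the tolerance.
candidates[dl <= 1]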
This function returns a list containing the following objects:
A matrix of responses where each row represents a participant and each column represents a unique response. A response that a participant has provided is coded '1' and a response that a participant has not provided is coded '0'
A list containing two objects:
clean.resp
A response matrix that has been spell-checked and de-pluralized with duplicates removed. This can be used as a final dataset for analyses (e.g., fluency of responses)
orig.resp
The original response matrix with white spaces before and after responses removed and all upper-case letters converted to lower case
A list containing the following objects:
full
All responses regardless of spell-checking changes
auto
Only the incorrect responses that were changed during spell-check
A list containing two objects:
rows
Identifies removed participants by their row (or column) location in the original data file
ids
Identifies removed participants by their ID (see argument data)
A list where each participant is a list index containing each response that was changed. Participants are identified by their ID (see argument data). This can be used to replicate the cleaning process and to keep track of changes more generally. Participants with NA did not have any changes from their original data, and participants with missing data are removed (see removed$ids). A short example of accessing these output components follows.
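As a brief sketch of working with this output, the lines below assume clean holds a completed textcleaner() run (as in the example at the end of this page); only components named above (spellcheck$auto, removed$rows, removed$ids) are accessed.
# Assumes `clean` was produced by an interactive run, e.g.:
# clean <- textcleaner(raw, partBY = "row", dictionary = "animals")
str(clean, max.level = 1)    # overview of the returned list

clean$spellcheck$auto        # incorrect responses changed during spell-check
clean$removed$rows           # row (or column) locations of removed participants
clean$removed$ids            # IDs of removed participants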
When working through the menu options in textcleaner, mistakes can happen. For instance, you might select REMOVE when all you really wanted to do was RENAME a response. There are a couple of options:
RECOMMENDED
1. You can make a note in your R script about the change you wanted to make (and keep moving through the cleaning process). After the cleaning process is finished, you can check the spellcheck$auto output of textcleaner to see what changes you made. To correct any changes made during the cleaning process, you can use the correct.changes function (a brief sketch of its use follows these options).
NOT RECOMMENDED
2. You can press esc to exit out of a menu selection process. This is NOT recommended because you will lose all changes that you've made up to that point.
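As a minimal sketch, assuming correct.changes() accepts the object returned by textcleaner() (check ?correct.changes in SemNetCleaner for the actual arguments), fixing an earlier menu mistake might look like:
# Assumption: correct.changes() takes the list returned by textcleaner()
# and lets you revise the changes recorded there (e.g., in spellcheck$auto).
# Verify the exact arguments with ?correct.changes before relying on this.
corrected <- correct.changes(clean)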
Hornik, K., & Murdoch, D. (2011). Watch your spelling! The R Journal, 3(2), 22-28. doi:10.32614/RJ-2011-014
# NOT RUN {
# Toy example
raw <- open.animals[c(1:10), -c(1:3)]

# Clean and preprocess data
clean <- textcleaner(raw, partBY = "row", dictionary = "animals")

if(interactive())
{
  # Full test
  clean <- textcleaner(open.animals[, -c(1, 2)], partBY = "row", dictionary = "animals")
}
# }