An automated cleaning function for spell-checking, de-pluralizing, removing duplicates, and binarizing text data
textcleaner(
data = NULL,
miss = 99,
partBY = c("row", "col"),
dictionary = NULL,
spelling = c("UK", "US"),
add.path = NULL,
keepStrings = FALSE,
allowPunctuations = c("-", "all"),
allowNumbers = FALSE,
lowercase = TRUE,
continue = NULL
)
data
Matrix or data frame. A dataset of text data. Participant IDs will be automatically identified if they are included. If no IDs are provided, then each participant's position in the corresponding row (or column) is used as the ID. A message will notify the user how IDs were assigned
miss
Numeric or character. Value for missing data. Defaults to 99
partBY
Character. Are participants by row or column? Set to "row" for by row. Set to "col" for by column
dictionary
Character vector. Can be a vector of a corpus or any text for comparison. Dictionary to be used for more efficient text cleaning. Defaults to NULL, which will use general.dictionary. Use dictionaries() or find.dictionaries() for more options (see SemNetDictionaries for more details)
spelling
Character vector. English spelling to be used.
"UK"
For British spelling (e.g., colour, grey, programme, theatre)
"US"
For American spelling (e.g., color, gray, program, theater)
add.path
Character. Path to additional dictionaries to be found. To avoid a time-intensive search, folders in the path are NOT searched recursively. Set to "choose" to open an interactive directory explorer
keepStrings
Boolean. Should strings be retained or separated? Defaults to FALSE. Set to TRUE to retain strings as strings
allowPunctuations
Character vector. Allows punctuation characters to be included in responses. Defaults to "-". Set to "all" to keep all punctuation characters
allowNumbers
Boolean. Defaults to FALSE. Set to TRUE to keep numbers in text
lowercase
Boolean. Should words be converted to lowercase? Defaults to TRUE. Set to FALSE to keep words as they are
continue
List. A previously unfinished result that still needs to be completed. Allows you to continue manually spell-checking your data after the function was closed or stopped with an error. Defaults to NULL
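To make the data-layout arguments concrete, here is a minimal base R sketch of what miss, partBY, lowercase, allowPunctuations, and allowNumbers describe. The toy matrix, the participant labels, and the normalize() helper are hypothetical and only approximate the defaults listed above; this is not textcleaner()'s internal code.

```r
# Hypothetical toy data: 3 participants (rows), up to 3 responses each.
# "99" marks a missing response, matching the default miss = 99.
toy <- matrix(
  c("cat",   "Dogs ",       "99",
    " bird", "guinea-pig!", "cat2",
    "99",    "99",          "99"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3"), NULL)
)

# Recode the missing-data value to NA (illustration only)
toy[toy == "99"] <- NA

# Participants with no valid responses at all
all_missing <- rownames(toy)[rowSums(!is.na(toy)) == 0]

# Rough stand-in for lowercase = TRUE, allowPunctuations = "-", and
# allowNumbers = FALSE: lower-case the text, keep only letters, spaces,
# and hyphens, then trim surrounding white space
normalize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^[:alpha:][:space:]-]", "", x)
  trimws(x)
}
normalized <- toy
normalized[] <- normalize(toy)

# The same data with participants by column (the layout for partBY = "col")
by_col <- t(toy)
```

In this sketch, participant p3 has only missing values, which is the kind of case the documentation describes as a removed participant.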
This function returns a list containing the following objects:
binary
A matrix of responses where each row represents a participant and each column represents a unique response. A response that a participant has provided is a '1' and a response that a participant has not provided is a '0'
responses
A list containing two objects:
clean
A response matrix that has been spell-checked and de-pluralized with duplicates removed.
This can be used as a final dataset for analyses (e.g., fluency of responses)
original
The original response matrix with white space before and after each response removed. All upper-case letters are also converted to lower case
spellcheck
A list containing three objects:
full
All responses regardless of spell-checking changes
auto
Only the incorrect responses that were changed during spell-check
removed
A list containing two objects:
rows
Identifies removed participants by their row (or column) location in the original data file
ids
Identifies removed participants by their ID (see argument data)
partChanges
A list where each participant is a list index containing each response that has been changed. Participants are identified by their ID (see argument data). This can be used to replicate the cleaning process and to keep track of changes more generally. Participants with NA did not have any changes from their original data, and participants with missing data are removed (see removed$ids)
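As an illustration of the binary matrix and the fluency analysis mentioned above, the following base R sketch builds both from a small cleaned response matrix. The clean matrix and participant labels are hypothetical, and the code only mirrors the description of the returned objects under that assumption; it is not the package's implementation.

```r
# Hypothetical cleaned responses: one row per participant, NA for no response
clean <- matrix(
  c("cat",  "dog", NA,
    "dog",  "cat", "bird",
    "fish", NA,    NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3"), NULL)
)

# Unique responses across all participants, sorted alphabetically
uniq <- sort(unique(clean[!is.na(clean)]))

# One row per participant, one column per unique response:
# 1 if the participant gave the response, 0 otherwise
binary <- t(apply(clean, 1, function(r) as.integer(uniq %in% r)))
dimnames(binary) <- list(rownames(clean), uniq)

# Fluency: number of responses each participant provided
fluency <- rowSums(!is.na(clean))
```

Here binary["p1", ] marks "cat" and "dog" with 1 and the other responses with 0, and fluency counts two responses for p1.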
Hornik, K., & Murdoch, D. (2011). Watch Your Spelling! The R Journal, 3(2), 22-28.
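# NOT RUN {
# Load the package that provides textcleaner() and the open.animals data
library(SemNetCleaner)

# Toy example: first 10 participants, dropping the first three columns
raw <- open.animals[1:10, -(1:3)]

if (interactive()) {
  # Full test: all participants, dropping the first two columns;
  # textcleaner() may prompt for manual spell-checking
  clean <- textcleaner(open.animals[, -c(1, 2)], partBY = "row", dictionary = "animals")
}
# }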