Learn R Programming

prepdat (version 1.0.8)

prep: Creates One Finalized Table Ready for Statistical Analysis

Description

prep() aggregates a single dataset in a long format according to any number of grouping variables. This makes prep() suitable for aggregating data from various types of experimental designs such as between-subjects, within-subjects (i.e., repeated measures), and mixed designs (i.e., experimental designs that include both between- and within- subjects independent variables). prep() returns a data frame with a number of dependent measures for further analysis for each aggregated cell (i.e., experimental cell) according to the provided grouping variables (i.e., independent variables). Dependent measures for each experimental cell include among others means before and after rejecting observations according to a flexible standard deviation criteria, number of rejected observations according to the flexible standard deviation criteria, proportions of rejected observations according to the flexible standard deviation criteria, number of observations before rejection, means after rejecting observations according to procedures described in Van Selst & Jolicoeur (1994; suitable when measuring reaction-times), standard deviations, medians, means according to any percentile (e.g., 0.05, 0.25, 0.75, 0.95) and harmonic means. The data frame prep() returns can also be exported as a txt or csv file to be used for statistical analysis in other statistical programs.

Usage

prep( dataset = NULL , file_name = NULL , file_path = NULL , id = NULL , within_vars = c() , between_vars = c() , dvc = NULL , dvd = NULL , keep_trials = NULL , drop_vars = c() , keep_trials_dvc = NULL , keep_trials_dvd = NULL , id_properties = c() , sd_criterion = c(1, 1.5, 2) , percentiles = c(0.05, 0.25, 0.75, 0.95) , outlier_removal = NULL , keep_trials_outlier = NULL , decimal_places = 4 , notification = TRUE , dm = c() , save_results = TRUE , results_name = "results.txt" , results_path = NULL , save_summary = TRUE )

Arguments

dataset
Name of the data frame in R that contains the long format table after merging the individual data files using file_merge(). Either dataset or file_name must be provided. Default is NULL.
file_name
A string with the name of a txt or csv file (including the file extension, e.g. "my_data.txt") with the merged table in case the user already merged the individual data files. Either dataset or file_name must be provided. Default is NULL.
file_path
A string with the path of the folder in which file_name is located. If file_name was used, then file_path must be provided. Default is NULL.
id
A string with the name of the column in file_name or in dataset that contains the variable specifying the case identifier (i.e., the variable upon which the measurement took place; e.g., "subject_number"). This should be a unique value per case. Values in this column must be numeric. Argument must be provided. Default is NULL.
within_vars
String vector with names of grouping variables in file_name or in dataset that contain independent variables manipulated (or observed) within-ids (i.e., within-subjects, repeated measures). Single or multiple values must be specified as a string (e.g., c("SOA", "condition")) according to the hierarchical order you wish. Note that the order of the names in within_vars() is important because prep() aggregates the data for the dependent measures by first dividing them to the levels of the first grouping variable in witin_vars(), and then within each of those levels prep() divides the data according to the next variable in within_vars() and so forth. Values in these columns must be numeric. Either within_vars or between_vars (or both) arguments must be provided. Default is c().
between_vars
String vector with names of grouping variables in file_name or in dataset that contain independent variables manipulated (or observed) between-ids (i.e., between-subjects). Single or multiple values must be specified as a string (e.g., c("order")). Order of the names in between_vars() does not matter. Values in this column must be numeric. Either between_vars or within_vars (or both) arguments must be provided. Default is c().
dvc
A string with the name of the column in file_name or in dataset that contains the dependent variable (e.g., "rt" for reaction-time as a dependent variable). Values in this column must be in an interval or ratio scale. Either dvc or dvd (or both) arguments must be provided. Default is NULL.
dvd
A string with the name of the column in file_name or in dataset that contains the dependent variable (e.g., "ac" for accuracy as a dependent variable). Values in this column must be numeric and discrete (e.g., 0 and 1). Either dvc or dvd (or both) arguments must be provided. Default is NULL.
keep_trials
A string. Allows deleting unnecessary observations and keeping necessary observations in file_name or in dataset according to logical conditions specified as a string. For example, if the dataset contains practice trials for each subject, these trials should not be included in the aggregation. The user should remove these trials by specifying how they were coded in the raw data (i.e., data before aggregation). For example, if practice trials are the ones for which the "block" column in the raw data tables equals to zero, the keep_trials argument should be "raw_data$block !== 0". raw_data is the internal object in prep() representing the merged table. All logical conditions in keep_trials should be put in the same string and be concatenated by & or |. Logical conditions for this argument can relate to different columns in the merged table. Note that all further arguments of prep() will relate to the remaining observations in the merged table. Default is NULL.
drop_vars
String vector with names of columns to delete in file_name or in dataset. Single or multiple values must be specified as a string (e.g., c("font_size")). Order of the names in drop_vars does not matter. Note that all further arguments of prep() will relate to the remaining variables in the merged table. Default is c().
keep_trials_dvc
A string. Allows deleting unnecessary observations and keeping necessary observations in file_name or in dataset for calculations and aggregation of the dependent variable in dvc according to logical conditions specified as a string. Logical conditions should be specified as a string as in the keep_trials argument (e.g., "raw_data$rt > 100 & raw_data$rt < 3000 & raw_dada$ac == 1"). All dependent measures for dvc except for those specified in outlier_removal will be calculated on the remaining observations. Defalut is NULL.
keep_trials_dvd
A string. Allows deleting unnecessary observations and keeping necessary observations in file_name or in dataset for calculations and aggregation of the dependent variable in dvd according to logical conditions specified as a string. Logical conditions should be specified as a string as in the keep_trials argument (e.g., raw_data$rt > 100 & raw_data$rt < 3000). All dependent measures for dvd (i.e., "mdvd" and "merr") will be calculated on the remaining observations. Default is NULL.
id_properties
String vector with names of columns in dataset or in file_name that describe the ids (e.g., subjects) in the data and were not manipulated within-or between-ids. For example, in case the user logged for each observation and for each id in an experiment also the age and the gender of the subject, this argument will be c("age", "gender"). Order of the names in id_properties does not matter. Single or multiple values must be specified as a string. Values in these columns must be numeric. Default is c().
sd_criterion
Numeric vector specifying a number of standard deviation criteria for which prep() will calculate the mean dvc for each cell in the finalized table after rejecting observations that did not meet the criterion (e.g., rejecting observations that were more than 2 standard deviations above or below the mean of that cell). Values in this vector must be numeric. Default is c(1, 1.5, 2).
percentiles
Numeric vector containing wanted percentiles for dvc. Values in this vector must be decimal numbers between 0 to 1. Percentiles are calculated according to type = 7 (see quantile for more information). Default is c(0.05, 0.25, 0.75, 0.95).
outlier_removal
Numeric. Specifies which outlier removal procedure with moving criterion to calculate for dvc according to procedures described by Van Selst & Jolicoeur (1994). If 1 then non-recursive procedure is calculated, if 2 then modified recursive procedure is calculated, if 3 then hybrid recursive procedure is calculated. Moving criterion is according to Table 4 in Van Selst & Jolicoeur (1994). If experimental cell has 4 trials or less it will result in NA. Default is NULL.
keep_trials_outlier
A string. Allows deleting unnecessary observations and keeping necessary observations in file_name or in dataset for calculations and aggregation of the outlier removal procedures by Van Selst & Jolicoeur (1994). Logical conditions should be specified as a string as in the keep_trials argument (e.g., "raw_data$ac == 1"). outlier_removal procedure will be calculated on the remaining observations. Defalut is NULL.
decimal_places
Numeric. Specifies number of decimals to be written in results_name for each value of the dependent measures for dvc. Value must be numeric. Default is 4.
notification
Logical. If TRUE, prints messages about the progress of the function. Default is TRUE.
dm
String vector with names of dependent measures the function returns. If empty (i.e., c()) the function returns a data frame with all possible dependent measures in prep(). Values in this vector must be strings from the following list: "mdvc", "sdvc", "meddvc", "tdvc", "ntr", "ndvc", "ptr", "prt", "rminv", "mdvd", "merr". Default is c(). See Value section below for more details.
save_results
Logical. If TRUE, the function creates a txt file containing the returned data frame. Default is TRUE.
results_name
A string with the name of the file prep returns in case save_results is TRUE. Extension of the file can be txt or csv and should be included. Default is "results.txt".
results_path
A string with the path of the folder in which results_name will be saved. Default is the path provided in file_path. In case no path was provided in file_path, results_path must be provided.
save_summary
Logical. if TRUE, creates a summary file in the same format as results_name. Default is TRUE.

Value

A data frame with dependent measures for the dependent variables in dvc and dvd by id and grouping variables.The first column in the finalized table is the id column. In case id_properties was used, the next columns will be the value of each id_properties for each id.If between_vars was used then the next column{}s will be the value of each beween_vars for each id.The next columns of the finalized table contain the dependent measures according to the design specified. If within_vars was used, then the data for each dependent measure was first divided according to the levels of the first grouping variable in witin_vars(), and then within each of those levels prep() divided the data according to the next variable in within_vars() and so forth. The dependent measures in the finalized table are:mdvc: mean dvc.sdvc: SD for dvc.meddvc: median dvc.tdvc: mean dvc after rejecting observations above standard deviation criteria specified in sd_criterion.ntr: number of observations rejected for each standard deviation criterion specified in sd_criterion.ndvc: number of observations before rejection.ptr: proportion of observations rejected for each standard deviation criterion specified in sd_criterion.rminv: harmonic mean of dvc.prt: dvc according to each of the percentiles specified in percentiles.mdvd: mean dvd.merr: mean error.nrmc: mean dvc according to non-recursive procedure with moving criterion.nnrmc: number of observations rejected for dvc according to non-recursive procedure with moving criterion.pnrmc: percent of observations rejected for dvc according to non-recursive procedure with moving criterion.tnrmc: total number of observations upon which the non-recursive procedure with moving criterion was applied.mrmc: mean dvc according to modified-recursive procedure with moving criterion.nmrmc: number of observations rejected for dvc according to modified-recursive procedure with moving criterion.pmrmc: percent of observations rejected for dvc according to modified-recursive procedure with moving criterion.tmrmc: total number of observations upon which the modified-recursive procedure with moving criterion was applied.hrmc: mean dvc according to hybrid-recursive procedure with moving criterion.nhrmc: number of observations rejected for dvc according to hybrid-recursive procedure with moving criterion.thrmc: total number of observations upon which the hybrid-recursive procedure with moving criterion was applied.

References

Grange, J.A. (2015). trimr: An implementation of common response time trimming methods. R Package Version 1.0.1. https://CRAN.R-project.org/package=trimr

Van Selst, M., & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. The quarterly journal of experimental psychology, 47(3), 631-650.

Examples

Run this code
data(stroopdata)
finalized_stroopdata <- prep(
           dataset = stroopdata
           , file_name = NULL
           , file_path = NULL
           , id = "subject"
           , within_vars = c("block", "target_type")
           , between_vars = c("order")
           , dvc = "rt"
           , dvd = "ac"
           , keep_trials = NULL
           , drop_vars = c()
           , keep_trials_dvc = "raw_data$rt > 100 & raw_data$rt < 3000 & raw_data$ac == 1"
           , keep_trials_dvd = "raw_data$rt > 100 & raw_data$rt < 3000"
           , id_properties = c()
           , sd_criterion = c(1, 1.5, 2)
           , percentiles = c(0.05, 0.25, 0.75, 0.95)
           , outlier_removal = 2
           , keep_trials_outlier = "raw_data$ac == 1"
           , decimal_places = 0
           , notification = TRUE
           , dm = c()
           , save_results = FALSE
           , results_name = "results.txt"
           , results_path = NULL
           , save_summary = FALSE
         )

Run the code above in your browser using DataLab