di_iterate_sql: Iteratively calculate disproportionate impact using multiple methods for many variables, using SQL.

Description

Iteratively calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for many success variables, disaggregation variables, and scenarios, using SQL (for data stored in a database or in a parquet data file).

Usage

di_iterate_sql(
  db_conn,
  db_table_name,
  success_vars,
  group_vars,
  cohort_vars = NULL,
  scenario_repeat_by_vars = NULL,
  exclude_scenario_df = NULL,
  weight_var = NULL,
  include_non_disagg_results = TRUE,
  ppg_reference_groups = "overall",
  min_moe = 0.03,
  use_prop_in_moe = FALSE,
  prop_sub_0 = 0.5,
  prop_sub_1 = 0.5,
  di_prop_index_cutoff = 0.8,
  di_80_index_cutoff = 0.8,
  di_80_index_reference_groups = "hpg",
  check_valid_reference = TRUE,
  parallel = FALSE,
  parallel_n_cores = parallel::detectCores()/2,
  mssql_flag = FALSE,
  return_what = "data",
  staging_table = paste0("DisImpact_Staging_", paste0(sample(1:9, size = 5, replace =
    TRUE), collapse = "")),
  drop_staging_table = TRUE
)

Value

When return_what='data' (default), a long data frame is returned (see the return value for di_iterate). When return_what='SQL' (default), a list object where each element is a query (character value) is returned.

Arguments

db_conn

A database connection object, returned by dbConnect.

db_table_name

A character value specifying a database table name.

success_vars

A character vector of success variable names to iterate across.

group_vars

A character vector of group (disaggregation) variable names to iterate across.

cohort_vars

(Optional) A character vector of the same length as success_vars to indicate the cohort variable to be used for each variable specified in success_vars. A vector of length 1 could be specified, in which case the same cohort variable is used for each success variable. If not specified, then a single cohort is assumed for all success variables (defaults to NULL).

scenario_repeat_by_vars

(Optional) A character vector of variables to repeat DI calculations for across all combination of these variables. For example, the following variables could be specified:

Ed Goal: Degree/Transfer, Shot-term Career, Non-credit
First time college student: Yes, No
Full-time status: Yes, No

Each combination of these variables (eg, full time, first time college students with an ed goal of degree/transfer as one combination) would constitute an iteration / sample for which to calculate disproportionate impact for outcomes listed in success_vars and for the disaggregation variables listed in group_vars. The overall rate of success for full time, first time college students with an ed goal of degree/transfer would just include these students and not others. Each variable specified is also collapsed to an '- All' group so that the combinations also reflect all students of a particular category. The total number of combinations for the previous example would be (+1 representing the all category): (3 + 1) x (2 + 1) x (2 + 1) = 36.

exclude_scenario_df

(Optional) A data frame with variables that match scenario_repeat_by_vars for specifying the combinations to exclude from DI calculations. Following the example specified above, one could choose to exclude part-time non-credit students from consideration.

weight_var

(Optional) A character variable specifying the weight variable if the input data set is summarized (i.e., the the success variables specified in success_vars contain count of successes). Weight here corresponds to the denominator when calculating the success rate. Defaults to NULL for an input data set where each row describes an individual.

include_non_disagg_results

A logical variable specifying whether or not the non-disaggregated results should be returned; defaults to TRUE. When TRUE, a new variable `- None` is added to the data set with a single data value '- All', and this variable is added to group_vars as a disaggregation/group variable. The user would want these results returned to review non-disaggregated results.

ppg_reference_groups

Either 'overall', 'hpg', 'all but current', or a character vector of the same length as group_vars that indicates the reference group value for each group variable in group_vars when determining disproportionate impact using the percentage point gap method.

min_moe

The minimum margin of error to be used in the PPG calculation; see di_ppg.

use_prop_in_moe

(TRUE or FALSE) Whether the estimated proportions should be used in the margin of error calculation by the PPG; see di_ppg.

prop_sub_0

Default is 0.50; see di_ppg.

prop_sub_1

Default is 0.50; see di_ppg.

di_prop_index_cutoff

Threshold used for determining disproportionate impact using the proportionality index; see di_prop_index; defaults to 0.80.

di_80_index_cutoff

Threshold used for determining disproportionate impact using the 80% index; see di_80_index; defaults to 0.80.

di_80_index_reference_groups

check_valid_reference

(TRUE or FALSE) Check whether ppg_reference_groups and di_80_index_reference_groups contain valid values; defaults to TRUE.

parallel

If TRUE, then perform calculations in parallel. The parallel feature is only supported when db_table_name is a path to a parquet file ('/path/to/data.parquet') and that db_conn is a connection to a duckdb database (e.g., dbConnect(duckdb(), dbdir=':memory:')). Defaults to FALSE.

parallel_n_cores

The number of CPU cores to use if parallel=TRUE. Defaults to half of the maximum number of CPU cores on the system.

mssql_flag

User-specified logical flag (TRUE or FALSE) that indicates if the MS SQL Server variant of the SQL language should be used.

return_what

A character value specifying the return value for the function call. For 'data', the function will return a long data frame with the disproportionate calculations and relevant statistics, after the calculations are performed on the SQL database engine. For 'SQL', a list object of individual queries will be returned for the user to execute elsewhere. Defaults to 'data'.

staging_table

A character value indicating the name of the staging or results table in the database for storing the disproportionate impact calculations.

drop_staging_table

TRUE/FALSE A logical flag indicating whether or not the staging table specified in staging_table should be dropped in the database after the results are returned to R; defaults to TRUE.

Details

Iteratively calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for all combinations of success_vars, group_vars, and cohort_vars, for each combination of subgroups specified by scenario_repeat_by_vars, using SQL (calculations done on the database engine or duckdb for parquet files).