Performs a standardization of data (z-scoring), i.e., centering and scaling,
so that the data is expressed in terms of standard deviation (i.e., mean = 0,
SD = 1) or Median Absolute Deviation (median = 0, MAD = 1). When applied to a
statistical model, this function extracts the dataset, standardizes it, and
refits the model with this standardized version of the dataset. The
normalize() function can also be used to scale all numeric variables within
the 0 - 1 range.
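The arithmetic behind the description above can be sketched in base R. This is an illustration only, not the package's internal implementation; the vector `x` is made-up data, and whether the package scales the MAD with the same constant as `stats::mad()` is an assumption.

```r
# Base-R sketch of z-scoring; illustration data, not the package's code.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Classic standardization: subtract the mean, divide by the SD.
z <- (x - mean(x)) / sd(x)

# Robust variant: subtract the median, divide by the MAD.
# Note: stats::mad() multiplies the raw MAD by 1.4826 by default.
z_robust <- (x - median(x)) / mad(x)

# Min-max scaling to the 0 - 1 range, the idea behind normalize().
x01 <- (x - min(x)) / (max(x) - min(x))

c(mean = mean(z), sd = sd(z))
c(median = median(z_robust), mad = mad(z_robust))
range(x01)
```

After the transformation, `z` has mean 0 and SD 1 (up to floating-point error), `z_robust` has median 0 and MAD 1, and `x01` spans 0 to 1.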
standardize(
x,
robust = FALSE,
two_sd = FALSE,
weights = NULL,
verbose = TRUE,
...
)
# S3 method for numeric
standardize(
x,
robust = FALSE,
two_sd = FALSE,
weights = NULL,
verbose = TRUE,
reference = NULL,
...
)
# S3 method for data.frame
standardize(
x,
robust = FALSE,
two_sd = FALSE,
weights = NULL,
verbose = TRUE,
reference = NULL,
select = NULL,
exclude = NULL,
remove_na = c("none", "selected", "all"),
force = FALSE,
append = FALSE,
suffix = "_z",
...
)
# S3 method for default
standardize(
x,
robust = FALSE,
two_sd = FALSE,
weights = TRUE,
verbose = TRUE,
include_response = TRUE,
...
)
unstandardize(
x,
center = NULL,
scale = NULL,
reference = NULL,
robust = FALSE,
two_sd = FALSE,
...
)
x: A data frame, a vector or a statistical model (for unstandardize(), x
cannot be a model).
robust: Logical. If TRUE, centering is done by subtracting the median from
the variables and dividing them by the median absolute deviation (MAD). If
FALSE, variables are standardized by subtracting the mean and dividing them
by the standard deviation (SD).
two_sd: If TRUE, the variables are scaled by two times the deviation (SD or
MAD depending on robust). This method can be useful to obtain model
coefficients of continuous parameters comparable to coefficients related to
binary predictors, when applied to the predictors (not the outcome)
(Gelman, 2008).
weights: Can be NULL (for no weighting), or:
  For models: if TRUE (default), a weighted standardization is carried out.
  For data.frames: a numeric vector of weights, or a character string naming
  the column in the data.frame that contains the weights.
  For numeric vectors: a numeric vector of weights.
verbose: Toggle warnings and messages on or off.
...: Arguments passed to or from other methods.
reference: A data frame or variable from which the centrality and deviation
will be computed instead of from the input variable. Useful for
standardizing a subset or new data according to another data frame.
select: Character vector of column names. If NULL (the default), all
variables will be selected.
exclude: Character vector of column names to be excluded from selection.
remove_na: How missing values (NA) should be treated. If "none" (default),
each column's standardization is done separately, ignoring NAs. Otherwise,
rows with NA in the columns selected with select / exclude ("selected") or
in all columns ("all") are dropped before standardization, and the resulting
data frame does not include these cases.
force: Logical. If TRUE, forces standardization of factors and dates as
well. Factors are converted to numerical values, with the lowest level being
the value 1 (unless the factor has numeric levels, which are converted to
the corresponding numeric value).
append: Logical. If TRUE and x is a data frame, standardized variables will
be added as additional columns; if FALSE, existing variables are
overwritten.
suffix: Character value that will be appended to variable (column) names of
x, if x is a data frame and append = TRUE.
include_response: For a model, if TRUE (default), the response value will
also be standardized. If FALSE, only the predictors will be standardized.
Note that for certain models (logistic regression, count models, ...), the
response value will never be standardized, to make re-fitting the model
work. (For mediate models, this only applies to the y model; the m model's
response will always be standardized.)
center, scale: Used by unstandardize(); center and scale correspond to the
center (the mean / median) and the scale (SD / MAD) of the original
non-standardized data (for data frames, they should be named, or have their
order correspond to that of the numeric columns). However, one can also
directly provide the original data through reference, from which the center
and the scale will be computed (according to robust and two_sd).
Alternatively, if the input contains the attributes center and scale (as
does the output of standardize()), they are taken from there if the rest of
the arguments are absent.
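How a reference, a center and a scale relate to one another can be sketched in base R. This is a hedged illustration of the arithmetic only: `train`, `new`, `ctr` and `scl` are hypothetical names, and the package's own handling of attributes and weights is not reproduced here.

```r
# Base-R sketch; `train` and `new` are made-up illustration data.
train <- c(10, 12, 14, 16, 18)
new   <- c(11, 20)

# Standardize new data with the center and scale of the reference data.
ctr   <- mean(train)
scl   <- sd(train)
new_z <- (new - ctr) / scl

# Scaling by two deviations, the idea behind two_sd = TRUE.
new_z_2sd <- (new - ctr) / (2 * scl)

# Reversing the transformation: multiply by the scale, add the center.
new_back <- new_z * scl + ctr

all.equal(new_back, new)
```

Supplying the same center and scale in both directions is what makes the round trip exact; computing them from different data would not recover the original values.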
The standardized object (either a standardized data frame or a statistical model fitted on standardized data).
If x is a model object, standardization is done by completely refitting the
model on the standardized data. Hence, this approach is equal to
standardizing the variables before fitting the model and will return a new
model object. This method is particularly recommended for complex
models that include interactions or transformations (e.g., polynomial or
spline terms). The robust argument (default FALSE) enables a robust
standardization of data, i.e., based on the median and MAD instead of the
mean and SD. See standardize_parameters() for other methods of
standardizing model coefficients.
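The "standardize first, then fit" idea described here can be sketched by hand in base R with the built-in swiss data. This is a minimal illustration of the principle, assuming plain mean/SD standardization of the three model variables; it does not call the package and `swiss_z` is a hypothetical name.

```r
# Base-R sketch: standardize the model variables, then fit the model.
vars <- c("Infant.Mortality", "Education", "Fertility")
swiss_z <- swiss
swiss_z[vars] <- lapply(swiss[vars], function(v) (v - mean(v)) / sd(v))

# Fitting on the standardized data returns a new model object.
model_z <- lm(Infant.Mortality ~ Education * Fertility, data = swiss_z)

coef(model_z)
```

Because the interaction term is formed from the already standardized predictors, its coefficient differs from what rescaling the raw-data coefficients after fitting would give, which is one reason refitting suits models with interactions.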
When the model's formula contains transformations (e.g., y ~ exp(X)), the
transformation effectively takes place after standardization (e.g.,
exp(scale(X))). Some transformations are undefined for negative values,
such as log() and sqrt(). To avoid dropping these values, the
standardized data is shifted by Z - min(Z) + 1 or Z - min(Z)
(respectively).
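The two shifts can be checked on a small example in base R. This sketch only demonstrates the stated formulas on a made-up vector `X`; it does not show how the package detects transformations in a formula.

```r
# Base-R sketch of the shifts applied before log() and sqrt().
X <- c(1, 2, 4, 8, 16)
Z <- (X - mean(X)) / sd(X)  # standardized values, some of them negative

Z_log  <- Z - min(Z) + 1    # minimum becomes 1, so log() is defined
Z_sqrt <- Z - min(Z)        # minimum becomes 0, so sqrt() is defined

log(Z_log)
sqrt(Z_sqrt)
```

Both shifts preserve the spacing between observations and only move the origin, so no value is lost to NaN.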
When standardizing coefficients of a generalized model (GLM, GLMM, etc.), only the predictors are standardized, which maintains the interpretability of the coefficients (e.g., in a binomial model, the exponent of the standardized parameter is the odds ratio (OR) for a change of 1 SD in the predictor).
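The binomial case can be sketched by hand in base R with the built-in mtcars data. The choice of `am` as outcome and `wt` as predictor is purely illustrative, and the standardization is done manually rather than through the package.

```r
# Base-R sketch: logistic regression with only the predictor standardized.
d <- mtcars
d$wt_z <- (d$wt - mean(d$wt)) / sd(d$wt)  # the response `am` is left as 0/1

fit_raw <- glm(am ~ wt,   data = d, family = binomial())
fit_z   <- glm(am ~ wt_z, data = d, family = binomial())

# Odds ratio for a 1 SD increase in the predictor.
or_per_sd <- exp(coef(fit_z)[["wt_z"]])
or_per_sd
```

With a single predictor and no interaction, the standardized slope equals the raw slope multiplied by the SD of the predictor, so the two fits describe the same model on different scales.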
Other transform utilities: change_scale(), normalize(), ranktransform()
Other standardize: standardize_info(), standardize_parameters()
library(effectsize)  # assumption: the package that provides standardize()
# Data frames
summary(standardize(swiss))
# Models
model <- lm(Infant.Mortality ~ Education * Fertility, data = swiss)
coef(standardize(model))