Creates a dataObject object from input data. Input data can be a data.frame
or data.table, a path to such tables on a local or network drive, or a path
to tabular data that may be converted to these formats.
In addition, a familiarEnsemble or familiarModel object can be passed along
to check whether the data are formatted correctly, e.g. by checking the
levels of categorical features, whether all expected columns are present,
etc.
as_data_object(data, ...)

# S4 method for dataObject
as_data_object(data, object = NULL, ...)

# S4 method for data.table
as_data_object(
  data,
  object = NULL,
  sample_id_column = waiver(),
  batch_id_column = waiver(),
  series_id_column = waiver(),
  development_batch_id = waiver(),
  validation_batch_id = waiver(),
  outcome_name = waiver(),
  outcome_column = waiver(),
  outcome_type = waiver(),
  event_indicator = waiver(),
  censoring_indicator = waiver(),
  competing_risk_indicator = waiver(),
  class_levels = waiver(),
  exclude_features = waiver(),
  include_features = waiver(),
  reference_method = waiver(),
  check_stringency = "strict",
  ...
)

# S4 method for ANY
as_data_object(
  data,
  object = NULL,
  sample_id_column = waiver(),
  batch_id_column = waiver(),
  series_id_column = waiver(),
  ...
)
Returns a dataObject object.
data: A data.frame or data.table, a path to such tables on a local or
network drive, or a path to tabular data that may be converted to these
formats.
...: Unused arguments.
object: A familiarEnsemble or familiarModel object against which the
consistency of the input data is checked.
sample_id_column: (recommended) Name of the column containing sample or
subject identifiers. See batch_id_column below for more details.
If unset, every row will be identified as a single sample.
batch_id_column: (recommended) Name of the column containing batch or cohort identifiers. This parameter is required if more than one dataset is provided, or if external validation is performed.
In familiar, any row of data is organised by four identifiers:
The batch identifier batch_id_column: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
The sample identifier sample_id_column: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
The series identifier series_id_column: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
The repetition identifier: Indicates repeated measurements in a single series where any feature values may differ, but the outcome does not. Repetition identifiers are always set implicitly when multiple entries are encountered for the same series of the same sample in the same batch and these entries share the same outcome.
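The implicit repetition identifier can be pictured with a small base R sketch. The data, the column names and the grouping logic below are illustrative assumptions; this is not how familiar implements it internally.

```r
# Hypothetical long-format data: rows 1 and 2 share batch, sample, series
# and outcome, so they are repeated measurements of the same series.
measurements <- data.frame(
  batch_id = c("study_1", "study_1", "study_1", "study_2"),
  sample_id = c("p01", "p01", "p02", "p01"),
  series_id = c(1, 1, 1, 1),
  outcome = c(0, 0, 1, 1),
  feature = c(0.8, 0.9, 1.4, 0.3)
)

# Build one grouping key per batch/sample/series combination.
group_key <- interaction(
  measurements$batch_id,
  measurements$sample_id,
  measurements$series_id,
  drop = TRUE
)

# Number the rows within each group; this running count plays the role of
# a repetition identifier in this sketch.
measurements$repetition_id <- ave(
  seq_len(nrow(measurements)),
  group_key,
  FUN = seq_along
)
```

Note that the same sample identifier (p01) appears in two batches and is treated as two different samples, because grouping starts at the batch level.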
series_id_column: (optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See batch_id_column above for more details.
If unset, rows which share the same batch and sample identifiers but have a different outcome are assigned unique series identifiers.
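As a rough illustration of the fallback described above, the following base R sketch assigns a series identifier per distinct outcome value within each batch and sample. The data and column names are made up, and the logic is an assumption about the described behaviour rather than familiar's own code.

```r
# Hypothetical data without a series column. Sample p01 has two distinct
# outcome values, so it should end up with two series.
rows <- data.frame(
  batch_id = c("study_1", "study_1", "study_1", "study_1"),
  sample_id = c("p01", "p01", "p01", "p02"),
  outcome = c(0.2, 0.5, 0.5, 0.7)
)

# One grouping key per batch/sample combination.
sample_key <- interaction(rows$batch_id, rows$sample_id, drop = TRUE)

# Within each sample, give every distinct outcome value its own series
# number, in order of first appearance.
rows$series_id <- ave(
  rows$outcome,
  sample_key,
  FUN = function(x) match(x, unique(x))
)
```

Rows 2 and 3 receive the same series identifier because they share an outcome, which makes them repetitions within one series.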
development_batch_id: (optional) One or more batch or cohort identifiers
to constitute data sets for development. Defaults to all identifiers, or,
for external validation, to all identifiers except those in
validation_batch_id.
Required if external validation is performed and validation_batch_id is
not provided.
validation_batch_id: (optional) One or more batch or cohort identifiers
to constitute data sets for external validation. For external validation,
defaults to all data sets except those in development_batch_id. If
external validation is not performed, defaults to none. Required if
development_batch_id is not provided.
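The complementary defaults of development_batch_id and validation_batch_id can be sketched in base R. The helper resolve_batches is hypothetical, and it assumes that external validation is requested whenever either argument is supplied; familiar decides this through its own settings.

```r
# Hypothetical helper: fill in whichever set of batch identifiers is
# missing as the complement of the one that was provided.
resolve_batches <- function(all_batches, development = NULL, validation = NULL) {
  if (is.null(development) && is.null(validation)) {
    # Assumed to mean "no external validation": all batches are used for
    # development and none for validation.
    return(list(development = all_batches, validation = character(0)))
  }
  if (is.null(development)) {
    development <- setdiff(all_batches, validation)
  }
  if (is.null(validation)) {
    validation <- setdiff(all_batches, development)
  }
  list(development = development, validation = validation)
}

batches <- c("study_1", "study_2", "study_3")

# Only the validation batch is given; development becomes the remainder.
split_a <- resolve_batches(batches, validation = "study_3")

# Nothing is given; every batch is used for development.
split_b <- resolve_batches(batches)
```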
outcome_name: (optional) Name of the modelled outcome. This name will be
used in figures created by familiar.
If not set, the column name in outcome_column will be used for binomial,
multinomial, count and continuous outcomes. For other outcomes (survival
and competing_risk) no default is used.
outcome_column: (recommended) Name of the column containing the outcome
of interest. May be identified from a formula, if a formula is provided as
an argument. Otherwise an error is raised. Note that survival and
competing_risk outcomes require two columns that indicate the
time-to-event or the time of last follow-up, and the event status.
outcome_type: (recommended) Type of outcome found in the outcome column. The outcome type determines many aspects of the overall process, e.g. the available feature selection methods and learners, but also the type of assessments that can be conducted to evaluate the resulting models. Implemented outcome types are:
binomial: categorical outcome with 2 levels.
multinomial: categorical outcome with 2 or more levels.
count: Poisson-distributed numeric outcomes.
continuous: general continuous numeric outcomes.
survival: survival outcome for time-to-event data.
If not provided, the algorithm will attempt to infer outcome_type from the contents of the outcome column. This may lead to unexpected results, and we therefore advise providing this information manually.
Note that competing_risk survival analysis is not fully supported, and
competing_risk is currently not a valid choice for outcome_type.
event_indicator: (recommended) Indicator for events in survival and
competing_risk analyses. familiar will automatically recognise 1, true, t,
y and yes as event indicators, including different capitalisations. If
this parameter is set, it replaces the default values.
censoring_indicator: (recommended) Indicator for right-censoring in
survival and competing_risk analyses. familiar will automatically
recognise 0, false, f, n and no as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.
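The case-insensitive matching of event and censoring indicators can be sketched as follows. The function classify_status is hypothetical and only mirrors the default values listed in this documentation; it is not familiar's internal routine.

```r
# Default indicator values as listed in this documentation.
default_event <- c("1", "true", "t", "y", "yes")
default_censoring <- c("0", "false", "f", "n", "no")

# Hypothetical helper: label each status value as "event" or "censored",
# ignoring capitalisation. Unrecognised values become NA.
classify_status <- function(status,
                            event = default_event,
                            censoring = default_censoring) {
  value <- tolower(as.character(status))
  ifelse(
    value %in% tolower(event),
    "event",
    ifelse(value %in% tolower(censoring), "censored", NA_character_)
  )
}

# Mixed capitalisation and a value outside both default sets.
status_labels <- classify_status(c("Yes", "NO", "1", "F", "dead"))

# Supplying custom indicators replaces the defaults entirely.
custom_labels <- classify_status(
  c("dead", "alive", "yes"),
  event = "dead",
  censoring = "alive"
)
```

In the second call, "yes" is no longer recognised as an event because the custom value replaces the defaults rather than extending them.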
competing_risk_indicator: (recommended) Indicator for competing risks in
competing_risk analyses. There are no default values, and if unset, all
values other than those specified by the event_indicator and
censoring_indicator parameters are considered to indicate competing risks.
class_levels: (optional) Class levels for binomial or multinomial
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.
exclude_features: (optional) Feature columns that will be removed from
the data set. Cannot overlap with features in signature, novelty_features
or include_features.
include_features: (optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with exclude_features, but may overlap with signature. Features in
signature and novelty_features are always included. If both
exclude_features and include_features are provided, include_features takes
precedence, provided that there is no overlap between the two.
reference_method: (optional) Method used to set reference levels for categorical features. There are several options:
auto (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
always: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
never: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
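The difference between using the most frequent level and using the first sorted level as reference can be shown with base R factors. The helper set_most_frequent_reference is hypothetical and only approximates the described behaviour; how familiar resolves ties is not stated here, so the sketch simply takes the first of the tied levels.

```r
# Hypothetical helper: move the most frequent level to the front so that
# it acts as the reference level. Ordered factors are returned unchanged.
set_most_frequent_reference <- function(x) {
  if (is.ordered(x)) {
    return(x)
  }
  x <- factor(x)
  counts <- table(x)
  # which.max returns the first maximum, so ties go to the earliest level.
  relevel(x, ref = names(counts)[which.max(counts)])
}

colour <- c("red", "blue", "red", "green", "red")

# Sorted levels, comparable to the "never" option for detected features:
# the first sorted level is the reference.
sorted_levels <- levels(factor(colour))

# Most frequent level first, comparable to the "auto" and "always" options.
frequent_levels <- levels(set_most_frequent_reference(colour))

# Ordinal features keep their level order.
dose <- factor(c("low", "high", "high"), levels = c("low", "high"), ordered = TRUE)
dose_levels <- levels(set_most_frequent_reference(dose))
```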
check_stringency: Specifies the stringency of various checks. The following values are available:
strict: default value used for summon_familiar. Thoroughly checks input
data. Used internally for checking development data.
external_warn: value used for extract_data and related methods. Less
stringent checks, but will warn for possible issues. Used internally for
checking data for evaluation and explanation.
external: value used for external methods such as predict. Less stringent
checks, particularly for identifier and outcome columns, which may be
completely absent. Used internally for predict.
You can specify settings for your data manually, e.g. the column for
sample identifiers (sample_id_column). This prevents you from having to
change the column name externally. If you provide a familiarModel or
familiarEnsemble for the object argument, any parameters you provide take
precedence over parameters specified by the object.
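A minimal sketch of specifying settings manually, using the iris data set that ships with R. The flower_id column is added purely for illustration, and the call is only attempted when the familiar package is installed. The argument names come from the usage shown above; whether the default strict checks accept this particular data set has not been verified here.

```r
# Built-in example data, with a made-up identifier column.
iris_data <- datasets::iris
iris_data$flower_id <- seq_len(nrow(iris_data))

# Only attempt the call when familiar is available.
if (requireNamespace("familiar", quietly = TRUE)) {
  data_object <- familiar::as_data_object(
    data = iris_data,
    sample_id_column = "flower_id",
    outcome_column = "Species",
    outcome_type = "multinomial"
  )
}
```

Passing sample_id_column here avoids renaming the identifier column before handing the data to familiar.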