Generates distribution object for gbmt.
gbm_dist(name="Gaussian", ...)
Returns a GBMDist object.
The name (a string) of the distribution to be initialized and used in fitting a gradient boosted model via gbmt. The currently available distributions can be listed with the function available_distributions. If no distribution is specified, this function constructs a Gaussian distribution by default.
Extra named parameters required for initializing certain distributions.
If the t-distribution is selected, an additional parameter (df) specifying the number of degrees of freedom can be given. The default is four degrees of freedom.
If quantile is selected, the quantile to estimate may be specified using the named parameter alpha. The default quantile to estimate is 0.25.
If the tweedie distribution is selected, the power law specifying the distribution may be set via the named parameter power. This parameter defaults to unity.
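The parameters above can be passed as named arguments to gbm_dist. The following is a minimal sketch, not taken from the package's own examples. It assumes the package is installed under the name gbm3 and that the name strings "TDist", "Quantile" and "Tweedie" are the ones reported by available_distributions; both are assumptions, so each name is checked before use and the constructor calls are skipped if the package is absent.

```r
# Argument lists for a few distributions. The name strings are assumptions;
# confirm them against available_distributions() in your installation.
dist_specs <- list(
  gaussian = list(),                                # no name given: Gaussian default
  tdist    = list(name = "TDist", df = 6),          # t-distribution, 6 degrees of freedom
  quantile = list(name = "Quantile", alpha = 0.5),  # estimate the median
  tweedie  = list(name = "Tweedie", power = 1.5)    # power-law exponent
)

# Only call the constructor when the (assumed) gbm3 package is installed.
if (requireNamespace("gbm3", quietly = TRUE)) {
  known <- gbm3::available_distributions()
  # Keep the default spec and any spec whose name the package recognizes.
  usable <- Filter(
    function(spec) is.null(spec$name) || spec$name %in% known,
    dist_specs
  )
  dists <- lapply(usable, function(spec) do.call(gbm3::gbm_dist, spec))
}
```

Any parameter left out of a call keeps the default stated above.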
If a Cox proportional hazards model is selected, several additional parameters are available:
strata
A vector of integers (or factors) specifying which stratum each data row belongs to. If none is specified, all training data are assumed to be in the same stratum.
ties
String specifying the method used to handle tied event times. Currently only "breslow" and "efron" are available, with the latter being the default.
prior_node_coeff_var
A prior on the coefficient of variation associated with the hazard rate assigned to each terminal node when fitting a tree. Increasing its value gives more weight to the training data in the node when assigning a prediction to that node. This defaults to 1000.
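As an illustration of the Cox parameters, the sketch below builds a strata vector from a factor and passes it together with ties and prior_node_coeff_var. The package name gbm3, the name string "CoxPH" and the site labels are assumptions made for this example only; the constructor call is skipped unless the package is installed and reports that name.

```r
# Hypothetical stratum labels for six training rows (e.g. two study sites).
site <- factor(c("a", "a", "b", "b", "b", "a"))

cox_args <- list(
  name = "CoxPH",                 # assumed name string, checked below
  strata = as.integer(site),      # one integer stratum code per data row
  ties = "efron",                 # the documented default, stated explicitly
  prior_node_coeff_var = 1000     # the documented default, stated explicitly
)

# Only call the constructor when the (assumed) gbm3 package is installed
# and actually lists the assumed distribution name.
if (requireNamespace("gbm3", quietly = TRUE) &&
    cox_args$name %in% gbm3::available_distributions()) {
  cox_dist <- do.call(gbm3::gbm_dist, cox_args)
}
```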
Finally, if the pairwise distribution is selected, several parameters also need to be specified: group, metric and max_rank.
The first is a character vector with the column names of data that jointly indicate the group an instance belongs to (typically a query in Information Retrieval). For training, only pairs of instances from the same group and with different target labels may be considered. metric is the IR measure to use, one of
conc
Fraction of concordant pairs; for binary labels, this is equivalent to the Area under the ROC Curve.
mrr
Mean reciprocal rank of the highest-ranked positive instance.
map
Mean average precision, a generalization of mrr to multiple positive instances.
ndcg
Normalized discounted cumulative gain. The score is the weighted sum (DCG) of the user-supplied target values, each discounted by 1/log(rank+1), and normalized to the maximum achievable value. This is the default if no metric is specified.
ndcg and conc allow arbitrary target values, while binary targets {0,1} are expected for map and mrr. For ndcg and mrr, a cut-off can be chosen using a positive integer parameter max_rank. If left unspecified, all ranks are taken into account.
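To make the four measures concrete, the following base R sketch computes each of them for a single group. It is an illustration written for this page, not the package's internal implementation. The function names, the strict treatment of tied scores in conc, and the use of log base 2 are assumptions; the log base cancels in the ndcg ratio. The group-level values would then be averaged over groups.

```r
# Labels reordered by descending predicted score (rank 1 first).
ranked_labels <- function(score, label) label[order(score, decreasing = TRUE)]

# conc: fraction of pairs with different labels that the scores order correctly.
conc <- function(score, label) {
  pairs <- which(outer(label, label, ">"), arr.ind = TRUE)  # label[i] > label[j]
  if (nrow(pairs) == 0) return(NA_real_)
  mean(score[pairs[, 1]] > score[pairs[, 2]])
}

# mrr: reciprocal rank of the highest-ranked positive, 0 beyond the cut-off.
mrr <- function(score, label, max_rank = length(label)) {
  first_pos <- which(ranked_labels(score, label) > 0)[1]
  if (is.na(first_pos) || first_pos > max_rank) return(0)
  1 / first_pos
}

# Average precision: mean of precision-at-rank over all positive instances.
avg_precision <- function(score, label) {
  pos <- which(ranked_labels(score, label) > 0)
  if (length(pos) == 0) return(0)
  mean(seq_along(pos) / pos)
}

# ndcg: DCG of the predicted ranking divided by the DCG of the ideal ranking.
ndcg <- function(score, label, max_rank = length(label)) {
  dcg <- function(y) {
    k <- seq_len(min(max_rank, length(y)))
    sum(y[k] / log2(k + 1))
  }
  ideal <- dcg(sort(label, decreasing = TRUE))
  if (ideal == 0) return(0)
  dcg(ranked_labels(score, label)) / ideal
}

# One group of four instances with binary targets.
score <- c(0.9, 0.8, 0.3, 0.1)
label <- c(0, 1, 0, 1)
results <- c(
  conc = conc(score, label),
  mrr  = mrr(score, label),
  map  = avg_precision(score, label),
  ndcg = ndcg(score, label)
)
```

In this example the first positive instance sits at rank 2, so setting max_rank = 1 drives both mrr and ndcg to zero.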
James Hickey