gbm_dist: GBM Distribution

Description

Generates a distribution object for use with gbmt.

Usage

gbm_dist(name="Gaussian", ...)

Value

Returns a GBMDist object.

Arguments

name

The name (a string) of the distribution to be initialized and used in fitting a gradient boosted model via gbmt. The current distributions available can be viewed using the function available_distributions. If no distribution is specified this function constructs a Gaussian distribution by default.

...

Extra named parameters required for initializing certain distributions. If the t-distribution is selected, an additional parameter (df) specifying the number of degrees of freedom can be given. The default number of degrees of freedom is four.

If quantile is selected then the quantile to estimate may be specified using the named parameter alpha. The default quantile to estimate is 0.25.

If the tweedie distribution is selected, the power-law parameter specifying the distribution may be set via the named parameter power. This parameter defaults to unity.
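As a rough illustration of the three cases above, the following sketch passes df, alpha and power through the ... argument. It rests on several assumptions that this page does not state: that the functions live in a package called gbm3, and that the distributions are named "TDist", "Quantile" and "Tweedie". The exact name strings should be checked with available_distributions().

```r
# Extra named parameters for three distributions, keyed by distribution name.
# ASSUMPTION: the name strings "TDist", "Quantile" and "Tweedie" are not given
# in this page; confirm the exact spelling with available_distributions().
dist_args <- list(
  TDist    = list(df = 6),       # degrees of freedom (documented default: 4)
  Quantile = list(alpha = 0.9),  # quantile to estimate (documented default: 0.25)
  Tweedie  = list(power = 1.5)   # power-law parameter (documented default: 1)
)

# ASSUMPTION: the package is called gbm3. The calls are guarded so that the
# sketch still runs when the package is not installed.
if (requireNamespace("gbm3", quietly = TRUE)) {
  print(gbm3::available_distributions())
  dists <- lapply(names(dist_args), function(nm) {
    # Equivalent to e.g. gbm_dist(name = "TDist", df = 6)
    do.call(gbm3::gbm_dist, c(list(name = nm), dist_args[[nm]]))
  })
  names(dists) <- names(dist_args)
}
```

Keeping the parameters in a named list is only a convenience for the sketch; calling gbm_dist directly with the named parameter works the same way.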

If a Cox Proportional Hazards model is selected, a number of additional parameters can be specified:

strata

A vector of integers (or factors) specifying which stratum each data row belongs to. If none is specified, all training data is assumed to be in the same stratum.

ties

String specifying the method to be used when dealing with tied event times. Currently only "breslow" and "efron" are available, with the latter being the default.

prior_node_coeff_var

A prior on the coefficient of variation of the hazard rate assigned to each terminal node when fitting a tree. Increasing its value gives more weight to the training data in a node when assigning a prediction to that node. This defaults to 1000.
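A minimal sketch of supplying the Cox-specific parameters might look as follows. The distribution name "CoxPH", the package name gbm3 and the stratum labels are assumptions made for illustration; only strata, ties and prior_node_coeff_var come from the descriptions above.

```r
# Hypothetical stratum labels for six training rows, split into two strata.
strata <- c(1L, 1L, 1L, 2L, 2L, 2L)

cox_args <- list(
  name = "CoxPH",               # ASSUMPTION: name string not given in this page
  strata = strata,              # one stratum label per data row
  ties = "breslow",             # alternative to the default "efron"
  prior_node_coeff_var = 1000   # documented default, written out explicitly
)

# ASSUMPTION: the package is called gbm3. Guarded so the sketch runs without it.
if (requireNamespace("gbm3", quietly = TRUE)) {
  cox_dist <- do.call(gbm3::gbm_dist, cox_args)
}
```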

Finally, if the pairwise distribution is selected, three further parameters are used: group, metric and max_rank. group is a character vector with the column names of data that jointly indicate the group an instance belongs to (typically a query in Information Retrieval). For training, only pairs of instances from the same group and with different target labels are considered. metric is the IR measure to use, one of:

conc: Fraction of concordant pairs; for binary labels, this is equivalent to the Area under the ROC Curve.
mrr: Mean reciprocal rank of the highest-ranked positive instance.
map: Mean average precision, a generalization of mrr to multiple positive instances.
ndcg: Normalized discounted cumulative gain. The score is the weighted sum (DCG) of the user-supplied target values, with each value discounted by log(rank+1), and normalized to the maximum achievable value. This is the default if the user did not specify a metric.

ndcg and conc allow arbitrary target values, while binary targets {0,1} are expected for map and mrr. For ndcg and mrr, a cut-off can be chosen using a positive integer parameter max_rank. If left unspecified, all ranks are taken into account.
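To make the four measures concrete, here is a base R sketch that computes each of them for a single group. It is an illustration of the textbook definitions, not the package's internal implementation: the function names are hypothetical, the base-2 logarithm in ndcg is an assumption, and ties in score receive no special treatment. Note that map as described above would average avg_precision over all groups.

```r
# y: target labels for one group; score: model scores (higher = ranked earlier).
rank_order <- function(score) order(score, decreasing = TRUE)

# Fraction of pairs with different labels that the scores order correctly.
conc <- function(y, score) {
  pairs <- which(outer(y, y, ">"), arr.ind = TRUE)  # rows (i, j) with y[i] > y[j]
  if (nrow(pairs) == 0) return(NA_real_)
  mean(score[pairs[, 1]] > score[pairs[, 2]])
}

# Reciprocal rank of the highest-ranked positive; 0 if it lies beyond max_rank.
mrr <- function(y, score, max_rank = length(y)) {
  first_pos <- which(y[rank_order(score)] > 0)[1]
  if (is.na(first_pos) || first_pos > max_rank) return(0)
  1 / first_pos
}

# Average precision: mean of precision-at-rank over the ranks of the positives.
avg_precision <- function(y, score) {
  pos <- which(y[rank_order(score)] > 0)
  if (length(pos) == 0) return(0)
  mean(seq_along(pos) / pos)
}

# DCG with a 1 / log2(rank + 1) discount, normalized by the best achievable DCG.
ndcg <- function(y, score, max_rank = length(y)) {
  k <- min(max_rank, length(y))
  discount <- 1 / log2(seq_len(k) + 1)
  dcg <- sum(y[rank_order(score)][seq_len(k)] * discount)
  ideal <- sum(sort(y, decreasing = TRUE)[seq_len(k)] * discount)
  if (ideal == 0) return(0)
  dcg / ideal
}

# Toy group: binary labels, with positives ranked second and fourth.
y <- c(0, 1, 0, 1)
score <- c(0.9, 0.8, 0.3, 0.1)

conc(y, score)                # 0.25
mrr(y, score)                 # 0.5
mrr(y, score, max_rank = 1)   # 0
avg_precision(y, score)       # 0.5
ndcg(y, score)
```

In the toy group only one of the four positive/negative pairs is ordered correctly, which is why conc is 0.25, and the max_rank = 1 call shows how the cut-off zeroes out mrr when the first positive sits below it.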

Author

James Hickey