Fit a polychotomous regression and multiple classification using linear splines and selected tensor products.
polyclass(data, cov, weight, penalty, maxdim, exclude, include,
additive = FALSE, linear, delete = 2, fit, silent = TRUE,
normweight = TRUE, tdata, tcov, tweight, cv, select, loss, seed)
data: vector of classes; should range over consecutive integers with 0 or 1 as the minimum value.
cov: covariates; a matrix with as many rows as the length of data.
weight: optional vector of case weights. Should have the same length as data.
penalty: the parameter to be used in the AIC criterion if the model selection is carried out by AIC. The program chooses the number of knots that minimizes -2 * loglikelihood + penalty * dimension. The default is penalty = log(length(data)), as in BIC. If the model selection is carried out by cross-validation or using a test set, the program uses the number of knots that minimizes loss + penalty * dimension * (loss for smallest model); in this case the default penalty is 0.
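For example (using hypothetical objects y, a class vector, and x, a covariate matrix), classical AIC can be requested by overriding the BIC-style default:

# penalty = 2 gives classical AIC; the default, log(length(y)), gives BIC
fit.aic <- polyclass(y, x, penalty = 2)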
maxdim: maximum dimension. The default is \(\min(n, 4 * n^{1/3}) * (cl - 1)\), where \(n\) is length(data) and \(cl\) is the number of classes.
exclude: combinations to be excluded; a matrix with two columns. If, for example, exclude[1, 1] = 2 and exclude[1, 2] = 3, no interaction between covariates 2 and 3 is included. 0 represents time.
include: those combinations that can be included. Should have the same format as exclude. Only one of exclude and include can be specified.
additive: should the model selection be restricted to additive models?
linear: vector indicating the variables for which no knots should be entered. For example, if linear = c(2, 3), no knots are entered for covariates 2 and 3. 0 represents time.
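For example (again with hypothetical y and x), the interaction between covariates 2 and 3 can be ruled out while forcing covariate 1 to enter linearly:

# exclude is a two-column matrix; each row names a forbidden pair of covariates
fit.res <- polyclass(y, x, exclude = matrix(c(2, 3), ncol = 2), linear = 1)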
delete: should complete basis functions be deleted at once (2), should only individual dimensions be deleted (1), or should only the addition stage of the model selection be carried out (0)?
fit: a polyclass object. If fit is specified, polyclass adds basis functions starting with those in fit.
silent: suppresses the printing of diagnostic output about basis functions added or deleted, Rao statistics, Wald statistics, and log-likelihoods.
normweight: should the weights be normalized so that they average to one? This option only has an effect if the model is selected using AIC.
tdata, tcov, tweight: test set. Should satisfy the same requirements as data, cov, and weight. If all test set weights are one, tweight can be omitted. If tdata and tcov are specified, the model selection is carried out using this test set, irrespective of the input for penalty or cv.
cv: in how many subsets should the data be divided for cross-validation? If cv is specified and tdata is omitted, the model selection is carried out by cross-validation.
select: if a test set is provided, or if the model is selected using cross-validation, should the model be selected that minimizes (misclassification) loss (0), maximizes test set log-likelihood (1), or minimizes test set squared error loss (2)?
loss: a rectangular matrix specifying the loss function; its size is the number of classes times the number of actions. Used for cross-validation and test set model selection. loss[i, j] contains the loss for assigning action j to an object whose true class is i. The default is 1 minus the identity matrix. loss does not need to be square.
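As a sketch (hypothetical y and x), an asymmetric loss matrix for a two-class problem could be supplied as:

# Misclassifying class 1 as action 2 costs 5; class 2 as action 1 costs 1.
# The default would be 1 - diag(2).
lmat <- matrix(c(0, 5, 1, 0), nrow = 2, byrow = TRUE)
fit.loss <- polyclass(y, x, cv = 5, loss = lmat)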
seed: optional seed for the random number generator that determines the sequence of the cases for cross-validation. If the seed has length 12 or more, the first twelve elements are assumed to be .Random.seed; otherwise the function set.seed is used. If seed is 0 or rep(0, 12), it is assumed that the user has already provided a (random) ordering. If seed is not provided while a fit with an element fit$seed is provided, .Random.seed is set using set.seed(fit$seed). Otherwise the present value of .Random.seed is used.
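For reproducible cross-validation (hypothetical y and x), a scalar seed can be passed through to set.seed via the seed argument:

# five-fold cross-validation with a reproducible fold ordering
fit.cv <- polyclass(y, x, cv = 5, seed = 17)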
The output is an object of class polyclass, organized to serve as input for plot.polyclass, beta.polyclass, summary.polyclass, ppolyclass (fitted probabilities), cpolyclass (fitted classes), and rpolyclass (random classes).
The function returns a list with the following members:
the command that was executed.
number of covariates.
number of dimensions of the fitted model.
number of classes.
number of basis functions.
number of possible actions that are considered.
matrix of size nbas x (nclass + 4). Each row is a basis function.
First element: first covariate involved (NA = constant);
second element: which knot (NA means constant or linear);
third element: second covariate involved (NA means this is a function of one variable);
fourth element: knot involved (of no relevance if the third element is NA);
fifth, sixth, ... elements: beta (coefficient) for class one, two, ...
a matrix with ncov rows. Covariate i has row i + 1; time has row 1. First column: number of knots in this dimension; other columns: the knots, appended with NAs to make it a matrix.
in how many sets was the data divided for cross-validation. Only provided if method = 2.
the loss matrix used for cross-validation and test set model selection. Only provided if method = 1 or method = 2.
the parameter used in the AIC criterion. Only provided if method = 0.
0 = AIC, 1 = test set, 2 = cross-validation.
column i gives the range of the i-th covariate.
matrix with eight or eleven columns; summarizes the fitted models. Column one indicates the dimension; column two the AIC or loss value, whichever was used during the model selection; columns three, four, and five give the training set log-likelihood, (misclassification) loss, and squared error loss; columns six to eight give the same information for the test set; column nine (or column six if method = 0 or method = 2) indicates whether the model was fitted during the addition stage (1) or during the deletion stage (0); columns ten and eleven (or seven and eight) give the minimum and maximum penalty parameter for which AIC would have selected this model.
sample size.
the sample size of the test set. Only provided if method = 1.
sum of the case weights.
names of the covariates.
(numerical) names of the classes.
the penalty value that was determined optimal by cross-validation. Only provided if method = 2.
table with three columns. Columns one and two indicate the penalty parameter range for which the cv-loss in column three would be realized. Only provided if method = 2.
the random seed that was used to determine the order of the cases for cross-validation. Only provided if method = 2.
were complete basis functions deleted at once (2), were only individual dimensions deleted (1), or was only the addition stage of the model selection carried out (0)?
moments of basis functions. Needed for beta.polyclass.
if a test set is provided, or if the model is selected using cross-validation, was the model selected that minimized (misclassification) loss (0), maximized test set log-likelihood (1), or minimized test set squared error loss (2)?
matrix with three columns. The first two elements in a line indicate the subspace to which the line refers. The third element indicates the percentage of variance explained by that subspace.
sum of the test set case weights (only if method = 1).
Charles Kooperberg, Smarajit Bose, and Charles J. Stone (1997). Polychotomous regression. Journal of the American Statistical Association, 92, 117--127.
Charles J. Stone, Mark Hansen, Charles Kooperberg, and Young K. Truong (1997). The use of polynomial splines and their tensor products in extended linear modeling (with discussion). Annals of Statistics, 25, 1371--1470.
polymars, plot.polyclass, summary.polyclass, beta.polyclass, cpolyclass, ppolyclass, rpolyclass.
# NOT RUN {
data(iris)
fit.iris <- polyclass(iris[,5], iris[,1:4])
# }
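The fitted object can then be passed to the companion functions; for instance (argument names as documented in their own help pages, shown here as a sketch):

# NOT RUN {
data(iris)
fit.iris <- polyclass(iris[, 5], iris[, 1:4])
phat <- ppolyclass(cov = iris[, 1:4], fit = fit.iris)  # fitted probabilities
chat <- cpolyclass(cov = iris[, 1:4], fit = fit.iris)  # fitted classes
# }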