kparts: K-Partitions Clustering

Description

Unsupervised vector partitioning.

Usage

kparts(x, y, parts, maxiter = 50, trials = 3, nblind = FALSE, trialprint = TRUE, iterprint = FALSE)

Arguments

x

The numeric vector to be partitioned.

y

The numeric response vector used to partition x.

parts

The desired number of partitions.

maxiter

The maximum number of iterations allowed for each trial. If the trial has not converged by then, it stops once this number of iterations is reached. The default is 50 iterations.

trials

The number of times the algorithm is run with new, randomly assigned partitions. The default number of trials is 3.

nblind

If TRUE, the algorithm will ignore the sum of squares within each unique value of x. The default is FALSE.

trialprint

If TRUE, the trial number and the sum of squares will print while the algorithm is running. The default is TRUE.

iterprint

If TRUE, the iteration number and sum of squares will print while the algorithm is running. The default is FALSE.

Value

partitions: A data frame giving the index of each partition and the range of x over which it extends.
data: A data frame containing the partition index (parts), the unique values of x, the average of y, and the range of the partition.

Details

kparts finds the best contiguous partitions for x by minimizing the sum of squares of y.
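As an illustration of the quantity being minimized, the base R sketch below scores one candidate set of contiguous partitions. The helper partition_ss and the toy data are hypothetical and are not part of the package; how kparts searches over candidate break points is not shown here.

```r
# Hypothetical helper (not part of the package): total within-partition
# sum of squares of y when x is cut at the given interior break points.
partition_ss <- function(x, y, breaks) {
  # findInterval() maps each x to a contiguous interval index (0-based).
  part <- findInterval(x, breaks) + 1L
  # Squared deviations of y from its partition mean, summed over partitions.
  sum(tapply(y, part, function(v) sum((v - mean(v))^2)))
}

# Toy data: y jumps from 0 to 1 between x = 3 and x = 4.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

ss_good <- partition_ss(x, y, breaks = 4)  # partitions {1,2,3} and {4,5,6}
ss_bad  <- partition_ss(x, y, breaks = 3)  # partitions {1,2} and {3,4,5,6}
```

The split that follows the jump in y gives the smaller sum of squares, which is the kind of split the minimization favors.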

The sum of squares for a unique value of x cannot be partitioned, which has the effect of weighting unique values of x by the number of observations at those values. Setting nblind = TRUE causes kparts to ignore the number of observations and treat all x values as equally weighted.
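The effect of this weighting can be seen on a small made-up example. The sketch below is one reading of the two behaviors, assuming the unweighted variant works on the per-value means of y; it is not taken from the package source.

```r
# Toy data: x = 1 has four observations, x = 2 and x = 3 have one each.
x <- c(1, 1, 1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)

# Observation-level sum of squares: every row counts, so x = 1
# contributes four times as many terms as x = 2 or x = 3.
ss_obs <- sum((y - mean(y))^2)

# Assumed unweighted variant: reduce y to one mean per unique x,
# then treat each unique x value as a single, equally weighted point.
ybar <- tapply(y, x, mean)
ss_means <- sum((ybar - mean(ybar))^2)
```

With these numbers the two sums differ, so the break points that minimize one need not minimize the other.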

kparts can take a long time to process datasets with large numbers of unique x values. To gain efficiency, pre-processing vector x by binning is recommended.
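One simple way to bin x before calling kparts is to round it down to a fixed bin width. The data and the width of 5 below are made up for illustration, and the kparts call is left commented out because the right number of partitions depends on the data.

```r
set.seed(1)
# Made-up data: 1000 continuous "ages" with nearly 1000 unique values.
x <- runif(1000, min = 18, max = 90)
y <- rbinom(1000, size = 1, prob = 0.2)

# Represent each observation by the lower edge of its 5-unit bin.
width <- 5
x_binned <- floor(x / width) * width

n_unique_raw    <- length(unique(x))
n_unique_binned <- length(unique(x_binned))

# x_binned can then be used in place of x, for example:
# kp <- kparts(x = x_binned, y = y, parts = 5)
```

Wider bins reduce the number of unique values further, at the cost of coarser partition boundaries.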

Examples

  # plot readmission rates against age. 
  data(ipadmits)
  attach(ipadmits)
  ipadmits.summary = data.frame("AvgReadmission" = tapply(ipadmits$isReadmission
                                                          ,ipadmits$Age
                                                          ,mean)
                                ,"AvgCost" = tapply(ipadmits$cost
                                                    ,ipadmits$Age
                                                    ,mean))
  # use the actual ages on the x axis rather than the row index
  plot(sort(unique(ipadmits$Age)),ipadmits.summary$AvgReadmission,xlab = "Age",ylab = "AvgReadmission")
  
  
  # find the best partitions of age against readmission rate. 
  # run kparts with 4 trials with 5 partitions
  kp = kparts(x = ipadmits$Age,y = ipadmits$isReadmission,parts = 5,trials = 4)
  # list value range for each partition
  kp$partitions
  plot(kp)
  # run with 7 partitions and ignore number of samples per age
  # when computing error
  kp = kparts(ipadmits$Age,ipadmits$isReadmission,parts = 7,trials = 5,nblind = TRUE)
  kp$partitions
  plot(kp)
  detach(ipadmits)
