The function prepareCGOneFactorData reads in a data frame and settings in order to create a cgOneFactorData object. The created object is designed to have exploratory and fit methods applied to it.
prepareCGOneFactorData(dfr, format = "listed", analysisname = "",
endptname = "", endptunits = "", logscale = TRUE, zeroscore = NULL,
addconstant = NULL, rightcensor = NULL, leftcensor = NULL, digits = NULL,
refgrp = NULL, stamps = FALSE)
A cgOneFactorData
object is returned, with the following slots:
The original input data frame that is the specified value of the
dfr
argument in the function call.
Processed version of the input data frame, which will be used for the various evaluation methods.
A list version of the input data frame, which will only
differ from the dfr
value if the input data frame was specified in the
groupcolumns
format.
Boolean TRUE
or FALSE
on whether there are any
censored data observations.
A list of properties associated with the data frame:
analysisname
Drawn from the input argument value of
analysisname
.
endptname
Drawn from the input argument value of
endptname
.
endptunits
Drawn from the input argument value of
endptunits
.
endptscale
Has the value of "log"
if
logscale=TRUE
and "original"
if
logscale=FALSE
.
zeroscore
Has the value of NULL
if the input argument
was NULL
. Otherwise has the derived (from
zeroscore="estimate"
)
or specified numeric value.
addconstant
Has the value of NULL
if the input argument
was NULL
. Otherwise has the specified numeric value.
rightcensor
Has the value of the input argument
rightcensor
or is set to NULL
if no censored
observations are determined.
leftcensor
Has the value of the input argument
leftcensor
or
is set to NULL
if no censored
observations are determined.
digits
Has the value of the input argument
digits
or is set to the determined value of digits from the
input data. Will be an integer of 0, 1, 2, 3, or 4.
grpnames
The group names, determined from the levels of the single
factor. Their order is determined by first occurrence in the
input data frame dfr
.
refgrp
Drawn from the input argument of refgrp
.
stamps
Drawn from the input argument of stamps
.
dfr
A valid data frame; see the format
argument.
format
Default value of "listed"
. Either "listed"
or
"groupcolumns"
must be used. Abbreviations of "l"
or "g"
, respectively,
or otherwise sufficient matching values can be used:
"listed"
At least two columns, with the factor levels in the first column and response values in the second column. If there is censored data, then two or three more columns are required, see the Details Input Data Frame section below.
"groupcolumns"
Each column must represent a group. Each
group is a unique level of the one factor, so the levels of the factor
make up the column headers. The values in the data frame are for
the response. If the groups have unequal sample sizes, the empty
cells within the data frame can have NA
's or be left
blank. Censored values can be represented; see the Details Input
Data Frame section below. Otherwise, any character data will be coerced to
numeric data with possibly undesirable results.
analysisname
Optional, a character text or
math-valid expression that will be set for
default use in graph title and table methods. The default
value is the empty ""
.
endptname
Optional, a character text or math-valid expression
that will be set for default use as the y-axis label of graph
methods, and also used for table methods. The default
value is the empty ""
.
endptunits
Optional, a character text or math-valid
expression that can be used in combination with the endptname
argument.
Parentheses are
automatically added to this input, which will be added to the end
of the endptname character value or expression. The default
value is the empty ""
.
logscale
Apply a log-transformation to the data for
evaluations. The default value is TRUE
.
zeroscore
Optional,
replace response values of zero with a derived or specified
numeric value, as an approach to overcome the presence of zeroes
when evaluation in the
logarithmic scale (logscale=TRUE
) is specified. The default value
is NULL
. To derive a score value to replace zero,
"estimate"
can be specified, see Details below on the algorithm used.
addconstant
Optional,
add a numeric constant to all response values, as an
approach to overcome the presence of zeroes when evaluation in the
logarithmic scale (logscale=TRUE) is desired. The default value is
NULL
. A positive numeric value can be specified to be added, or the "simple"
or "VR" algorithm can be specified to estimate a value to add. See the Details section
below on the algorithms used.
rightcensor
Optional, can be specified with a numeric
value where any value equal to or greater will be regarded as
right censored in the evaluation. The value of TRUE
can be
used to coerce a binary status variable in the data frame to be
right censored for its values. The default value is NULL
.
See the Details Input Data Frame section
below for specifications and consequences.
leftcensor
Optional, can be specified with a numeric
value where any value equal to or less will be regarded as
left censored in the evaluation. The value of TRUE
can be
used to coerce a binary status variable in the data frame to be
left censored for its values. The default value is NULL
.
See the Details Input Data Frame section
below for specifications and consequences.
digits
Optional, for output display purposes in graphs
and table methods, values will be rounded to this numeric
value. Only the integers of 0, 1, 2, 3, and 4 are accepted. No
rounding is done during any calculations. The default value is
NULL
, which will examine each individual data value and choose the
one that has the maximum number of digits after any trailing
zeroes are ignored. The max number of digits will be 4.
refgrp
Optional, specify one of the factor levels to be the
“reference group”, such as a “control” group.
The default value is NULL
,
which will just use the first level determined in the data frame.
stamps
Optional, specify a time stamp in graphs, along
with cg package
version identification. The default value is FALSE
.
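The default rule for digits described above (count the decimal places of each value after dropping trailing zeroes, take the largest count, and cap it at 4) might be sketched in base R as follows. The helper name guess_digits and its formatting approach are assumptions made for illustration; this is not the cg package's own code.

```r
# Hypothetical helper, not part of cg: guess the display digits for a
# numeric vector by counting decimal places, capped at max_digits.
guess_digits <- function(x, max_digits = 4) {
  x <- x[!is.na(x)]
  # Print every value with max_digits decimals, then drop trailing zeroes.
  txt <- formatC(x, format = "f", digits = max_digits)
  txt <- sub("0+$", "", txt)
  # Whatever remains after the decimal point is the number of digits used.
  decimals <- nchar(sub("^[^.]*\\.?", "", txt))
  min(max(decimals), max_digits)
}

guess_digits(c(1.5, 2, 0.125))  # three decimals are needed for 0.125
guess_digits(c(10, 20, 30))     # whole numbers need none
```

Because the values are first printed with four decimals, anything finer than that is rounded away before counting, which mirrors the stated maximum of 4.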
Bill Pikounis [aut, cre, cph], John Oleynick [aut], Eva Ye [ctb]
The input data frame dfr
can be of the format
"listed"
or "groupcolumns"
. Another distinguishing
characteristic is whether or not it contains censored data
representations.
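As a rough illustration of the two layouts, the sketch below builds a small "listed" data frame and reshapes it into the "groupcolumns" layout using base R only. The group names, response values, and column names are made up for the example.

```r
# "listed" layout: factor levels in column 1, responses in column 2.
# Groups A, B, C and their values are hypothetical.
listed <- data.frame(
  grp  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  resp = c(1.2, 0.9, 1.5, 2.1, 2.4, 3.3, 2.9, 3.8)
)

# Split the response by group, keeping groups in order of first appearance.
by_group <- split(listed$resp, factor(listed$grp, levels = unique(listed$grp)))

# "groupcolumns" layout: one column per group, with shorter groups
# padded by NA so that all columns have the same length.
n_max <- max(lengths(by_group))
groupcolumns <- as.data.frame(
  lapply(by_group, function(v) c(v, rep(NA, n_max - length(v))))
)
groupcolumns
```

Either data frame could then be passed as dfr with the matching format value.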
Censored observations can be represented by <
for
left-censoring
and >
for
right-censoring. The <
value refers to values less than or equal
to a numeric value. For example, <0.76
denotes a left-censored
value of 0.76
or less. Similarly, >2.02
denotes a value of 2.02 or greater for
a right-censored value. There must be no space between the direction
indicator and the numeric value. These representations can be used in
either the listed
or groupcolumns
formats for dfr
.
No interval-censored representations are currently handled when
format="groupcolumns"
.
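To make the < and > representations concrete, here is a minimal base R sketch that splits such character values into a numeric value and a censoring status. The helper parse_censored is hypothetical, and the 0/1/2 status coding is the Surv-style convention this page uses for the listed format; how cg parses these strings internally is not shown here.

```r
# Hypothetical helper, not part of cg: parse "<0.76", ">2.02", or "1.30".
parse_censored <- function(x) {
  x <- trimws(as.character(x))
  direction <- substr(x, 1, 1)
  # Strip the direction indicator, if any, and convert the rest to numeric.
  value <- as.numeric(sub("^[<>]", "", x))
  # 0 = right censored, 1 = not censored, 2 = left censored.
  status <- ifelse(direction == ">", 0L, ifelse(direction == "<", 2L, 1L))
  data.frame(value = value, status = status)
}

parsed <- parse_censored(c("1.30", "<0.76", ">2.02"))
parsed
```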
If format="groupcolumns"
for dfr
is specified, then the
number of columns must equal the number of groups, and any censored
values must follow the <
and >
representations.
The individual group values are of mode character, since any
censored values will be represented for example as <0.76
or
>2.02
. If any of the groups have fewer
observations than the others, i.e. there are unequal sample sizes,
then the corresponding "no data" cells in the data frame need to
contain empty quote ""
values.
If format="listed"
for dfr
is specified, then there may be
anywhere from two to four columns for an input data frame.
two columns
The first column has the group levels to define the
factor, and the second column contains the response values. Censored
representations of <
and >
can be used here. One or
both of
rightcensor
or leftcensor
may also be specified as a
number. If
a number is specified for rightcensor
, then all values in
the second column equal to this value will be processed as
right-censored. Analogously, if
a number is specified for leftcensor
, then all values in
the second column equal to this value will be processed as
left-censored. WARNING: This should be used cautiously to make sure the
equality occurs as desired. This convention is designed for simple
Type I censoring scenarios.
three columns
Like the two column case, the first column has
the group
levels to define the
factor, and the second column contains the response values, which will
all be coerced to numeric. Any censoring information must be specified
in the third column. Borrowing the convention of Surv
from the survival package, 0
=right censored, 1
=no censoring, and
2
=left censored. If rightcensor=NULL
and
leftcensor=NULL
are left as defaults in the call, and
values of 0, 1, and 2 are all represented, then the
processing will create a suitable data frame dfru
for
modeling that the canonical survreg
function understands.
However, if 0 and 1 are the only specified values
in the third censoring status column, then one of
rightcensor=TRUE
or leftcensor=TRUE
must be specified,
but NOT both, or an error message will occur. A column of all 1's or
all 0's will also raise an error message.
four columns
Like the two column case, the first column has
the group
levels to define the
factor. The second and third columns need to have numeric response
information, and the fourth column needs to have censoring
status. This is the most general representation, where any combination
of left-censoring, right-censoring, and interval-censoring is
permitted. The rightcensor
and leftcensor
input
arguments are ignored and set to NULL
. IMPORTANT: The
convention of Surv
from the survival package with
type="interval"
is followed: 0=right censored, 1=no censoring,
2=left censored, and 3=interval censored. For status=0, 1, and 2, the second and
third columns match in value, so that the status variable in the
fourth column distinguishes the lower and upper bounds for the
right-censored (0) and left-censored (2) cases.
For status=3, the two values differ to
define the interval boundaries. The
processing will create a suitable data frame dfru
for
modeling that the canonical survreg
and survfit
functions from the survival package understand.
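The three- and four-column layouts can be illustrated with small made-up data frames. The sketch below also encodes two of the rules stated above: a binary 0/1 status column needs exactly one of rightcensor=TRUE or leftcensor=TRUE, and in the four-column layout the two response columns match unless status is 3. The helper binary_direction and all column names are hypothetical and not taken from cg.

```r
# Three-column layout: group, response, censoring status (made-up values).
listed3 <- data.frame(
  grp    = c("A", "A", "B", "B"),
  resp   = c(1.2, 5.0, 2.1, 5.0),
  status = c(1, 0, 1, 0)
)

# Hypothetical check of the rule for a status column holding only 0 and 1.
binary_direction <- function(status, rightcensor = NULL, leftcensor = NULL) {
  codes <- sort(unique(status))
  if (length(codes) == 1 && codes %in% c(0, 1)) {
    stop("a status column of all 0's or all 1's is not allowed")
  }
  # Other code combinations are passed through in this sketch.
  if (!identical(as.numeric(codes), c(0, 1))) return("as coded")
  right <- isTRUE(rightcensor)
  left  <- isTRUE(leftcensor)
  if (right == left) {
    stop("specify exactly one of rightcensor=TRUE or leftcensor=TRUE")
  }
  if (right) "right" else "left"
}

binary_direction(listed3$status, rightcensor = TRUE)

# Four-column layout: group, lower bound, upper bound, status.
listed4 <- data.frame(
  grp    = c("A", "A", "B", "B"),
  lower  = c(1.2, 5.0, 0.3, 2.0),
  upper  = c(1.2, 5.0, 0.3, 3.5),
  status = c(1, 0, 2, 3)
)

# Bounds match for status 0, 1, 2 and differ for interval censoring (3).
bounds_ok <- ifelse(listed4$status == 3,
                    listed4$lower < listed4$upper,
                    listed4$lower == listed4$upper)
all(bounds_ok)
```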
If zeroscore="estimate"
is specified, a number
close to zero is derived to replace all zeroes for subsequent
log-scale analyses. A spline fit (using spline
and
method="natural"
)
of the log of the
response vector on the original response vector is performed. The
zeroscore is then derived from the log-scale value of the spline curve at the original
scale value of zero. This approach comes from the concept of
arithmetic-logarithmic scaling discussed in Tukey, Ciminera, and
Heyse (1985).
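A minimal sketch of this idea, using stats::spline with method="natural", is given below. It assumes that the spline is fitted to the sorted unique positive responses and that the log-scale value at zero is back-transformed with exp() to obtain the replacement score; these details, and the helper name estimate_zeroscore, are assumptions rather than the package's exact implementation.

```r
# Hypothetical helper, not part of cg: derive a score to replace zeroes.
estimate_zeroscore <- function(y) {
  # Only positive values have a finite log; use each distinct value once.
  pos <- sort(unique(y[!is.na(y) & y > 0]))
  # Natural spline of log(response) on response, evaluated at zero.
  # Outside the data range a natural spline extrapolates linearly.
  fit <- spline(x = pos, y = log(pos), method = "natural", xout = 0)
  # Back-transform the log-scale value to the original scale (assumption).
  exp(fit$y)
}

y <- c(0, 0.5, 1.2, 3, 7.5)       # made-up responses with one zero
zs <- estimate_zeroscore(y)
y_filled <- replace(y, y == 0, zs)
y_filled
```

For these made-up data the derived score lies between zero and the smallest positive response, so the log transformation can then be applied to every value.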
If addconstant="simple"
or
addconstant="VR"
is specified, a number is derived and added
to all response values.
"simple"
Taken from the "white" book on S (Chambers and Hastie, 1992),
page 68. The range (max - min
) of the response values
is multiplied by 0.0001
to derive the number to add to all the
response values.
"VR"
Based on the logtrans
function discussed in Venables and Ripley
(2002), pages 171-172 and available in the MASS
package. The algorithm applies a Box-Cox profile likelihood
approach with a log scale translation model.
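The "simple" rule is short enough to write out directly. The sketch below follows the description above (0.0001 times the range of the responses); the helper name simple_addconstant and the example values are made up, and the "VR" option is not reproduced here.

```r
# Hypothetical helper, not part of cg: the "simple" constant is
# 0.0001 times (max - min) of the response values.
simple_addconstant <- function(y) {
  0.0001 * diff(range(y, na.rm = TRUE))
}

y <- c(0, 0.5, 1.2, 3, 7.5)        # made-up responses with one zero
const <- simple_addconstant(y)     # 0.0001 * (7.5 - 0) = 0.00075
y_shifted <- y + const             # every value is now strictly positive
```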
Tukey, J.W., Ciminera, J.L., and Heyse, J.F. (1985). "Testing the Statistical Certainty of a Response to Increasing Doses of a Drug," Biometrics, Volume 41, 295-301.
Chambers, J.M., and Hastie, T.J. (1992), Statistical Models in S. Chapman & Hall/CRC.
Venables, W. N., and Ripley, B. D. (2002), Modern Applied Statistics with S. Fourth edition. Springer.
Surv, canine, gmcsfcens, prepare
library(cg)

## Uncensored data in the groupcolumns format
data(canine)
canine.data <- prepareCGOneFactorData(canine, format="groupcolumns",
                                      analysisname="Canine",
                                      endptname="Prostate Volume",
                                      endptunits=expression(plain(cm)^3),
                                      digits=1, logscale=TRUE, refgrp="CC")

## Censored Data
data(gmcsfcens)
gmcsfcens.data <- prepareCGOneFactorData(gmcsfcens, format="groupcolumns",
                                         analysisname="cytokine",
                                         endptname="GM-CSF (pg/ml)",
                                         logscale=TRUE)