The FIT package is an R implementation of a class of transcriptomic models
that relate the gene expression of plants to the weather conditions
to which the plants are exposed.
(See [Nagano et al.] for details of this class of models.)
By providing
(a) gene expression profiles of plants grown under field conditions
and (b) the relevant weather history (temperature etc.) of that field,
the user of the package is able to
(1) construct optimized models (one per gene) for gene expression
and
(2) use them to predict expression under another weather history
(possibly in a different field).
Below, we briefly explain
the construction of the optimized models (``training phase'')
and the way to use them to make predictions (``prediction phase'').
Model training phase
The models of [Nagano et al.] belong to the class of statistical models
called ``linear models''
and are specified by a set of ``parameters'' and
``(linear regression) coefficients''.
The former are used to convert weather conditions into
the ``input variables'' for a regression, and the latter are then
multiplied by the input variables to form the expected values
of the gene expressions.
The reader is referred to the original article [Nagano et al.]
for the formulas for the input variables.
(See also [Iwayama] for a review.)
The training phase consists of three stages:
Init
: fixes the initial model parameters
Optim
: optimizes the model parameters
Fit
: fixes the linear regression coefficients
The user can configure the training phase
through a custom data structure (``recipe''),
which can be constructed using the utility function
FIT::make.recipe().
The role of the first stage, Init, is to fix the initial values
of the model parameters from which the parameter optimization starts.
At the moment two methods, 'manual' and 'gridsearch', are implemented.
With the 'manual' method the user simply specifies the set of
initial values that they consider promising.
With the 'gridsearch' method the user discretizes
the parameter space into a grid by providing
a finite number of candidate values for each parameter.
FIT then searches the grid
for the ``best'' combination of initial parameters.
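
The idea behind a grid search of this kind can be sketched in base R as follows. This is only an illustration under stated assumptions and not FIT's implementation: the toy input variable (a trailing mean of the temperature excess over a threshold), the parameter names `threshold` and `period`, and the simulated data are all hypothetical.

```r
# Toy weather series and samples (simulated; not FIT data).
set.seed(1)
n.hours <- 200
temperature <- 20 + 5 * sin(2 * pi * seq_len(n.hours) / 24) + rnorm(n.hours)
sample.times <- seq(50, 200, by = 10)

# Hypothetical input variable: mean excess of temperature over `threshold`
# during the `period` hours up to and including time t.
input.variable <- function(threshold, period, t) {
  excess <- pmax(temperature - threshold, 0)
  mean(excess[(t - period + 1):t])
}

# Simulated expression levels generated with threshold = 21 and period = 6.
true.input <- sapply(sample.times, function(t) input.variable(21, 6, t))
expr.level <- 1 + 2 * true.input + rnorm(length(sample.times), sd = 0.05)

# A finite set of candidate values per parameter; expand.grid() enumerates
# every combination, i.e. every point of the grid.
grid <- expand.grid(threshold = c(15, 18, 21, 24), period = c(3, 6, 12, 24))

# Score each grid point by the residual sum of squares of a linear fit.
grid$rss <- mapply(function(threshold, period) {
  x <- sapply(sample.times, function(t) input.variable(threshold, period, t))
  sum(residuals(lm(expr.level ~ x))^2)
}, grid$threshold, grid$period)

# The grid point with the smallest error becomes the initial parameter set.
best <- grid[which.min(grid$rss), ]
print(best)
```

The cost grows with the product of the numbers of candidate values, so a coarse grid is usually enough when a later optimization step refines the result.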
The second stage, Optim, is the main step of model training:
FIT tries to gradually improve the model parameters
using the Nelder-Mead method.
This stage can be run one or more times, and each run uses
one of the methods 'none', 'lm' or 'lasso'.
The 'none' method passes the given parameters as-is
to the next method in the Optim pipeline or to the next stage, Fit.
(Basically, the method is there so that the user can skip the entire
Optim stage, but it could be used for slightly warming up the CPU as well.)
The 'lm' method uses a simple (weighted) linear regression to
guide the parameter optimization. That is, FIT
first computes the ``input variables'' from the current parameters and
the associated weather data, and then finds the set of linear coefficients
that best explains the ``output variables'' (gene expressions).
Finally, the sum of squared residuals is used as the measure of the
error and is fed back to the Nelder-Mead method.
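
The loop just described (parameters, then input variables, then a weighted linear fit, then a residual fed back to Nelder-Mead) can be sketched with base R's optim() and lm(). This is a minimal sketch under assumptions, not FIT's code: the exponentially weighted input variable, the parameter vector (threshold, log decay time), and the simulated data are hypothetical.

```r
# Toy weather series and sampling times (simulated; not FIT data).
set.seed(42)
n.hours <- 240
temperature <- 20 + 5 * sin(2 * pi * seq_len(n.hours) / 24) + rnorm(n.hours)
sample.times <- seq(48, 240, by = 7)

# Hypothetical input variable: exponentially weighted mean of the
# temperature excess over a threshold during the preceding 48 hours.
input.variable <- function(params, t) {
  threshold <- params[1]
  tau <- exp(params[2])              # decay time in hours, kept positive
  lags <- 0:47
  kernel <- exp(-lags / tau)
  sum(kernel * pmax(temperature[t - lags] - threshold, 0)) / sum(kernel)
}

# Simulated expression levels and per-sample weights.
true.params <- c(21, log(6))
x.true <- sapply(sample.times, input.variable, params = true.params)
expr.level <- 1 + 2 * x.true + rnorm(length(sample.times), sd = 0.1)
sample.weight <- rep(1, length(sample.times))

# Objective: build the input variable from the current parameters, fit the
# linear coefficients by weighted least squares, and return the weighted
# sum of squared residuals as the error.
objective <- function(params) {
  x <- sapply(sample.times, input.variable, params = params)
  fit <- lm(expr.level ~ x, weights = sample.weight)
  sum(sample.weight * residuals(fit)^2)
}

# Nelder-Mead only needs objective values, so the regression can sit
# inside the objective without any gradient information.
start <- c(19, log(3))
result <- optim(start, objective, method = "Nelder-Mead")
print(result$par)
print(result$value)
```

Note that the linear coefficients are never optimized by Nelder-Mead directly; they are re-estimated by the regression at every evaluation of the objective.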
The 'lasso' method is similar to the 'lm' method
but uses the (weighted) Lasso regression
(``linear'' regression with an L1-regularization of the regression coefficients)
instead of the simple linear regression.
FIT uses the glmnet package to perform
the Lasso regression, and the strength of the L1-regularization
is fixed via cross-validation. (See cv.glmnet() from the glmnet package.)
The Lasso regression is said to suppress irrelevant input variables automatically
and tends to create models with better predictive ability.
On the other hand, 'lasso' runs considerably slower than 'lm'.
For example, passing the vector c('lm', 'lasso') to the
argument optim of make.recipe() creates a recipe
that instructs the Optim stage to
(1) first optimize using the 'lm' method,
and (2) then fine-tune the parameters using the 'lasso' method.
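
The chaining of methods within the Optim stage can be pictured as a fold over the vector of method names, each method receiving the parameters left by the previous one. The sketch below only illustrates this ordering. The functions in `optim.methods`, the toy error function, and `run.optim.stage()` are hypothetical stand-ins and are not part of FIT; in particular, the stand-ins for 'lm' and 'lasso' differ only in their convergence tolerance, whereas the real methods differ in the regression they run.

```r
# Toy error surface with its minimum at (1.5, -0.5).
toy.error <- function(params) sum((params - c(1.5, -0.5))^2)

# Hypothetical stand-ins for the Optim methods: each maps a parameter
# vector to a (possibly improved) parameter vector.
optim.methods <- list(
  none  = function(params) params,   # pass the parameters through unchanged
  lm    = function(params) optim(params, toy.error, method = "Nelder-Mead",
                                 control = list(reltol = 1e-2))$par,
  lasso = function(params) optim(params, toy.error, method = "Nelder-Mead",
                                 control = list(reltol = 1e-10))$par
)

# Apply the requested methods in order, feeding each result to the next.
run.optim.stage <- function(params, methods) {
  Reduce(function(p, m) optim.methods[[m]](p), methods, params)
}

start <- c(0, 0)
skipped <- run.optim.stage(start, "none")            # Optim stage skipped
tuned   <- run.optim.stage(start, c("lm", "lasso"))  # coarse, then fine
print(skipped)
print(tuned)
```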
After the model parameters have been fixed in the Optim stage,
the Fit stage is used to fix the linear coefficients
of the models.
Here, either 'fit.lm' or 'fit.lasso' can be used
to find the ``best'' coefficients, the main difference being that
the coefficients are penalized by an L1-norm in the latter.
Note that it is perfectly okay to use 'fit.lasso' for
the parameters optimized using 'lm'.
To cope with the possibly huge variation
in expression data measured by RNA-seq,
FIT provides a way to weight the regression penalty of each sample
differently, as in
sum_{s in samples} (weight_s) (error_s)^2.
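
As a small numerical illustration of the weighted error above, the following base R snippet evaluates the sum for four made-up samples and then checks that a weighted lm() fit reports the same quantity as its deviance. All numbers and variable names are invented for the example.

```r
# Made-up observations, predictions and per-sample weights.
observed  <- c(2.0, 3.5, 1.0, 4.2)
predicted <- c(1.8, 3.9, 1.1, 4.0)
weight    <- c(1.0, 0.5, 2.0, 1.0)

# sum_{s in samples} (weight_s) (error_s)^2
weighted.sse <- sum(weight * (observed - predicted)^2)
print(weighted.sse)

# lm() with a `weights` argument minimizes exactly this kind of sum;
# deviance() of the fit returns the minimized value.
x <- c(0.5, 1.5, 0.1, 2.0)
fit <- lm(observed ~ x, weights = weight)
same <- isTRUE(all.equal(sum(weight * residuals(fit)^2), deviance(fit)))
print(same)
```

A sample with a small weight contributes little to the error, so a noisy measurement can be down-weighted without being discarded.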
Prediction phase
For each gene, the trained model of the previous subsection
can be thought of as a black box that maps
the field conditions (weather data),
to which a plant containing the gene is exposed,
to its expected expression.
FIT provides a simple function, FIT::predict(), that does just this.
FIT::predict() takes as its arguments
a list of pretrained models
as well as actual or hypothetical plant sample attributes and weather data,
and returns the predicted values of gene expressions.
When a set of actually measured expressions is available,
the associated function FIT::prediction.errors()
can be used to check the validity of the predictions made by
the models.
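
The predict-then-check workflow can be mimicked with an ordinary linear model in base R. The sketch below does not call FIT::predict() or FIT::prediction.errors(), whose exact arguments and return values are not described here. It uses stats::predict() on an lm fit, and the data, the single `temperature` input, and the root-mean-square error summary are assumptions made for illustration.

```r
# Simulated training data from one "field".
set.seed(7)
train <- data.frame(temperature = runif(30, 10, 30))
train$expr.level <- 0.5 + 0.2 * train$temperature + rnorm(30, sd = 0.1)

# The trained model acts as a black box from weather to expected expression.
model <- lm(expr.level ~ temperature, data = train)

# Weather of another (hypothetical) field, and the predictions for it.
new.field <- data.frame(temperature = c(12, 18, 25))
predicted <- stats::predict(model, newdata = new.field)

# If measured expressions exist for the new field, compare them with the
# predictions; here the errors are summarized by their root mean square.
measured <- c(2.9, 4.1, 5.6)
errors <- predicted - measured
rmse <- sqrt(mean(errors^2))
print(predicted)
print(rmse)
```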