karlinMonteCarlo: Monte Carlo - Karlin [p-value]

Description

Estimates p-value of the local score based on a Monte Carlo estimation of Gumble parameters from simulations of smaller sequences with same distribution. Appropriate for great sequences with length > 10^3, for i.i.d and markovian sequence models.

Usage

karlinMonteCarlo(
  local_score,
  sequence_length,
  simulated_sequence_length,
  FUN,
  ...,
  numSim = 1000,
  plot = TRUE,
  keepSimu = FALSE
)
karlinMonteCarlo_double(
  local_score,
  sequence_length,
  simulated_sequence_length,
  FUN,
  ...,
  numSim = 1000,
  plot = TRUE,
  keepSimu = FALSE
)

Value

If keepSimu is FALSE, returns a list containing:

`p_value`	Probability to obtain a local score with a value greater or equal to the parameter `local_score`

`K*`	Parameter \(K^*\) defined in Karlin and Dembo (1990)

`lambda`	Parameter \(\lambda\) defined in Karlin and Dembo (1990)

If keepSimu is TRUE, returns a list containing:

`p_value`	Probability to obtain a local score with a value greater or equal to the parameter `local_score`

`K*`	Parameter \(K^*\) defined in Karlin and Dembo (1990)

`lambda`	Parameter \(\lambda\) defined in Karlin and Dembo (1990)
`localScores`	Vector of size `numSim` containing the simulated local scores for sequence size of `simulated_sequence_length`

Arguments

local_score: local score observed in a sequence.
sequence_length: length of the sequence
simulated_sequence_length: length of simulated sequences produced by
FUN: function to simulate similar sequences with.
...: parameters for FUN
numSim: number of sequences to create for estimation FUN
plot: boolean value if to display plots for cumulated function, density and linearization of cumulative density function
keepSimu: Boolean, default to FALSE. If TRUE, the simulated local scores are returned as the localScores element of the output list.

Details

The length of the simulated sequences is an argument specific to the function provided for simulation. Thus, it has to be provided also in the parameter simulated_sequence_length in the arguments of the "Monte Carlo - Karlin" function. It is a crucial detail as it influences precision and computation time of the result. Note that to get an appropriate estimation, the given average score must be non-positive. Be careful that the parameters names of the function FUN should differ from those of karlinMonteCarlo function.
Methods - Parameters \(K^\star\) and \(\lambda\) of Karlin and Dembo (1990) are estimated by a linear regression on the log(-log(cumulative distribution function of the local scores)) on shorter sequences (size simulated_sequence_length). The formula used are : \(\hat{\lambda} = -\hat{b}\) and \(\hat{K^\star} = exp(\hat{a})/simulated\_sequence\_length\) where \(\hat{a}\) is the intercept of the regression and \(\hat{b}\) is the slope of the regression. Then p-value is given by \(p = exp(-K^\star * exp(-\lambda*x ))\) where \(x = local\_score - \log(sequence\_length)/\lambda\).
The density plot produced by plot == TRUE depends on the type of the simulated local scores: if they are integer, a barplot of relative frequency is used, else plot(density(...)) is used.
This function calls localScoreC which type of the output depends on the type of the input. To be efficient, be aware to use a simulating function FUN that return a vector of adequate type ("integer" or "numeric"). Warning: in R, typeof(c(1,3,4,10)) == "double". You can set a type of a vector with mode() or as.integer() functions for example.
karlinMonteCarlo_double() is deprecated. At this point, it is just a call to karlinMonteCarlo() function.

Examples

Run this code

# \donttest{
mySeq <- sample(-7:6, replace = TRUE, size = 100000)
#MonteCarlo taking random sample from the input sequence itself
karlinMonteCarlo(local_score = 160, sequence_length = 100000,
               simulated_sequence_length = 1000,
               FUN = function(x, sim_length) {
                        return(sample(x = x,
                               size = sim_length,
                               replace = TRUE))
                     },
               x = mySeq,
               sim_length = 1000,
               numSim = 1000)
# }
# \donttest{
#Markovian example (longer computation)
MyTransMat_reels <-  matrix(c(0.3, 0.1, 0.1, 0.1, 0.4,
                              0.2, 0.2, 0.1, 0.2, 0.3,
                              0.3, 0.4, 0.1, 0.1, 0.1,
                              0.3, 0.3, 0.1, 0.2, 0.1,
                              0.2, 0.1, 0.2, 0.4, 0.1),
                              ncol = 5, byrow=TRUE)
karlinMonteCarlo(local_score = 18.5, sequence_length = 200000,
                 simulated_sequence_length = 1500,
                 FUN = transmatrix2sequence,
                 matrix = MyTransMat_reels,
                 score =c(-1.5,-0.5,0,0.5,1), length = 1500,
                 plot=TRUE, numSim = 1500)
# }