GenoPop (version 1.0.0)

GenoPop_Impute: GenoPop-Impute

Description

Imputes missing genomic data in batches using the missForest algorithm (Stekhoven & Bühlmann, 2012). The function reads a VCF file, divides it into batches of a fixed number of SNPs, applies missForest to each batch, and writes the results to a new VCF file, which is returned bgzipped and tabix-indexed. The choice of batch size is critical for balancing accuracy and computational demand. We found that a batch size of 500 SNPs is the most accurate for recombination rates typical of mammals. For higher average recombination rates (> 5 cM/Mb), we recommend a batch size of 100 SNPs.
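To make the batching idea concrete, here is a minimal base R sketch of how SNP indices could be grouped into fixed-size batches, with the batch size picked from the recommendation above. Both `suggest_batch_size()` and `split_into_batches()` are hypothetical helpers written for illustration. They are not part of GenoPop and do not necessarily reflect how the package divides the VCF internally.

```r
# Hypothetical helper (not part of GenoPop): choose a batch size from the
# average recombination rate, following the recommendation in the Description.
suggest_batch_size <- function(recomb_rate_cM_per_Mb) {
  if (recomb_rate_cM_per_Mb > 5) 100L else 500L
}

# Hypothetical helper (not part of GenoPop): assign SNP indices 1..n_snps
# to consecutive batches of at most `batch_size` SNPs each.
split_into_batches <- function(n_snps, batch_size) {
  idx <- seq_len(n_snps)
  unname(split(idx, ceiling(idx / batch_size)))
}

# 1200 SNPs at a mammal-like rate of 1.2 cM/Mb: two full batches and a remainder
batches <- split_into_batches(n_snps = 1200, batch_size = suggest_batch_size(1.2))
lengths(batches)
# 500 500 200
```

Each batch would then be imputed on its own, so smaller batches reduce the cost per missForest run but give the algorithm fewer neighbouring SNPs to learn from.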

Usage

GenoPop_Impute(
  vcf_path,
  output_vcf,
  batch_size = 1000,
  maxiter = 10,
  ntree = 100,
  threads = 1,
  write_log = FALSE,
  logfile = "log.txt"
)

Value

Path to the output VCF file with imputed data.

Arguments

vcf_path

Path to the input VCF file.

output_vcf

Path for the output VCF file with imputed data.

batch_size

Number of SNPs to process per batch (default: 1000; see Description for recommended values).

maxiter

Maximum number of improvement iterations of the missForest algorithm (default: 10).

ntree

Number of decision trees in the random forest (default: 100).

threads

Number of threads used for computation (default: 1).

write_log

If TRUE, writes a log file of the process (default: FALSE; advised for large datasets).

logfile

Path to the log file, used only if write_log is TRUE (default: "log.txt").

Examples

# Locate the example VCF and its tabix index shipped with the package
vcf_file <- system.file("tests/testthat/sim_miss.vcf.gz", package = "GenoPop")
index_file <- system.file("tests/testthat/sim_miss.vcf.gz.tbi", package = "GenoPop")
# Write the imputed genotypes to a temporary file
output_file <- tempfile(fileext = ".vcf")
GenoPop_Impute(vcf_file, output_vcf = output_file, batch_size = 500)
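
A variant of the call that also uses the threads, write_log and logfile arguments might look like the sketch below. The thread count and the temporary file locations are arbitrary choices for illustration. The call is guarded so that it only runs when the example VCF can be found in an installed copy of GenoPop.

```r
# Illustrative paths; any writable locations would do
logged_output <- tempfile(fileext = ".vcf")
log_file <- tempfile(fileext = ".txt")

# system.file() returns "" when the package or file is not available
example_vcf <- system.file("tests/testthat/sim_miss.vcf.gz", package = "GenoPop")

if (nzchar(example_vcf)) {
  imputed_path <- GenoPop::GenoPop_Impute(
    example_vcf,
    output_vcf = logged_output,
    batch_size = 500,
    threads = 2,        # assumed to be available on this machine
    write_log = TRUE,   # advised for large datasets
    logfile = log_file
  )
  print(imputed_path)   # path to the output VCF with imputed data
}
```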
