Learn R Programming

GenoPop (version 1.0.0)

Dxy: Dxy

Description

This function calculates the average number of nucleotide differences per site (Dxy) between two populations from a VCF file (Nei & Li, 1979 (https://doi.org/10.1073/pnas.76.10.5269)). Handling missing alleles at one site is equivalent to Korunes & Samuk, 2021 ( https://doi.org/10.1111/1755-0998.13326). The function calculates the number of monomorphic sites using the sequence length and the number of variants in the VCF file. This assumes, that all sites not present in the VCF file are invariant sites, which will underestimate the metric, because of commonly done (and necessary) variant filtering. However, otherwise this calculation would only work with VCF files that include all monomorphic sites, which is quite unpractical for common use cases and will increase computational demands significantly. If you happen to know the number of filtered our sites vs the number of monomorphic sites, please use the number of monomorphic + the number of polymorphic (number of variants in your VCF) sites as the sequence length to get the most accurate estimation of the metric. (This does not work for the window mode of this function, which assumes the sequence length to be the window size.) For batch processing, it uses process_vcf_in_batches. For windowed analysis, it uses a similar approach tailored to process specific genomic windows (process_vcf_in_windows).

Usage

Dxy(
  vcf_path,
  pop1_individuals,
  pop2_individuals,
  seq_length,
  batch_size = 10000,
  threads = 1,
  write_log = FALSE,
  logfile = "log.txt",
  window_size = NULL,
  skip_size = NULL
)

Value

In batch mode (no window_size or skip_size provided): The average number of nucleotide substitutions per site between the individuals of two populations (Dxy). In window mode (window_size and skip_size provided): A data frame with columns 'Chromosome', 'Start', 'End', and 'Dxy', representing the average nucleotide differences within each window.

Arguments

vcf_path

Path to the VCF file.

pop1_individuals

Vector of individual names belonging to the first population.

pop2_individuals

Vector of individual names belonging to the second population.

seq_length

Length of the sequence in number of bases, including monomorphic sites (used in batch mode only).

batch_size

The number of variants to be processed in each batch (used in batch mode only, default of 10,000 should be suitable for most use cases).

threads

Number of threads to use for parallel processing.

write_log

Logical, indicating whether to write progress logs.

logfile

Path to the log file where progress will be logged.

window_size

Size of the window for windowed analysis in base pairs (optional). When specified, skip_size must also be provided.

skip_size

Number of base pairs to skip between windows (optional). Used in conjunction with window_size for windowed analysis.

Examples

Run this code
vcf_file <- system.file("tests/testthat/sim.vcf.gz", package = "GenoPop")
index_file <- system.file("tests/testthat/sim.vcf.gz.tbi", package = "GenoPop")
pop1_individuals <- c("tsk_0", "tsk_1", "tsk_2")
pop2_individuals <- c("tsk_3", "tsk_4", "tsk_5")
total_sequence_length <- 999299  # Total length of the sequence
# Batch mode example
dxy_value <- Dxy(vcf_file, pop1_individuals, pop2_individuals, total_sequence_length)
# Window mode example
dxy_windows <- Dxy(vcf_file, pop1_individuals, pop2_individuals, seq_length = total_sequence_length,
                   window_size = 100000, skip_size = 50000)

Run the code above in your browser using DataLab