Read a minimap/minimap2 .paf file including optional tagged extra fields. The optional fields will be parsed into a tidy format, one column per tag.
read_paf(
file,
max_tags = 20,
col_names = def_names("paf"),
col_types = def_types("paf"),
...
)
tibble
Either a path to a file, a connection, or literal data (either a single string or a raw vector).
Files ending in .gz
, .bz2
, .xz
, or .zip
will
be automatically uncompressed. Files starting with http://
,
https://
, ftp://
, or ftps://
will be automatically
downloaded. Remote gz files can also be automatically downloaded and
decompressed.
Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with I()
, be a string
containing at least one new line, or be a vector containing at least one
string with a new line.
Using a value of clipboard()
will read from the system clipboard.
maximum number of optional fields to include
column names to use. Defaults to def_names("gff3")
(see def_names
).
column types to use. Defaults to def_types("gff3")
(see def_types
).
additional parameters, passed to read_tsv
Because readr::read_tsv
expects a fixed number of columns, but in .paf the
number of optional fields can differ among records, read_paf
tries to read
at least as many columns as the longest record has (max_tags
). The
resulting warnings for each record with fewer fields of the form "32 columns
expected, only 22 seen" should thus be ignored.
From the minimap2 manual
+----+--------+---------------------------------------------------------+ |Col | Type | Description | +----+--------+---------------------------------------------------------+ | 1 | string | Query sequence name | | 2 | int | Query sequence length | | 3 | int | Query start coordinate (0-based) | | 4 | int | Query end coordinate (0-based) | | 5 | char | ‘+’ if query/target on the same strand; ‘-’ if opposite | | 6 | string | Target sequence name | | 7 | int | Target sequence length | | 8 | int | Target start coordinate on the original strand | | 9 | int | Target end coordinate on the original strand | | 10 | int | Number of matching bases in the mapping | | 11 | int | Number bases, including gaps, in the mapping | | 12 | int | Mapping quality (0-255 with 255 for missing) | +----+--------+---------------------------------------------------------+
+----+------+-------------------------------------------------------+ |Tag | Type | Description | +----+------+-------------------------------------------------------+ | tp | A | Type of aln: P/primary, S/secondary and I,i/inversion | | cm | i | Number of minimizers on the chain | | s1 | i | Chaining score | | s2 | i | Chaining score of the best secondary chain | | NM | i | Total number of mismatches and gaps in the alignment | | MD | Z | To generate the ref sequence in the alignment | | AS | i | DP alignment score | | ms | i | DP score of the max scoring segment in the alignment | | nn | i | Number of ambiguous bases in the alignment | | ts | A | Transcript strand (splice mode only) | | cg | Z | CIGAR string (only in PAF) | | cs | Z | Difference string | | dv | f | Approximate per-base sequence divergence | +----+------+-------------------------------------------------------+
From https://samtools.github.io/hts-specs/SAMtags.pdf type may be one of A (character), B (general array), f (real number), H (hexadecimal array), i (integer), or Z (string).