is.fasta(file)
grep.file(file, pattern = "", y = NULL, ignore.case = TRUE,
startswith = ">", lines = NULL, grep = "grep")
read.fasta(file, i = NULL, ret = "count", lines = NULL,
ihead = NULL, pnff = FALSE)
splitline(line, length)
trimfas(file, start, stop)
grep
command.
grep.file returns a numeric vector. read.fasta returns a list of sequences or lines (for ret equal to seq or fas, respectively), or a data frame with amino acid compositions of proteins (for ret equal to count) with columns corresponding to those in thermo$protein.
is.fasta checks if a file is in FASTA format. A very simple test is performed: if either of the first two lines of the file starts with >, then the function returns TRUE, otherwise it returns FALSE.
grep.file is used to search for entries in a FASTA file. It returns the line numbers of the matching FASTA headers. It takes a search term in pattern and optionally a term to exclude in y. The ignore.case option is passed to grep, which does the work of finding lines that match. Only lines that start with the expression in startswith are searched; the default setting reflects the format of the header line for each sequence in a FASTA file.
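The format test and the header search described above can be illustrated with a minimal base R sketch. The names looks_like_fasta and find_headers are hypothetical and are not part of CHNOSZ; the sketch works on lines already held in memory and treats startswith as a literal prefix, both of which are assumptions.

```r
# Hypothetical helpers for illustration only; not the CHNOSZ implementation.

# TRUE if either of the first two lines starts with ">"
looks_like_fasta <- function(lines) {
  any(startsWith(head(lines, 2), ">"))
}

# Line numbers of header lines that match `pattern` and, if `y` is given,
# do not match `y`. Assumption: `startswith` is a literal prefix.
find_headers <- function(lines, pattern = "", y = NULL,
                         ignore.case = TRUE, startswith = ">") {
  ihead <- which(substr(lines, 1, nchar(startswith)) == startswith)
  hit <- grepl(pattern, lines[ihead], ignore.case = ignore.case)
  if (!is.null(y)) {
    hit <- hit & !grepl(y, lines[ihead], ignore.case = ignore.case)
  }
  ihead[hit]
}

# Made-up FASTA content
fasta_lines <- c(">sp|P1|first protein", "MKTAYIAK",
                 ">sp|P2|second protein", "GAVLI")

looks_like_fasta(fasta_lines)                      # TRUE
find_headers(fasta_lines, "SECOND")                # 3
find_headers(fasta_lines, "protein", y = "first")  # 3
```

Restricting the search to header lines before matching keeps sequence lines that happen to contain the search term out of the result.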
If y is NULL and a supported operating system is identified, the operating system's grep function (or another command specified in the grep argument) is applied directly to the file instead of R's grep. This avoids having to read the file into R using readLines. If the lines from the file were obtained in a preceding operation, they can be supplied to this function in the lines argument.
read.fasta is used to retrieve entries from a FASTA file. The line numbers for the headers of the desired sequences are passed to the function in i (they can be generated using grep.file). The function returns various formats depending on the value of ret; the default count returns a data frame of amino acid counts (the data frame can be given to add.protein in order to add the proteins to thermo$protein), seq returns a list of sequences, and fas returns a list of lines extracted from the FASTA file, including the headers (this can be used e.g. to generate a new FASTA file with only the selected sequences). Similarly to grep.file, this function utilizes the OS's grep on supported operating systems in order to identify the header lines as well as cat to read the file, otherwise readLines and R's substr are used to read the file and locate the header lines. lines, if it is given, bypasses the reading of the file and also overrides the use of the OS's tools. If the line numbers of the header lines were previously determined, they can be supplied in ihead.
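To make the three return formats concrete, here is a rough base R sketch of the same idea applied to lines already in memory. The function read_entries and the one-letter column names are assumptions made for illustration; the real read.fasta returns columns matching thermo$protein, which this sketch does not attempt to reproduce.

```r
# Hypothetical sketch; not the CHNOSZ implementation.
aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
        "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

read_entries <- function(lines, i, ret = "count", ihead = NULL) {
  # Locate all header lines unless they were determined earlier
  if (is.null(ihead)) ihead <- which(startsWith(lines, ">"))
  # Each entry runs from its header to the line before the next header
  iend <- c(ihead[-1] - 1, length(lines))
  k <- match(i, ihead)
  fas <- lapply(k, function(j) lines[ihead[j]:iend[j]])
  if (ret == "fas") return(fas)
  # Drop the header and join the sequence lines
  seqs <- lapply(fas, function(x) paste(x[-1], collapse = ""))
  if (ret == "seq") return(seqs)
  # Count each amino acid letter, one row per entry
  counts <- t(vapply(seqs, function(s) {
    as.integer(table(factor(strsplit(s, "")[[1]], levels = aa)))
  }, integer(length(aa))))
  colnames(counts) <- aa
  as.data.frame(counts)
}

# Made-up FASTA content
fasta_lines <- c(">p1", "MKT", "AYI", ">p2", "GAVLI")
read_entries(fasta_lines, 1, ret = "seq")    # list("MKTAYI")
read_entries(fasta_lines, 4, ret = "fas")    # list(c(">p2", "GAVLI"))
counts <- read_entries(fasta_lines, c(1, 4)) # 2 rows, 20 columns
```

Passing ihead skips the header scan, which mirrors the purpose of the ihead argument described above.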
splitline takes a single character object (the line) and splits it into multiple lines of the given length (the last line can be shorter than this). It returns a character object that contains the lines. This function is utilized by trimfas, which extracts the specified positions from a (usually) aligned FASTA file. The length of the lines used by trimfas is equal to the length of the first sequence line in the given file.
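The splitting and trimming steps can be sketched in base R as follows. The names split_into_lines and trim_sequence are hypothetical, the result is returned as a character vector with one element per line (an assumption about the return shape), and the input line is assumed to be non-empty.

```r
# Hypothetical sketch; not the CHNOSZ implementation.

# Split one string into pieces of `length` characters;
# the last piece may be shorter.
split_into_lines <- function(line, length) {
  starts <- seq(1, nchar(line), by = length)
  substring(line, starts, pmin(starts + length - 1, nchar(line)))
}

# Keep positions start..stop of one sequence and re-wrap the result
# to the width of the first sequence line.
trim_sequence <- function(seq_lines, start, stop,
                          width = nchar(seq_lines[1])) {
  trimmed <- substr(paste(seq_lines, collapse = ""), start, stop)
  split_into_lines(trimmed, width)
}

split_into_lines("abcdefghijklmnopqrstuvwxyz", 10)
# "abcdefghij" "klmnopqrst" "uvwxyz"

trim_sequence(c("MKTAYIAKQR", "QISFVKSHFS", "RQLE"), 1, 15)
# "MKTAYIAKQR" "QISFV"
```

Joining the sequence lines before trimming means the start and stop positions refer to the whole sequence rather than to individual lines.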
grep.file and read.fasta, consider using the iprotein argument of affinity to speed things up; for an example see the help page for revisit.
# basic use of splitline
splitline("abcdefghijklmnopqrstuvwxyz", 10)
# get the first ten positions of each of the sequences
# in a FASTA file
f <- system.file("extdata/fasta/HTCC1062.faa.xz", package = "CHNOSZ")
fnew <- trimfas(f, 1, 10)