entrez_summary: Get summaries of objects in NCBI datasets from a unique ID

Description

The NCBI offer two distinct formats for summary documents. Version 1.0 is a relatively limited summary of a database record based on a shared Document Type Definition. Version 1.0 summaries are only available as XML and are not available for some newer databases Version 2.0 summaries generally contain more information about a given record, but each database has its own distinct format. 2.0 summaries are available for records in all databases and as JSON and XML files. As of version 0.4, rentrez fetches version 2.0 summaries by default and uses JSON as the exchange format (as JSON object can be more easily converted into native R types). Existing scripts which relied on the structure and naming of the "Version 1.0" summary files can be updated by setting the new version argument to "1.0".

Usage

entrez_summary(
  db,
  id = NULL,
  web_history = NULL,
  version = c("2.0", "1.0"),
  always_return_list = FALSE,
  retmode = NULL,
  config = NULL,
  ...
)

Arguments

character Name of the database to search for

vector with unique ID(s) for records in database db. In the case of sequence databases these IDs can take form of an NCBI accession followed by a version number (eg AF123456.1 or AF123456.2)

web_history

A web_history object

version

either 1.0 or 2.0 see above for description

always_return_list

logical, return a list of esummary objects even when only one ID is provided (see description for a note about this option)

retmode

either "xml" or "json". By default, xml will be used for version 1.0 records, json for version 2.0.

config

vector configuration options passed to httr::GET

…

character Additional terms to add to the request, see NCBI documentation linked to in references for a complete list

Value

A list of esummary records (if multiple IDs are passed and always_return_list if FALSE) or a single record.

file XMLInternalDocument xml file containing the entire record returned by the NCBI.

Details

By default, entrez_summary returns a single record when only one ID is passed and a list of such records when multiple IDs are passed. This can lead to unexpected behaviour when the results of a variable number of IDs (perhaps the result of entrez_search) are processed with an apply family function or in a for-loop. If you use this function as part of a function or script that generates a variably-sized vector of IDs setting always_return_list to TRUE will avoid these problems. The function extract_from_esummary is provided for the specific case of extracting named elements from a list of esummary objects, and is designed to work on single objects as well as lists.

References

https://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_ESummary_

Examples

Run this code

# NOT RUN {
 pop_ids = c("307082412", "307075396", "307075338", "307075274")
 pop_summ <- entrez_summary(db="popset", id=pop_ids)
 extract_from_esummary(pop_summ, "title")
 
 # clinvar example
 res <- entrez_search(db = "clinvar", term = "BRCA1", retmax=10)
 cv <- entrez_summary(db="clinvar", id=res$ids)
 cv
 extract_from_esummary(cv, "title", simplify=FALSE)
 extract_from_esummary(cv, "trait_set")[1:2] 
 extract_from_esummary(cv, "gene_sort") 
# }

Run the code above in your browser using DataLab