ngramr (version 1.7.4)

ngram: Get n-gram frequencies

Description

ngram downloads data from the Google Ngram Viewer website and returns it in a tibble.

Usage

ngram(
  phrases,
  corpus = "eng_2019",
  year_start = 1800,
  year_end = 2020,
  smoothing = 3,
  case_ins = FALSE,
  aggregate = FALSE,
  count = FALSE,
  drop_corpus = FALSE,
  drop_parent = FALSE,
  drop_all = FALSE,
  type = FALSE
)

Arguments

phrases

vector of phrases, with a maximum of 12 items

corpus

Google corpus to search (see Details for possible values)

year_start

start year, default is 1800. Data available back to 1500.

year_end

end year, default is 2020

smoothing

smoothing parameter, default is 3

case_ins

Logical indicating whether to force a case insensitive search. Default is FALSE.

aggregate

Sum up the frequencies for ngrams associated with wildcard or case insensitive searches. Default is FALSE.

count

Logical indicating whether to return the absolute number of matches rather than the relative frequency. Default is FALSE.

drop_corpus

When a corpus is specified directly with the ngram (e.g. dog:eng_fiction_2012), should the corpus be dropped from the phrase column of the results. Default is FALSE (the corpus is retained).

drop_parent

Drop the parent phrase associated with a wildcard or case-insensitive search. Default is FALSE.

drop_all

Delete the suffix "(All)" from aggregated case-insensitive searches. Default is FALSE.

type

Include the Google return type (e.g. NGRAM, NGRAM_COLLECTION, EXPANSION) in the result set. Default is FALSE.
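The interplay of the wildcard and case-insensitivity arguments is easier to see in a short sketch (illustrative only; assumes the ngramr package is installed and an internet connection is available):

```r
library(ngramr)

# Case-insensitive search: Google expands "mouse" into variants such as
# "Mouse" and "MOUSE". aggregate = TRUE sums those variants into a single
# "(All)" series; drop_parent and drop_all then tidy the phrase labels.
ng <- ngram("mouse",
            case_ins    = TRUE,
            aggregate   = TRUE,
            drop_parent = TRUE,
            drop_all    = TRUE)
head(ng)
```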

Value

ngram returns an object of class "ngram", which is a tidyverse tibble enriched with attributes reflecting some of the parameters used in the Ngram Viewer query.
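Because the result is an ordinary tibble carrying extra attributes, it can be inspected with the usual tools (a sketch assuming the package is installed and the query succeeds; the exact attribute names may vary between versions):

```r
library(ngramr)

ng <- ngram(c("hacker", "programmer"), year_start = 1950)
class(ng)            # includes "ngram" alongside the usual tibble classes
str(attributes(ng))  # query parameters stored alongside the data
```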

Details

Google has generated three datasets drawn from digitised books in the Google Books collection: the first in July 2009, the second in July 2012 and the third in 2019. Google is expected to update these datasets as book scanning continues.

This function provides the annual frequency of words or phrases, known as n-grams, in a sub-collection or "corpus" taken from the Google Books collection. The search across the corpus is case-sensitive.

Note that the tag option is no longer available. Tags should be specified directly in the ngram string (see examples).

Below is a list of available corpora.

Corpus             Corpus Name
eng_us_2019        American English 2019
eng_us_2012        American English 2012
eng_us_2009        American English 2009
eng_gb_2019        British English 2019
eng_gb_2012        British English 2012
eng_gb_2009        British English 2009
chi_sim_2019       Chinese 2019
chi_sim_2012       Chinese 2012
chi_sim_2009       Chinese 2009
eng_2019           English 2019
eng_2012           English 2012
eng_2009           English 2009
eng_fiction_2019   English Fiction 2019
eng_fiction_2012   English Fiction 2012
eng_fiction_2009   English Fiction 2009
eng_1m_2009        Google One Million
fre_2019           French 2019
fre_2012           French 2012
fre_2009           French 2009
ger_2019           German 2019
ger_2012           German 2012
ger_2009           German 2009
heb_2019           Hebrew 2019
heb_2012           Hebrew 2012
heb_2009           Hebrew 2009
spa_2019           Spanish 2019
spa_2012           Spanish 2012
spa_2009           Spanish 2009
rus_2019           Russian 2019
rus_2012           Russian 2012
rus_2009           Russian 2009
ita_2019           Italian 2019
ita_2012           Italian 2012

The Google Million is a sub-collection of Google Books. All are in English with dates ranging from 1500 to 2008. No more than about 6,000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980).
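To restrict a query to one of the corpora above, pass its identifier via the corpus argument, or attach it directly to a phrase using the phrase:corpus syntax (an illustrative sketch; requires an internet connection):

```r
library(ngramr)

# Same phrase, two corpora, via the corpus argument
ng_gb <- ngram("biscuit", corpus = "eng_gb_2019")
ng_us <- ngram("biscuit", corpus = "eng_us_2019")

# Or specify the corpus inline in the phrase itself
ng_mix <- ngram(c("dog:eng_fiction_2019", "dog:eng_gb_2019"))
```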

See http://books.google.com/ngrams/info for the full Ngram syntax.

Examples

ngram(c("mouse", "rat"), year_start = 1950)
ngram(c("blue_ADJ", "red_ADJ"))
ngram(c("_START_ President Roosevelt", "_START_ President Truman"), year_start = 1920)
