ngramr (version 1.7.4)

ngram: Get n-gram frequencies

Description

ngram downloads data from the Google Ngram Viewer website and returns it in a tibble.

Usage

ngram(
  phrases,
  corpus = "eng_2019",
  year_start = 1800,
  year_end = 2020,
  smoothing = 3,
  case_ins = FALSE,
  aggregate = FALSE,
  count = FALSE,
  drop_corpus = FALSE,
  drop_parent = FALSE,
  drop_all = FALSE,
  type = FALSE
)

Arguments

phrases

vector of phrases, with a maximum of 12 items

corpus

Google corpus to search (see Details for possible values)

year_start

start year, default is 1800. Data available back to 1500.

year_end

end year, default is 2020

smoothing

smoothing parameter, default is 3

case_ins

Logical indicating whether to force a case insensitive search. Default is FALSE.

aggregate

Sum up the frequencies for ngrams associated with wildcard or case insensitive searches. Default is FALSE.

count

Logical indicating whether to return the absolute number of matches rather than the relative frequency. Default is FALSE.

drop_corpus

When a corpus is specified directly with the ngram (e.g. dog:eng_fiction_2012), should the corpus be dropped from the phrase column of the results. Default is FALSE (the corpus is retained).

drop_parent

Drop the parent phrase associated with a wildcard or case-insensitive search. Default is FALSE.

drop_all

Delete the suffix "(All)" from aggregated case-insensitive searches. Default is FALSE.

type

Include the Google return type (e.g. NGRAM, NGRAM_COLLECTION, EXPANSION) in the result set. Default is FALSE.
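The interplay of the wildcard and case-insensitivity arguments is easier to see in a short sketch (illustrative only; assumes the ngramr package is installed and an internet connection is available):

```r
library(ngramr)

# Case-insensitive search: Google expands "mouse" into variants such as
# "Mouse" and "MOUSE". aggregate = TRUE sums those variants into a single
# "(All)" series; drop_parent and drop_all then tidy the phrase labels.
ng <- ngram("mouse",
            case_ins    = TRUE,
            aggregate   = TRUE,
            drop_parent = TRUE,
            drop_all    = TRUE)
head(ng)
```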

Value

ngram returns an object of class "ngram", which is a tidyverse tibble enriched with attributes reflecting some of the parameters used in the Ngram Viewer query.
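Because the result is an ordinary tibble carrying extra attributes, it can be inspected with the usual tools (a sketch assuming the package is installed and the query succeeds; the exact attribute names may vary between versions):

```r
library(ngramr)

ng <- ngram(c("hacker", "programmer"), year_start = 1950)
class(ng)            # includes "ngram" alongside the usual tibble classes
str(attributes(ng))  # query parameters stored alongside the data
```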

Details

Google has generated three datasets drawn from digitised books in the Google Books collection: the first in July 2009, the second in July 2012 and the third in 2019. Google is expected to update these datasets as book scanning continues.

This function provides the annual frequency of words or phrases, known as n-grams, in a sub-collection or "corpus" taken from the Google Books collection. The search across the corpus is case-sensitive.

Note that the tag option is no longer available. Tags should be specified directly in the ngram string (see examples).

Below is a list of available corpora.

Corpus             Corpus Name
eng_us_2019        American English 2019
eng_us_2012        American English 2012
eng_us_2009        American English 2009
eng_gb_2019        British English 2019
eng_gb_2012        British English 2012
eng_gb_2009        British English 2009
chi_sim_2019       Chinese 2019
chi_sim_2012       Chinese 2012
chi_sim_2009       Chinese 2009
eng_2019           English 2019
eng_2012           English 2012
eng_2009           English 2009
eng_fiction_2019   English Fiction 2019
eng_fiction_2012   English Fiction 2012
eng_fiction_2009   English Fiction 2009
eng_1m_2009        Google One Million
fre_2019           French 2019
fre_2012           French 2012
fre_2009           French 2009
ger_2019           German 2019
ger_2012           German 2012
ger_2009           German 2009
heb_2019           Hebrew 2019
heb_2012           Hebrew 2012
heb_2009           Hebrew 2009
spa_2019           Spanish 2019
spa_2012           Spanish 2012
spa_2009           Spanish 2009
rus_2019           Russian 2019
rus_2012           Russian 2012
rus_2009           Russian 2009
ita_2019           Italian 2019
ita_2012           Italian 2012

The Google Million is a sub-collection of Google Books. All are in English with dates ranging from 1500 to 2008. No more than about 6,000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980).
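To restrict a query to one of the corpora above, pass its identifier via the corpus argument, or attach it directly to a phrase using the phrase:corpus syntax (an illustrative sketch; requires an internet connection):

```r
library(ngramr)

# Same phrase, two corpora, via the corpus argument
ng_gb <- ngram("biscuit", corpus = "eng_gb_2019")
ng_us <- ngram("biscuit", corpus = "eng_us_2019")

# Or specify the corpus inline in the phrase itself
ng_mix <- ngram(c("dog:eng_fiction_2019", "dog:eng_gb_2019"))
```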

See http://books.google.com/ngrams/info for the full Ngram syntax.

Examples

ngram(c("mouse", "rat"), year_start = 1950)
ngram(c("blue_ADJ", "red_ADJ"))
ngram(c("_START_ President Roosevelt", "_START_ President Truman"), year_start = 1920)
