ngramr (version 1.9.3)

ngram: Get n-gram frequencies

Description

ngram downloads data from the Google Ngram Viewer website and returns it in a tibble.

Usage

ngram(
  phrases,
  corpus = "en-2019",
  year_start = 1800,
  year_end = 2020,
  smoothing = 3,
  case_ins = FALSE,
  aggregate = FALSE,
  count = FALSE,
  drop_corpus = FALSE,
  drop_parent = FALSE,
  drop_all = FALSE,
  type = FALSE
)

Value

ngram returns an object of class "ngram", which is a tidyverse tibble enriched with attributes reflecting some of the parameters used in the Ngram Viewer query.
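As an illustrative sketch (this assumes the ngramr package is installed and the Google Ngram Viewer site is reachable; the printed column names reflect the package's usual output and are not guaranteed here):

```r
library(ngramr)

ng <- ngram(c("hacker", "programmer"), year_start = 1950)

# ngram() returns NULL if the download fails, so guard before use
if (!is.null(ng)) {
  print(class(ng))              # "ngram" plus the usual tibble classes
  print(head(ng))               # one row per phrase-year combination
  print(names(attributes(ng)))  # query parameters stored as attributes
}
```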

Arguments

phrases

vector of phrases, with a maximum of 12 items

corpus

Google corpus to search (see Details for possible values)

year_start

start year, default is 1800. Data available back to 1500.

year_end

end year, default is 2020

smoothing

smoothing parameter, default is 3

case_ins

Logical indicating whether to force a case insensitive search. Default is FALSE.

aggregate

Sum up the frequencies for ngrams associated with wildcard or case insensitive searches. Default is FALSE.

count

Logical indicating whether to return approximate counts rather than relative frequencies. Default is FALSE.

drop_corpus

When a corpus is specified directly within the ngram (e.g. dog:eng_fiction_2012), specifies whether the corpus suffix should be stripped from the phrase column of the results. Note that this method requires the old corpus codes (eng_fiction_2012, not en-fiction-2012). Default is FALSE.

drop_parent

Drop the parent phrase associated with a wildcard or case-insensitive search. Default is FALSE.

drop_all

Delete the suffix "(All)" from aggregated case-insensitive searches. Default is FALSE.

type

Include the Google return type (e.g. NGRAM, NGRAM_COLLECTION, EXPANSION) in the result set. Default is FALSE.
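To illustrate how several of these flags interact, here is a hedged sketch (the "(All)" phrase suffix follows the Ngram Viewer's conventions; the exact output depends on Google's data and requires network access):

```r
library(ngramr)

# A case-insensitive search expands "internet" into its capitalisation
# variants; aggregate = TRUE sums those expansions into one series,
# and drop_parent = TRUE removes the original parent phrase.
ng <- ngram("internet",
            case_ins    = TRUE,
            aggregate   = TRUE,
            drop_parent = TRUE)

if (!is.null(ng)) {
  print(unique(ng$Phrase))  # aggregated series, e.g. suffixed "(All)"
}
```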

Details

Google has generated three datasets drawn from digitised books in the Google Books collection: the first in July 2009, the second in July 2012 and the third in 2019. Google is expected to update these datasets as book scanning continues.

This function provides the annual frequency of words or phrases, known as n-grams, in a sub-collection or "corpus" taken from the Google Books collection. The search across the corpus is case-sensitive.

If the function is unable to retrieve data from the Google Ngram Viewer site (either because of access issues or if the format of Google's site has changed) a NULL result is returned and messages are printed to the console but no errors or warnings are raised (this is to align with CRAN package policies).
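Because failures surface as a NULL return with console messages rather than an error or warning, callers should check the result before using it. A minimal defensive pattern (the Frequency column name is assumed from the package's usual output):

```r
library(ngramr)

ng <- ngram(c("mouse", "rat"), year_start = 1950)

if (is.null(ng)) {
  message("Ngram Viewer query failed; nothing to plot.")
} else {
  summary(ng$Frequency)  # safe to work with the tibble here
}
```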

Below is a list of available corpora. Note that the data for the 2012 corpora only extend to 2009.

Corpus             Corpus Name
en-US-2019         American English 2019
en-US-2012         American English 2012
en-US-2009         American English 2009
en-GB-2019         British English 2019
en-GB-2012         British English 2012
en-GB-2009         British English 2009
zh-Hans-2019       Chinese 2019
zh-Hans-2012       Chinese 2012
zh-Hans-2009       Chinese 2009
en-2019            English 2019
en-2012            English 2012
en-2009            English 2009
en-fiction-2019    English Fiction 2019
en-fiction-2012    English Fiction 2012
en-fiction-2009    English Fiction 2009
en-1M-2009         English One Million
fr-2019            French 2019
fr-2012            French 2012
fr-2009            French 2009
de-2019            German 2019
de-2012            German 2012
de-2009            German 2009
iw-2019            Hebrew 2019
iw-2012            Hebrew 2012
iw-2009            Hebrew 2009
es-2019            Spanish 2019
es-2012            Spanish 2012
es-2009            Spanish 2009
ru-2019            Russian 2019
ru-2012            Russian 2012
ru-2009            Russian 2009
it-2019            Italian 2019
it-2012            Italian 2012
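For example, the same kind of query can be run against different corpora by passing one of the codes above (a sketch; requires network access to the Ngram Viewer):

```r
library(ngramr)

# French 2019 corpus
fr <- ngram("ordinateur", corpus = "fr-2019", year_start = 1900)

# English Fiction 2012 corpus; note its data only extend to 2009
fic <- ngram("dragon", corpus = "en-fiction-2012",
             year_start = 1900, year_end = 2009)
```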

The Google Million is a sub-collection of Google Books. All are in English with dates ranging from 1500 to 2008. No more than about 6,000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980).

See http://books.google.com/ngrams/info for the full Ngram syntax.

Examples

library(ngramr)

ngram(c("mouse", "rat"), year_start = 1950)
ngram(c("blue_ADJ", "red_ADJ"))
ngram(c("_START_ President Roosevelt", "_START_ President Truman"), year_start = 1920)
