ngram downloads data from the Google Ngram Viewer website and returns it in a tibble.
ngram(
phrases,
corpus = "en-2019",
year_start = 1800,
year_end = 2020,
smoothing = 3,
case_ins = FALSE,
aggregate = FALSE,
count = FALSE,
drop_corpus = FALSE,
drop_parent = FALSE,
drop_all = FALSE,
type = FALSE
)
ngram returns an object of class "ngram", which is a tidyverse tibble enriched with attributes reflecting some of the parameters used in the Ngram Viewer query.
phrases: Vector of phrases, with a maximum of 12 items.
corpus: Google corpus to search (see Details for possible values).
year_start: Start year, default is 1800. Data available back to 1500.
year_end: End year, default is 2020.
smoothing: Smoothing parameter, default is 3.
case_ins: Logical indicating whether to force a case-insensitive search. Default is FALSE.
aggregate: Sum up the frequencies for ngrams associated with wildcard or case-insensitive searches. Default is FALSE.
count: Default is FALSE.
drop_corpus: When a corpus is specified directly with the ngram (e.g. dog:eng_fiction_2012), specifies whether the corpus should be retained in the phrase column of the results. Note that this method requires that the old corpus codes (eng_fiction_2012, not en-fiction-2012) are used. Default is FALSE.
drop_parent: Drop the parent phrase associated with a wildcard or case-insensitive search. Default is FALSE.
drop_all: Delete the suffix "(All)" from aggregated case-insensitive searches. Default is FALSE.
type: Include the Google return type (e.g. NGRAM, NGRAM_COLLECTION, EXPANSION) from the result set. Default is FALSE.
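Because phrases accepts at most 12 items, a longer list has to be split across several queries. The sketch below shows one possible way to do that in base R. The helper batch_phrases and the placeholder phrases are hypothetical and not part of the package; the query itself is guarded so that it only runs in an interactive session with ngramr installed.

```r
# Hypothetical helper (not part of ngramr): split a phrase vector into
# batches no longer than the 12-item limit described for `phrases`.
batch_phrases <- function(phrases, max_per_query = 12) {
  groups <- ceiling(seq_along(phrases) / max_per_query)
  unname(split(phrases, groups))
}

phrases <- paste0("word", 1:30)  # placeholder phrases for illustration
batches <- batch_phrases(phrases)
lengths(batches)                 # batch sizes: 12 12 6

# Run one query per batch; any element may be NULL if retrieval fails.
if (requireNamespace("ngramr", quietly = TRUE) && interactive()) {
  results <- lapply(batches, function(b) ngramr::ngram(b, year_start = 1950))
}
```

Each element of results would then be a separate "ngram" tibble (or NULL), so combining them is left to the caller.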
Google generated three datasets drawn from digitised books in the Google Books collection. The first was generated in July 2009, the second in July 2012 and the third in 2019. Google is expected to update these datasets as book scanning continues.
This function provides the annual frequency of words or phrases, known as n-grams, in a sub-collection or "corpus" taken from the Google Books collection. The search across the corpus is case-sensitive by default; set case_ins = TRUE to force a case-insensitive search.
If the function is unable to retrieve data from the Google Ngram Viewer site (either because of access issues or if the format of Google's site has changed) a NULL result is returned and messages are printed to the console but no errors or warnings are raised (this is to align with CRAN package policies).
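Since a failed retrieval produces NULL rather than an error, calling code should check the result before using it. The following is a minimal sketch of such a check. The helper describe_result is hypothetical and not part of the package, and the query is guarded so that it only runs in an interactive session with ngramr installed.

```r
# Hypothetical helper (not part of ngramr): summarise a result that may be NULL.
describe_result <- function(result) {
  if (is.null(result)) {
    return("no data returned")
  }
  sprintf("%d rows returned", nrow(result))
}

# ngram() signals failure by returning NULL, so test for that explicitly
# instead of relying on tryCatch().
if (requireNamespace("ngramr", quietly = TRUE) && interactive()) {
  result <- ngramr::ngram(c("mouse", "rat"), year_start = 1950)
  message(describe_result(result))
}
```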
Below is a list of available corpora. Note that the data for the 2012 corpora only extends to 2009.
Corpus | Corpus Name |
en-US-2019 | American English 2019 |
en-US-2012 | American English 2012 |
en-US-2009 | American English 2009 |
en-GB-2019 | British English 2019 |
en-GB-2012 | British English 2012 |
en-GB-2009 | British English 2009 |
zh-Hans-2019 | Chinese 2019 |
zh-Hans-2012 | Chinese 2012 |
zh-Hans-2009 | Chinese 2009 |
en-2019 | English 2019 |
en-2012 | English 2012 |
en-2009 | English 2009 |
en-fiction-2019 | English Fiction 2019 |
en-fiction-2012 | English Fiction 2012 |
en-fiction-2009 | English Fiction 2009 |
en-1M-2009 | English One Million |
fr-2019 | French 2019 |
fr-2012 | French 2012 |
fr-2009 | French 2009 |
de-2019 | German 2019 |
de-2012 | German 2012 |
de-2009 | German 2009 |
iw-2019 | Hebrew 2019 |
iw-2012 | Hebrew 2012 |
iw-2009 | Hebrew 2009 |
es-2019 | Spanish 2019 |
es-2012 | Spanish 2012 |
es-2009 | Spanish 2009 |
ru-2019 | Russian 2019 |
ru-2012 | Russian 2012 |
ru-2009 | Russian 2009 |
it-2019 | Italian 2019 |
it-2012 | Italian 2012 |
The Google Million is a sub-collection of Google Books. All are in English with dates ranging from 1500 to 2008. No more than about 6,000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980).
See http://books.google.com/ngrams/info for the full Ngram syntax.
ngram(c("mouse", "rat"), year_start = 1950)
ngram(c("blue_ADJ", "red_ADJ"))
ngram(c("_START_ President Roosevelt", "_START_ President Truman"), year_start = 1920)