
ngramr (version 1.5.0)

ngram: Get n-gram frequencies

Description

ngram downloads data from the Google Ngram Viewer website and returns it in a data frame.

Usage

ngram(phrases, corpus = "eng_2012", year_start = 1500,
      year_end = 2008, smoothing = 3, count = FALSE,
      tag = NULL, case_ins = FALSE)

Arguments

phrases
vector of phrases, with a maximum of 12 items
corpus
Google corpus to search (see Details for possible values)
year_start
start year; default is 1500
year_end
end year; default is 2008
smoothing
smoothing parameter; default is 3
count
logical, indicating whether phrase counts should be returned as well as frequencies. Default is FALSE.
tag
apply a part-of-speech tag to the whole vector of phrases (e.g. "ADJ"; see Examples)
case_ins
logical, indicating whether to force a case-insensitive search. Default is FALSE. (A sketch combining several of these arguments follows this list.)
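As a hedged illustration of how these arguments combine (a sketch, not package-endorsed usage; the corpus code comes from the table under Details, and the phrase is chosen purely for illustration):

library(ngramr)

# Frequencies and raw counts of "steam engine" in British English,
# 1800-2000, with the default smoothing of 3.
se <- ngram("steam engine", corpus = "eng_gb_2012",
            year_start = 1800, year_end = 2000, count = TRUE)
head(se)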

Details

Google has generated two datasets drawn from digitised books in the Google Books collection: the first in July 2009, the second in July 2012. Google will update these datasets as book scanning continues.

This function provides the annual frequency of words or phrases, known as n-grams, in a sub-collection or "corpus" taken from the Google Books collection. The search across the corpus is case-sensitive. For a case-insensitive search use ngrami.
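For example (a minimal sketch; ngrami is the case-insensitive counterpart mentioned above, and case_ins = TRUE is assumed to have the equivalent effect within ngram itself):

# Case-sensitive by default: "Internet" and "internet" are distinct phrases.
ci <- ngram(c("Internet", "internet"), year_start = 1985)

# Two case-insensitive alternatives (assumed equivalent in effect):
ngrami("internet", year_start = 1985)
ngram("internet", year_start = 1985, case_ins = TRUE)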

Below is a table of the available corpora.

Corpus             Corpus Name
----------------   ---------------------
eng_us_2012        American English 2012
eng_us_2009        American English 2009
eng_gb_2012        British English 2012
eng_gb_2009        British English 2009
chi_sim_2012       Chinese 2012
chi_sim_2009       Chinese 2009
eng_2012           English 2012
eng_2009           English 2009
eng_fiction_2012   English Fiction 2012
eng_fiction_2009   English Fiction 2009
eng_1m_2009        Google One Million
fre_2012           French 2012
fre_2009           French 2009
ger_2012           German 2012
ger_2009           German 2009
heb_2012           Hebrew 2012
heb_2009           Hebrew 2009
spa_2012           Spanish 2012
spa_2009           Spanish 2009
rus_2012           Russian 2012
rus_2009           Russian 2009
ita_2012           Italian 2012

The Google One Million (eng_1m_2009) is a sub-collection of Google Books. All books are in English, with dates ranging from 1500 to 2008, and no more than about 6,000 books were chosen from any one year. This means that all of the scanned books from early years are present, while books from later years are randomly sampled. The random samples reflect the subject distributions for the year (so there are more computer books in 2000 than in 1980).

See http://books.google.com/ngrams/info for the full Ngram syntax.
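As a hedged example of that syntax (assuming, per the Ngram Viewer documentation linked above, that inline part-of-speech tags such as _VERB and _NOUN can be embedded in a phrase and passed straight through ngram's phrases argument):

# "flash" as a verb versus as a noun, using inline Ngram tags.
pos <- ngram(c("flash_VERB", "flash_NOUN"), year_start = 1900)
head(pos)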

Examples

library(ngramr)

# Compare the frequency of two phrases from 1950 onwards
freq <- ngram(c("mouse", "rat"), year_start = 1950)
head(freq)

# Restrict the search to adjectival uses of each phrase
freq <- ngram(c("blue", "red"), tag = "ADJ")
head(freq)

# Apply the START (sentence-boundary) tag to each phrase
freq <- ngram(c("President Roosevelt", "President Truman"), tag = "START", year_start = 1920)
head(freq)
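The returned data frame is in long format and can be plotted directly. A minimal sketch with ggplot2, assuming columns named Year, Phrase and Frequency (check names(freq) before relying on them):

library(ggplot2)

# One line per phrase over time
# (column names are an assumption; inspect names(freq) first).
ggplot(freq, aes(x = Year, y = Frequency, colour = Phrase)) +
  geom_line()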
