
ngramr (version 1.5.0)

ngram: Get n-gram frequencies

Description

ngram downloads data from the Google Ngram Viewer website and returns it in a data frame.

Usage

ngram(phrases, corpus = "eng_2012", year_start = 1500,
      year_end = 2008, smoothing = 3, count = FALSE,
      tag = NULL, case_ins = FALSE)

Arguments

phrases
vector of phrases, with a maximum of 12 items
corpus
Google corpus to search (see Details for possible values)
year_start
start year; default is 1500
year_end
end year; default is 2008
smoothing
smoothing parameter; default is 3
count
logical, indicating whether phrase counts should be returned as well as frequencies. Default is FALSE.
tag
apply a part-of-speech tag to the whole vector of phrases (e.g. "ADJ"; see Examples)
case_ins
logical, indicating whether to force a case-insensitive search. Default is FALSE. (A sketch combining several of these arguments follows this list.)
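As a hedged illustration of how these arguments combine (a sketch, not package-endorsed usage; the corpus code comes from the table under Details, and the phrase is chosen purely for illustration):

library(ngramr)

# Frequencies and raw counts of "steam engine" in British English,
# 1800-2000, with the default smoothing of 3.
se <- ngram("steam engine", corpus = "eng_gb_2012",
            year_start = 1800, year_end = 2000, count = TRUE)
head(se)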

Details

Google has generated two datasets drawn from digitised books in the Google Books collection: the first in July 2009, the second in July 2012. Google will update these datasets as book scanning continues.

This function provides the annual frequency of words or phrases, known as n-grams, in a sub-collection or "corpus" taken from the Google Books collection. The search across the corpus is case-sensitive. For a case-insensitive search use ngrami.
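For example (a minimal sketch; ngrami is the case-insensitive counterpart mentioned above, and case_ins = TRUE is assumed to have the equivalent effect within ngram itself):

# Case-sensitive by default: "Internet" and "internet" are distinct phrases.
ci <- ngram(c("Internet", "internet"), year_start = 1985)

# Two case-insensitive alternatives (assumed equivalent in effect):
ngrami("internet", year_start = 1985)
ngram("internet", year_start = 1985, case_ins = TRUE)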

Below is a table of the available corpora.

Corpus             Corpus Name
----------------   ---------------------
eng_us_2012        American English 2012
eng_us_2009        American English 2009
eng_gb_2012        British English 2012
eng_gb_2009        British English 2009
chi_sim_2012       Chinese 2012
chi_sim_2009       Chinese 2009
eng_2012           English 2012
eng_2009           English 2009
eng_fiction_2012   English Fiction 2012
eng_fiction_2009   English Fiction 2009
eng_1m_2009        Google One Million
fre_2012           French 2012
fre_2009           French 2009
ger_2012           German 2012
ger_2009           German 2009
heb_2012           Hebrew 2012
heb_2009           Hebrew 2009
spa_2012           Spanish 2012
spa_2009           Spanish 2009
rus_2012           Russian 2012
rus_2009           Russian 2009
ita_2012           Italian 2012

The Google One Million (eng_1m_2009) is a sub-collection of Google Books. All books are in English, with dates ranging from 1500 to 2008, and no more than about 6,000 books were chosen from any one year. This means that all of the scanned books from early years are present, while books from later years are randomly sampled. The random samples reflect the subject distributions for the year (so there are more computer books in 2000 than in 1980).

See http://books.google.com/ngrams/info for the full Ngram syntax.
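As a hedged example of that syntax (assuming, per the Ngram Viewer documentation linked above, that inline part-of-speech tags such as _VERB and _NOUN can be embedded in a phrase and passed straight through ngram's phrases argument):

# "flash" as a verb versus as a noun, using inline Ngram tags.
pos <- ngram(c("flash_VERB", "flash_NOUN"), year_start = 1900)
head(pos)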

Examples

library(ngramr)

# Compare the frequency of two phrases from 1950 onwards
freq <- ngram(c("mouse", "rat"), year_start = 1950)
head(freq)

# Restrict the search to adjectival uses of each phrase
freq <- ngram(c("blue", "red"), tag = "ADJ")
head(freq)

# Apply the START (sentence-boundary) tag to each phrase
freq <- ngram(c("President Roosevelt", "President Truman"), tag = "START", year_start = 1920)
head(freq)
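The returned data frame is in long format and can be plotted directly. A minimal sketch with ggplot2, assuming columns named Year, Phrase and Frequency (check names(freq) before relying on them):

library(ggplot2)

# One line per phrase over time
# (column names are an assumption; inspect names(freq) first).
ggplot(freq, aes(x = Year, y = Frequency, colour = Phrase)) +
  geom_line()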
