ngram
ngram downloads data from the Google Ngram Viewer website and
returns it in a data frame.
ngram(phrases, corpus = "eng_2012", year_start = 1500, year_end = 2008, smoothing = 3, count = FALSE, tag = NULL, case_ins = FALSE)
This function provides the annual frequency of words or phrases, known as
n-grams, in a sub-collection or "corpus" taken from the Google Books collection.
The search across the corpus is case-sensitive. For a case-insensitive search,
use ngrami.
Below is a list of available corpora.
| Corpus | Corpus Name |
|---|---|
| eng_us_2012 | American English 2012 |
| eng_us_2009 | American English 2009 |
| eng_gb_2012 | British English 2012 |
| eng_gb_2009 | British English 2009 |
| chi_sim_2012 | Chinese 2012 |
| chi_sim_2009 | Chinese 2009 |
| eng_2012 | English 2012 |
| eng_2009 | English 2009 |
| eng_fiction_2012 | English Fiction 2012 |
| eng_fiction_2009 | English Fiction 2009 |
| eng_1m_2009 | Google One Million |
| fre_2012 | French 2012 |
| fre_2009 | French 2009 |
| ger_2012 | German 2012 |
| ger_2009 | German 2009 |
| heb_2012 | Hebrew 2012 |
| heb_2009 | Hebrew 2009 |
| spa_2012 | Spanish 2012 |
| spa_2009 | Spanish 2009 |
| rus_2012 | Russian 2012 |
| rus_2009 | Russian 2009 |
| ita_2012 | Italian 2012 |
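Because the corpus argument is passed as a short code, it can help to check a code against the list of corpora above before making a request. The following is a minimal sketch in base R. The names corpora and check_corpus are hypothetical helpers written for illustration and are not part of the package.

```r
# Corpus codes and names copied from the list of corpora above.
corpora <- c(
  eng_us_2012      = "American English 2012",
  eng_us_2009      = "American English 2009",
  eng_gb_2012      = "British English 2012",
  eng_gb_2009      = "British English 2009",
  chi_sim_2012     = "Chinese 2012",
  chi_sim_2009     = "Chinese 2009",
  eng_2012         = "English 2012",
  eng_2009         = "English 2009",
  eng_fiction_2012 = "English Fiction 2012",
  eng_fiction_2009 = "English Fiction 2009",
  eng_1m_2009      = "Google One Million",
  fre_2012         = "French 2012",
  fre_2009         = "French 2009",
  ger_2012         = "German 2012",
  ger_2009         = "German 2009",
  heb_2012         = "Hebrew 2012",
  heb_2009         = "Hebrew 2009",
  spa_2012         = "Spanish 2012",
  spa_2009         = "Spanish 2009",
  rus_2012         = "Russian 2012",
  rus_2009         = "Russian 2009",
  ita_2012         = "Italian 2012"
)

# Hypothetical helper (an assumption, not a package function):
# returns the corpus name for a known code and stops otherwise.
check_corpus <- function(corpus) {
  if (!corpus %in% names(corpora)) {
    stop("Unknown corpus code: ", corpus)
  }
  corpora[[corpus]]
}

check_corpus("eng_fiction_2012")
```

A validated code can then be passed on as the corpus argument of ngram.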
The Google One Million corpus (eng_1m_2009) is a sub-collection of Google Books. All of its books are in English, with dates ranging from 1500 to 2008. No more than about 6,000 books were chosen from any one year, which means that all of the scanned books from early years are present, while books from later years are randomly sampled. The random samples reflect the subject distribution for each year (so there are more computer books in 2000 than in 1980).
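The effect of a per-year cap can be illustrated with a toy sketch in base R. The yearly book counts below are invented for illustration, sample_year is a hypothetical helper, and the sketch ignores the subject-stratified sampling described above. It only shows that years under the cap keep every book while years over the cap are reduced to a random sample.

```r
# Toy illustration only: these yearly book counts are made up.
cap <- 6000
books_per_year <- c(`1600` = 40, `1800` = 2500, `2000` = 90000)

# Hypothetical helper: keep every book when the year is under the cap,
# otherwise draw a random sample of `cap` book indices.
sample_year <- function(n_books, cap) {
  ids <- seq_len(n_books)
  if (n_books <= cap) ids else sort(sample(ids, cap))
}

set.seed(1)  # make the random draw reproducible
chosen <- lapply(books_per_year, sample_year, cap = cap)

# Number of books retained per year
sapply(chosen, length)
```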
See http://books.google.com/ngrams/info for the full Ngram syntax.
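library(ngramr)

# Frequencies of "mouse" and "rat" from 1950 onwards
freq <- ngram(c("mouse", "rat"), year_start = 1950)
head(freq)
# Restrict the search to uses of the words as adjectives
freq <- ngram(c("blue", "red"), tag = "ADJ")
head(freq)
# Match the phrases only at the start of a sentence
freq <- ngram(c("President Roosevelt", "President Truman"), tag = "START", year_start = 1920)
head(freq)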