quanteda (version 0.9.6-1)

ntoken: count the number of tokens or types

Description

Return the count of tokens (total features) or types (unique features) in a text, corpus, or dfm. Here "tokens" means every word occurrence, repeats included, rather than unique words. The texts are not cleaned before counting.
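The token/type distinction can be illustrated without quanteda. The following is a minimal base R sketch, not the package's tokenizer: the helper `simple_tokens` and its regular expression are assumptions made for illustration, treating each run of letters or digits and each punctuation mark as one token.

```r
# Illustrative base R sketch; this is NOT how quanteda tokenizes.
txt <- c(text1 = "This is a sentence, this.",
         text2 = "A word. Repeated repeated.")

# Hypothetical helper: each alphanumeric run and each punctuation
# mark becomes one token (an assumed rule, chosen for simplicity).
simple_tokens <- function(x) {
  toks <- regmatches(x, gregexpr("[[:alnum:]]+|[[:punct:]]", x))
  names(toks) <- names(x)
  toks
}

count_types <- function(toks) {
  vapply(toks, function(tk) length(unique(tk)), integer(1))
}

toks <- simple_tokens(txt)

# Tokens: every occurrence counts, repeats included.
n_tokens <- lengths(toks)        # 7 for text1, 6 for text2

# Types: only distinct tokens count ("This" and "this" still differ).
n_types <- count_types(toks)     # 7 for text1, 5 for text2

# Lowercasing leaves token counts unchanged but merges some types.
n_types_lower <- count_types(simple_tokens(tolower(txt)))  # 6 and 4

print(rbind(n_tokens, n_types, n_types_lower))
```

Under this assumed rule, punctuation marks count as tokens, which is why the counts are higher than the number of words. quanteda's own tokenizer may split the texts differently.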

Usage

ntoken(x, ...)

ntype(x, ...)

## S3 method for class 'corpus'
ntoken(x, ...)

## S3 method for class 'corpus'
ntype(x, ...)

## S3 method for class 'character'
ntoken(x, ...)

## S3 method for class 'tokenizedTexts'
ntoken(x, ...)

## S3 method for class 'character'
ntype(x, ...)

## S3 method for class 'dfm'
ntoken(x, ...)

## S3 method for class 'dfm'
ntype(x, ...)

## S3 method for class 'tokenizedTexts'
ntype(x, ...)

Arguments

x
texts, corpus, tokenizedTexts, or dfm whose tokens or types will be counted
...
additional arguments passed to tokenize

Value

  • count of the total tokens or types, one value per text or document (a single number when x contains only one text)

Examples

# simple example
txt <- c(text1 = "This is a sentence, this.", text2 = "A word. Repeated repeated.")
ntoken(txt)
ntype(txt)
ntoken(toLower(txt))  # same token counts as before
ntype(toLower(txt))   # fewer types, since case variants merge
ntoken(toLower(txt), removePunct = TRUE)
ntype(toLower(txt), removePunct = TRUE)

# with some real texts
ntoken(subset(inaugCorpus, Year < 1806), removePunct = TRUE)
ntype(subset(inaugCorpus, Year < 1806), removePunct = TRUE)
ntoken(dfm(subset(inaugCorpus, Year<1800)))
ntype(dfm(subset(inaugCorpus, Year<1800)))
