topicsGrams

The function computes n-grams from a text.

Implements differential language analysis with statistical tests and offers various language visualization techniques for n-grams and topics. It also supports the 'text' package. For more information, visit <https://r-topics.org/> and <https://www.r-text.org/>.

Oscar Kjell

topics

Creating and Significance Testing Language Features for Visualisation

Leon Ackermann

Zhuojun Gu

topicsGrams function

<dl><dt>data</dt>
<dd>(tibble) The data</dd>
<dt>ngram_window</dt>
<dd>(list) The minimum and maximum n-gram length, e.g. c(1, 3)</dd>
<dt>stopwords</dt>
<dd>(stopwords) The stopwords to remove, e.g. stopwords::stopwords("en", source = "snowball")</dd>
<dt>occurance_rate</dt>
<dd>(numeric) The occurrence rate (0-1). Terms that occur in fewer than (occurance_rate) * (number of documents) documents are removed. Example: if the training dataset has 1000 documents and the occurrence rate is set to 0.05, terms that appear in fewer than 50 documents are removed.</dd>
<dt>removal_mode</dt>
<dd>(character) The mode of removal, either "term", "frequency" or "percentage"</dd>
<dt>removal_rate_most</dt>
<dd>(numeric) The rate of most frequent n-grams to remove</dd>
<dt>removal_rate_least</dt>
<dd>(numeric) The rate of least frequent n-grams to remove</dd>
<dt>pmi_threshold</dt>
<dd>(integer) The PMI (pointwise mutual information) threshold; set to 0 to disable it</dd>
<dt>top_frequent</dt>
<dd>(integer) The number of most frequently occurring n-grams to include in the output.</dd></dl>
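The effect of ngram_window and occurance_rate described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper make_ngrams, the toy documents, and the whitespace tokenisation are assumptions made purely for illustration.

```r
# Illustrative sketch only; topicsGrams itself may tokenise and filter differently.
docs <- c(
  "the cat sat",
  "the cat ran",
  "a dog sat"
)

# Hypothetical helper: all n-grams of a token vector for every n in ngram_window.
make_ngrams <- function(tokens, ngram_window = c(1, 2)) {
  out <- character(0)
  for (n in seq(ngram_window[1], ngram_window[2])) {
    if (length(tokens) < n) next  # document too short for this n
    starts <- seq_len(length(tokens) - n + 1)
    out <- c(out, vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  out
}

# Assumption: split on single spaces; unigrams and bigrams only.
tokens_per_doc <- strsplit(docs, " ", fixed = TRUE)
ngrams_per_doc <- lapply(tokens_per_doc, make_ngrams, ngram_window = c(1, 2))

# Document frequency: the number of documents in which each n-gram appears.
doc_freq <- table(unlist(lapply(ngrams_per_doc, unique)))

# Occurrence-rate filter: drop n-grams that appear in fewer than
# occurance_rate * (number of documents) documents.
occurance_rate <- 0.5
min_docs <- occurance_rate * length(docs)
kept <- names(doc_freq)[doc_freq >= min_docs]

print(sort(kept))
```

With three documents and a rate of 0.5, the cut-off is 1.5 documents, so only n-grams found in at least two documents survive. The same arithmetic gives the cut-off of 50 documents for the example of 1000 documents and a rate of 0.05.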


topicsGrams: N-grams
