textTinyR (version 1.1.2)

token_stats: token statistics

Description

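Computes token statistics (frequency distribution, character counts, collocations, string dissimilarity and n-gram look-up tables) for a character string vector, a file or a folder of files.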

Usage

# utl <- token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL,
#                        file_delimiter = ' ', n_gram_delimiter = "_")

Arguments

x_vec

either NULL or a character vector of strings

path_2folder

either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter)

path_2file

either NULL or a valid path to a file

file_delimiter

either NULL or a character string specifying the file delimiter

n_gram_delimiter

either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function

subset

either NULL or a vector specifying the subset of data to keep (number of rows of the print_frequency function)

number

a numeric value for the print_count_character function. All words with number of characters equal to the number parameter will be returned.

word

a character string for the print_collocations and print_prob_next functions

dice_n_gram

a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function

method

a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine.

split_separator

a character string specifying the separator used to split strings when method is cosine in the string_dissimilarity_matrix function. The cosine method works on sentences, so for a sentence such as "this_is_a_word_sentence" the split_separator should be "_"

dice_thresh

a floating-point number used as a threshold when method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer dice_thresh is to 0.0, the more values of the dissimilarity matrix will take the value 1.0.

upper

either TRUE or FALSE. If TRUE then both lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NA's

diagonal

either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NA's

threads

a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function

n_grams

a numeric value specifying the n-grams in the look_up_table function

n_gram

a character string specifying the n-gram to use in the print_words_lookup_tbl function

Format

An object of class R6ClassGenerator of length 24.

Methods

token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = ' ', n_gram_delimiter = "_")

--------------

path_2vector()

--------------

freq_distribution()

--------------

print_frequency(subset = NULL)

--------------

count_character()

--------------

print_count_character(number = NULL)

--------------

collocation_words()

--------------

print_collocations(word = NULL)

--------------

string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)

--------------

look_up_table(n_grams = NULL)

--------------

print_words_lookup_tbl(n_gram = NULL)

Details

the path_2vector function reads the words of a folder or file into a vector (the file_delimiter is used to split the input). Typical use: reading a vocabulary from a text file
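As a rough illustration of the idea behind path_2vector, the following base R sketch reads a delimited text file into a character vector. It is not the package's implementation: the temporary file and its contents are hypothetical, and scan() merely plays the role that file_delimiter plays in token_stats.

```r
# Hypothetical stand-in for a vocabulary file: words separated by a single space.
tmp_file <- tempfile(fileext = ".txt")
writeLines("the quick brown fox", tmp_file)

# scan() splits the input on `sep`, which here plays the role of file_delimiter.
words <- scan(tmp_file, what = character(), sep = " ", quiet = TRUE)
print(words)

# Remove the temporary file again.
unlink(tmp_file)
```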

the freq_distribution function returns a named, unsorted frequency-distribution vector for EITHER a folder, a file OR a character string vector. A specific subset of the result can be retrieved using the print_frequency function
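The kind of result described here can be approximated in base R with table(). This is only a sketch under assumptions: the token vector is made up, and the ordering of the output may differ from what freq_distribution and print_frequency actually return.

```r
# Hypothetical tokens; in token_stats they would come from x_vec, a file or a folder.
tokens <- c("one", "word", "token", "two", "words", "token")

# Count how often each distinct token occurs and store the counts as a named vector.
freq <- table(tokens)
freq_vec <- setNames(as.integer(freq), names(freq))

# Sort by decreasing frequency and keep the first two entries,
# loosely mirroring what a subset of the frequency table might look like.
sorted <- sort(freq_vec, decreasing = TRUE)
top_two <- head(sorted, 2)
print(top_two)
```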

the count_character function returns the number of characters for each word of the corpus for EITHER a folder, a file OR a character string vector. Words with a specific number of characters can be retrieved using the print_count_character function
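A minimal base R sketch of the same idea, assuming a small made-up token vector: nchar() gives the character count per word, and a logical filter selects the words of a given length, much as the number parameter is described for print_count_character.

```r
# Hypothetical tokens.
tokens <- c("one", "word", "token", "two", "words")

# Number of characters per token, named by the token itself.
n_chars <- setNames(nchar(tokens), tokens)

# Keep only the tokens that have exactly 4 characters.
four_letter <- names(n_chars)[n_chars == 4]
print(four_letter)
```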

the collocation_words function returns a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a character string vector. A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ). The input to the function should be text n-grams separated by a delimiter (for instance 3- or 4-grams). A specific frequency table can be retrieved using the print_collocations function
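To make the notion of a co-occurrence table concrete, the sketch below counts, for one word, which other words share an n-gram with it. The helper cooccur_with is hypothetical and the counting rule is an assumption; the package may count or normalise co-occurrences differently.

```r
# N-grams whose words are joined by a delimiter (the role of n_gram_delimiter).
n_grams <- c("one_word_token", "two_words_token",
             "three_words_token", "four_words_token")
delimiter <- "_"

# Split every n-gram into its words.
parts <- strsplit(n_grams, delimiter, fixed = TRUE)

# Hypothetical helper: count the words that appear in the same n-gram as `word`.
cooccur_with <- function(word, parts) {
  hits <- Filter(function(p) word %in% p, parts)
  others <- unlist(lapply(hits, function(p) p[p != word]))
  counts <- table(others)
  setNames(as.integer(counts), names(counts))
}

res <- cooccur_with("words", parts)
print(res)
```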

the string_dissimilarity_matrix function returns a string dissimilarity matrix using either the dice, levenshtein or cosine distance. The input can be a character string vector only. If the method is dice, the dice coefficient (similarity) is calculated between two strings for a specific number of character n-grams ( dice_n_gram ).
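The dice case can be sketched in base R as follows. The helpers char_ngrams and dice_dissimilarity are hypothetical, the coefficient is computed on sets of distinct character n-grams as 2 * |A ∩ B| / (|A| + |B|), and the dissimilarity is taken as one minus that value; the package may treat repeated n-grams or thresholds differently. For comparison, base R's utils::adist() gives a plain Levenshtein edit distance.

```r
# Distinct character n-grams of a single string.
char_ngrams <- function(s, n = 2) {
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  unique(substring(s, starts, starts + n - 1))
}

# One minus the dice coefficient of the two n-gram sets.
dice_dissimilarity <- function(a, b, n = 2) {
  ga <- char_ngrams(a, n)
  gb <- char_ngrams(b, n)
  if (length(ga) + length(gb) == 0) return(0)
  1 - 2 * length(intersect(ga, gb)) / (length(ga) + length(gb))
}

words <- c("night", "nacht", "light")

# Pairwise dissimilarity matrix over all words.
dism <- outer(words, words, Vectorize(dice_dissimilarity))
dimnames(dism) <- list(words, words)
print(dism)

# Levenshtein edit distance between two of the words, using base R.
lev <- utils::adist("night", "nacht")[1, 1]
print(lev)
```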

the look_up_table function returns a look-up list whose names are the n-grams and whose elements are vectors of the words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can be a character string vector only.
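A base R sketch of such a look-up list, assuming that the n-grams are character n-grams of each input string (as the 'e_w' value in the package example suggests). The helpers char_ngrams and build_lookup are hypothetical and not part of textTinyR.

```r
words <- c("one_word_token", "two_words_token",
           "three_words_token", "four_words_token")

# Distinct character n-grams of a single string.
char_ngrams <- function(s, n) {
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  unique(substring(s, starts, starts + n - 1))
}

# Map every n-gram to the words that contain it.
build_lookup <- function(words, n) {
  grams <- lapply(words, char_ngrams, n = n)
  pairs <- data.frame(gram = unlist(grams),
                      word = rep(words, lengths(grams)),
                      stringsAsFactors = FALSE)
  split(pairs$word, pairs$gram)
}

lut <- build_lookup(words, 3)

# Words associated with the 3-gram "e_w".
print(lut[["e_w"]])
```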

Examples

# NOT RUN {

library(textTinyR)

expl = c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')

tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)

#-------------------------
# frequency distribution:
#-------------------------

tk$freq_distribution()

# tk$print_frequency()


#------------------
# count characters:
#------------------

cnt <- tk$count_character()

# tk$print_count_character(number = 4)


#----------------------
# collocation of words:
#----------------------

col <- tk$collocation_words()

# tk$print_collocations(word = 'five')


#-----------------------------
# string dissimilarity matrix:
#-----------------------------

dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')


#---------------------
# build a look-up-table:
#---------------------

lut <- tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'e_w')
# }