
koRpus (version 0.04-40)

lex.div: Analyze lexical diversity

Description

This function analyzes the lexical diversity/complexity of a text corpus.

Usage

lex.div(txt, segment = 100, factor.size = 0.72,
    rand.sample = 42, window = 100, case.sens = FALSE,
    lemmatize = FALSE,
    measure = c("TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD"),
    char = c("TTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D", "MTLD"),
    char.steps = 5, force.lang = NULL, keep.tokens = FALSE,
    corp.rm.class = "nonpunct", corp.rm.tag = c(),
    quiet = FALSE)

Arguments

txt
An object of either class kRp.tagged-class, kRp.txt.freq-class, kRp.analysis-class or kRp.txt.trans-class, containing the tagged text to be analyzed.
segment
An integer value for MSTTR, defining how many tokens should form one segment.
factor.size
A real number between 0 and 1, defining the MTLD factor size.
rand.sample
An integer value defining how many tokens should be assumed to be drawn when calculating HD-D.
window
An integer value for MATTR, defining how many tokens the moving window should include.
case.sens
Logical, whether types should be counted case-sensitively.
lemmatize
Logical, whether analysis should be carried out on the lemmatized tokens rather than all running word forms.
measure
A character vector defining the measures which should be calculated. Valid elements are "TTR", "MSTTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D" and "MTLD".
char
A character vector defining whether data for plotting characteristic curves should be calculated. Valid elements are "TTR", "MATTR", "C", "R", "CTTR", "U", "S", "K", "Maas", "HD-D" and "MTLD".
char.steps
An integer value defining the step width, in tokens, for characteristic curves.
force.lang
A character string defining the language to be assumed for the text, by force. See details.
keep.tokens
Logical. If TRUE, all raw tokens and types will be preserved in the resulting object, in a slot called tt. For the types, their frequency in the analyzed text will also be listed.
corp.rm.class
A character vector with word classes which should be dropped. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, c("punct","sentc"), list.classes=TRUE) to be used.
corp.rm.tag
A character vector with POS tags which should be dropped.
quiet
Logical. If FALSE, short status messages will be shown.
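
For illustration, a call that sets several of these arguments explicitly could look like the following sketch; tagged.text is assumed to be a previously created tagged text object (e.g. the result of treetag() or tokenize()):

# tagged.text is assumed to be an already tagged/tokenized text object
ld.results <- lex.div(
  tagged.text,
  segment = 100,                        # segment size used by MSTTR
  window = 100,                         # moving window size used by MATTR
  factor.size = 0.72,                   # MTLD factor size
  case.sens = FALSE,                    # "Word" and "word" count as one type
  measure = c("TTR", "MATTR", "MTLD"),  # compute only these measures
  char = c()                            # skip characteristic curve data
)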

Value

An object of class kRp.TTR-class.

Details

lex.div calculates a variety of proposed indices for lexical diversity. In the following formulae, $N$ refers to the total number of tokens, and $V$ to the number of types:

"TTR": The ordinary Type-Token Ratio: $TTR = \frac{V}{N}$. Wrapper function: TTR

"MSTTR": For the Mean Segmental Type-Token Ratio, the text is split into consecutive segments of segment tokens each, the TTR of every segment is calculated, and the mean of these values is reported. Wrapper function: MSTTR

"MATTR": The Moving-Average Type-Token Ratio (Covington & McFall, 2010) calculates TTRs for a window of window tokens that is moved through the text one token at a time, and reports the mean of these TTRs. Wrapper function: MATTR

"C": Herdan's C (also called LogTTR): $C = \frac{\log V}{\log N}$. Wrapper function: C.ld

"R": Guiraud's Root TTR: $R = \frac{V}{\sqrt{N}}$. Wrapper function: R.ld

"CTTR": Carroll's Corrected TTR: $CTTR = \frac{V}{\sqrt{2N}}$. Wrapper function: CTTR

"U": Dugast's Uber Index: $U = \frac{(\log N)^2}{\log N - \log V}$. Wrapper function: U.ld

"S": Summer's index: $S = \frac{\log \log V}{\log \log N}$. Wrapper function: S.ld

"K": Yule's K is calculated from the frequency spectrum of the text: $K = 10^4 \times \frac{\left(\sum_{X} f_X X^2\right) - N}{N^2}$, where $f_X$ is the number of types occurring exactly $X$ times. Wrapper function: K.ld

"Maas": Maas' indices $a^2$ and $\log V_0$: $a^2 = \frac{\log N - \log V}{(\log N)^2}$ and $\log V_0 = \frac{\log V}{\sqrt{1 - \left(\frac{\log V}{\log N}\right)^2}}$. Unlike the other measures, lower values of $a^2$ indicate higher lexical diversity. Wrapper function: maas

"HD-D": An idealized version of vocd-D (see McCarthy & Jarvis, 2007): for every type, the probability of drawing it at least once in a random sample of rand.sample tokens is computed from the hypergeometric distribution, and these probabilities are summed up. Wrapper function: HDD

"MTLD": For the Measure of Textual Lexical Diversity (McCarthy & Jarvis, 2010) the text is processed token by token; each time the running TTR drops below factor.size, a factor count is incremented and the TTR is reset. MTLD is the total number of tokens divided by the number of (complete and partial) factors, computed once forwards and once backwards and then averaged. Wrapper function: MTLD
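
As a quick sanity check of the simpler formulae above, the ratio-based measures can be reproduced in a few lines of base R; the token vector below is purely illustrative:

tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "slept")
N <- length(tokens)          # total number of tokens: 9
V <- length(unique(tokens))  # number of types: 6

TTR  <- V / N                # Type-Token Ratio: 0.667
C    <- log(V) / log(N)      # Herdan's C
R    <- V / sqrt(N)          # Guiraud's R
CTTR <- V / sqrt(2 * N)      # Carroll's Corrected TTR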

By default, if the text still has to be tagged, the language definition is queried internally by calling get.kRp.env(lang=TRUE). If txt has already been tagged, the language definition of that tagged object is read and used instead. Set force.lang=get.kRp.env(lang=TRUE), or any other valid value, only if you want to forcibly overwrite this default behaviour. See kRp.POS.tags for all supported languages.
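
For example, to force English for a single analysis regardless of the language stored in the environment or in the tagged object (a hypothetical snippet, assuming English support is available):

# Override the language definition for this call only
lex.div(tagged.text, force.lang = "en")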

References

Covington, M.A. & McFall, J.D. (2010). Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94--100.

Maas, H.-D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73--96.

McCarthy, P.M. & Jarvis, S. (2007). vocd: A theoretical and empirical evaluation. Language Testing, 24(4), 459--488.

McCarthy, P.M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381--392.

Tweedie, F.J. & Baayen, R.H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 32(5), 323--352.

See Also

kRp.POS.tags, kRp.tagged-class, kRp.TTR-class

Examples

lex.div(tagged.text)
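
A more complete workflow might first tokenize a plain text file and then analyze the result; the file name "sample_text.txt" and the language "en" below are placeholders:

# Hypothetical workflow: tokenize a plain text file, then analyze it
tagged.text <- tokenize("sample_text.txt", lang = "en")
ld.results <- lex.div(tagged.text, measure = c("TTR", "MTLD"), char = c())
summary(ld.results)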
