sentSplit: Sentence Splitting

Description

sentSplit - Splits turns of talk into individual sentences (provided proper punctuation is used). This procedure is usually done as part of the data read in and cleaning process.

sentCombine - Combines sentences by the same grouping variable together.

TOT - Converts the tot column from sentSplit to a turn of talk index (no sub sentence). Generally, for internal use.

sent_detect - Detects and splits sentences on endmark boundaries.
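To make the idea of splitting on endmark boundaries concrete, here is a minimal base-R sketch. It is not qdap's implementation: the helper name `split_on_endmarks` and the sample turn are made up for illustration, and it ignores the abbreviation and bracket handling that the real functions provide.

```r
# Illustrative only: a naive endmark splitter in base R (not qdap's code).
endmarks <- c("?", ".", "!", "|")

split_on_endmarks <- function(x, endmarks) {
  # Build a character class such as [?.!|]; these marks are literal inside [].
  cls <- paste(endmarks, collapse = "")
  pattern <- paste0("[^", cls, "]+[", cls, "]+")
  # Pull out each run of non-endmark text followed by its endmark(s).
  pieces <- regmatches(x, gregexpr(pattern, x))[[1]]
  trimws(pieces)
}

turn <- "I like it. Do you? Yes!"  # hypothetical turn of talk
sentences <- split_on_endmarks(turn, endmarks)
print(sentences)
```

Note that this sketch silently drops any trailing text that has no endmark, which is one reason the input needs proper punctuation.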

Usage

sentSplit(dataframe, text.var, rm.var = NULL, endmarks = c("?", ".", "!",
  "|"), incomplete.sub = TRUE, rm.bracket = TRUE, stem.col = FALSE,
  text.place = "right", ...)

sentCombine(text.var, grouping.var = NULL, as.list = FALSE)

TOT(tot)

sent_detect(text.var, endmarks = c("?", ".", "!", "|"),
  incomplete.sub = TRUE, rm.bracket = TRUE, ...)

Arguments

dataframe

A dataframe that contains the person and text variable.

text.var

The text variable.

rm.var

An optional character vector of length 1 or 2 naming the variables that are repeated measures (this will restart the "tot" column).

endmarks

A character vector of endmarks to split turns of talk into sentences.

incomplete.sub

logical. If TRUE detects incomplete sentences and replaces with "|".

rm.bracket

logical. If TRUE removes brackets from the text.

stem.col

logical. If TRUE stems the text as a new column.

text.place

A character string giving placement location of the text column. This must be one of the strings "original", "right" or "left".

...

Additional options passed to stem2df.

grouping.var

The grouping variables. Default NULL generates one word list for all text. Also takes a single grouping variable or a list of 1 or more grouping variables.

tot

A tot column from a sentSplit output.

as.list

logical. If TRUE returns the output as a list. If FALSE the output is returned as a dataframe.

Value

sentSplit - returns a dataframe with turn of talk broken apart into sentences. Optionally a stemmed version of the text variable may be returned as well.

sentCombine - returns a list of vectors with the continuous sentences by grouping.var pasted together.

TOT - returns a numeric vector of the turns of talk without sentence sub indexing (e.g. 3.2 becomes 3).

sent_detect - returns a character vector of sentences split on endmark.
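The turn of talk conversion can be pictured with a small base-R sketch. It assumes tot values have the form turn.sentence (such as 3.2), as described above. The helper `turn_only` and the sample vector are hypothetical and are not part of qdap.

```r
# Illustrative only: drop the sentence sub-index from tot values (not qdap's code).
tot <- c("1.1", "1.2", "2.1", "3.1", "3.2", "10.1")  # hypothetical tot column

turn_only <- function(tot) {
  # Keep everything before the first "." and convert to numeric.
  as.numeric(sub("\\..*$", "", as.character(tot)))
}

turns <- turn_only(tot)
print(turns)
```

Working on the character form rather than calling `floor()` on a number avoids any ambiguity between sub-indices such as 1.1 and 1.10.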

Warning

sentSplit requires the dialogue (text) column to be cleaned in a particular way. The data should contain qdap punctuation marks (c("?", ".", "!", "|")) at the end of each sentence. Additionally, extraneous punctuation, such as the periods in abbreviations, should be removed (see replace_abbreviation). Trailing sentences such as I thought I... will be treated as incomplete and marked with "|" to denote an incomplete/trailing sentence.
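Before calling sentSplit it may help to check which rows violate these expectations. The following base-R sketch flags rows that lack a final endmark or that end in a trailing ellipsis. The sample text and variable names are hypothetical, and the check is a rough pre-flight aid rather than anything qdap does internally.

```r
# Illustrative only: flag rows that may need cleaning before sentSplit.
endmarks <- c("?", ".", "!", "|")
text <- c("Computer is fun.", "No it's not", "What should we do?",
          "I thought I...")  # hypothetical dialogue column

trimmed <- trimws(text)
# Last character of each row ("" for empty rows).
last_char <- substring(trimmed, nchar(trimmed))
has_endmark <- last_char %in% endmarks
# Two or more periods at the end indicate a trailing (incomplete) sentence.
trailing <- grepl("\\.{2,}$", trimmed)

needs_attention <- !has_endmark | trailing
print(data.frame(text, has_endmark, trailing, needs_attention))
```

Rows flagged for a missing endmark need manual punctuation; rows flagged as trailing are the ones the warning says will be marked with "|".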

Examples

## `sentSplit` EXAMPLE:
(out <- sentSplit(DATA, "state"))
sentSplit(DATA, "state", stem.col = TRUE)
sentSplit(DATA, "state", text.place = "left")
sentSplit(DATA, "state", text.place = "original")
sentSplit(raj, "dialogue")[1:20, ]

## plotting
plot(out)
plot(out, grouping.var = "person")

out2 <- sentSplit(DATA2, "state", rm.var = c("class", "day"))
plot(out2)
plot(out2, grouping.var = "person")
plot(out2, grouping.var = "person", rm.var = "day")
plot(out2, grouping.var = "person", rm.var = c("day", "class"))

## `sentCombine` EXAMPLE:
dat <- sentSplit(DATA, "state")
sentCombine(dat$state, dat$person)
truncdf(sentCombine(dat$state, dat$sex), 50)

## `TOT` EXAMPLE:
dat <- sentSplit(DATA, "state")
TOT(dat$tot)

## `sent_detect` EXAMPLE:
sent_detect(DATA$state)
