read_dir_transcript: Read In Multiple Transcript Files From a Directory

Description

Read in multiple transcript files from a directory and create a base::data.frame().

Usage

read_dir_transcript(
  path,
  col.names = c("Document", "Person", "Dialogue"),
  pattern = NULL,
  all.files = FALSE,
  recursive = FALSE,
  skip = 0,
  merge.broke.tot = TRUE,
  header = FALSE,
  dash = "",
  ellipsis = "...",
  quote2bracket = FALSE,
  rm.empty.rows = TRUE,
  na = "",
  sep = NULL,
  comment.char = "",
  max.person.nchar = 20,
  ignore.case = FALSE,
  verbose = FALSE,
  ...
)

Arguments

path

Path to the directory.

col.names

A character vector specifying the column names of the transcript columns (document, person, dialogue).

pattern

An optional regular expression. Only files whose names match the regular expression will be read.

all.files

Logical. If FALSE, only visible files are read. If TRUE, all files are read, including hidden ones.

recursive

Logical. Should the listing recurse into directories?

skip

Integer; the number of lines of the data file to skip before beginning to read data.

merge.broke.tot

Logical. If TRUE and the file being read in is a .docx with a broken space within a single turn of talk, read_transcript() will attempt to merge the pieces into a single turn of talk.

header

Logical. If TRUE, the file contains the names of the variables as its first line.

dash

A character string to replace the en dash and em dash special characters (the default, "", removes them).

ellipsis

A character string to replace the ellipsis special characters.

quote2bracket

logical. If TRUE replaces curly quotes with curly braces (default is FALSE). If FALSE curly quotes are removed.

rm.empty.rows

logical. If TRUE read_transcript() attempts to remove empty rows.

na

A character string to be interpreted as an NA value.

sep

The field separator character. Values on each line of the file are separated by this character. The default of NULL instructs read_transcript() to use a separator suitable for the file type being read in.

comment.char

A character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments altogether.

max.person.nchar

The maximum number of characters that person names are expected to contain. This information is used to warn the user if a separator appears beyond this length in the text.

ignore.case

Logical. If TRUE, case is ignored when matching the pattern argument.

verbose

Logical. Should each iteration of the read-in be reported?

...

Ignored.
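
The descriptions of pattern, all.files, recursive, and ignore.case read like those of the same-named arguments of base::list.files(). The sketch below assumes that read_dir_transcript() selects files in a comparable way (this page does not confirm it) and uses only base R on a throwaway directory. The file names are hypothetical.

```r
# Build a throwaway directory with a few hypothetical transcript files.
dir <- file.path(tempdir(), "transcripts_demo")
dir.create(file.path(dir, "archive"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(dir, c("trans1.docx", "Trans2.DOCX", "notes.txt", ".hidden.docx")))
file.create(file.path(dir, "archive", "trans3.docx"))

# pattern: keep only file names that match the regular expression.
docx_only <- list.files(dir, pattern = "\\.docx$")

# ignore.case: match the extension regardless of case.
docx_any_case <- list.files(dir, pattern = "\\.docx$", ignore.case = TRUE)

# all.files: also include hidden files (names starting with a dot).
with_hidden <- list.files(dir, pattern = "\\.docx$", all.files = TRUE)

# recursive: descend into sub-directories such as "archive".
nested <- list.files(dir, pattern = "\\.docx$", recursive = TRUE)
```

Trying a pattern with list.files() first is a cheap way to check which files a given combination of arguments would pick up before reading anything in.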

Value

Returns a data frame with document, person, and dialogue columns.
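
As a rough illustration of what can be done with a result of this shape, the sketch below builds a small stand-in data frame by hand using the default col.names. The values are invented for illustration and are not output from the package.

```r
# Hypothetical stand-in for the returned data frame (values are invented).
transcripts <- data.frame(
    Document = c("trans1", "trans1", "trans2"),
    Person   = c("Teacher", "Sam", "Teacher"),
    Dialogue = c("Good morning.", "Hi.", "Please sit down."),
    stringsAsFactors = FALSE
)

# Number of turns of talk in each document.
turns_per_doc <- table(transcripts$Document)

# All dialogue attributed to one person, across documents.
teacher_lines <- transcripts$Dialogue[transcripts$Person == "Teacher"]

# One data frame per document.
by_doc <- split(transcripts, transcripts$Document)
```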

Examples

# NOT RUN {
skips <- c(0, 1, 1, 0, 0, 1)
path <- system.file("docs/transcripts", package = 'textreadr')
textreadr::peek(read_dir_transcript(path, skip = skips), Inf)

# }
# NOT RUN {
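## with additional cleaning
## library() attaches one package per call, so each package gets its own call
library(textreadr)
library(tidyverse)
library(textshape)
library(textclean)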

path %>%
    read_dir_transcript(skip = skips) %>%
    textclean::filter_row("Person", "^\\[") %>%
    mutate(
        Person = stringi::stri_replace_all_regex(Person, "(^/\\s*)|(:\\s*$)", "") %>%
            trimws(),
        Dialogue = stringi::stri_replace_all_regex(Dialogue, "(^/\\s*)", "")
    ) %>%
    peek(Inf)
# }
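
To make the regular expressions in the cleaning pipeline easier to follow, here is a minimal base R sketch of the same steps applied to a few hand-written strings. The input values are hypothetical, and the sketch assumes that filter_row() drops the rows whose Person value matches the pattern.

```r
# Hypothetical raw values, invented for illustration.
person   <- c("/ Teacher: ", "Sam:", "[Cross Talk 00:01:15]")
dialogue <- c("/ Good morning.", "Hi.")

# Drop entries whose Person value starts with "[" (bracketed annotations).
keep <- !grepl("^\\[", person)

# Strip a leading "/" and a trailing ":" (plus adjacent whitespace), then trim.
person_clean <- trimws(gsub("(^/\\s*)|(:\\s*$)", "", person[keep]))

# Strip a leading "/" (plus following whitespace) from the dialogue.
dialogue_clean <- gsub("(^/\\s*)", "", dialogue)
```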
