nzilbb.labbcat (version 1.0-1)

getMatches: Search for tokens.

Description

Searches through transcripts for tokens matching the given pattern.

Usage

getMatches(
  labbcat.url,
  pattern,
  participant.ids = NULL,
  transcript.types = NULL,
  main.participant = TRUE,
  aligned = FALSE,
  matches.per.transcript = NULL,
  words.context = 0,
  max.matches = NULL,
  overlap.threshold = NULL,
  page.length = 1000,
  no.progress = FALSE
)

Arguments

labbcat.url

URL to the LaBB-CAT instance

pattern

An object representing the pattern to search for.

Strictly speaking, this should be a named list that replicates the structure of the 'search matrix' in the LaBB-CAT browser interface, with one element called "columns", containing a named list for each column.

Each element in the "columns" named list contains an element named "layers", whose value is a named list of patterns to match on each layer, and optionally an element named "adj", whose value is a number representing the maximum distance, in tokens, between this column and the next column. If "adj" is not specified, the value defaults to 1, so tokens are contiguous.

Each element in the "layers" named list is named after the layer it matches, and the value is a named list with the following possible elements:

  • pattern A regular expression to match against the label

  • min An inclusive minimum numeric value for the label

  • max An exclusive maximum numeric value for the label

  • not TRUE to negate the match

  • anchorStart TRUE to anchor to the start of the annotation on this layer (i.e. the matching word token will be the first at/after the start of the matching annotation on this layer)

  • anchorEnd TRUE to anchor to the end of the annotation on this layer (i.e. the matching word token will be the last before/at the end of the matching annotation on this layer)

  • target TRUE to make this layer the target of the search; the results will contain one row for each match on the target layer
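
As a minimal sketch of the numeric elements, the following pattern matches on the frequency layer with an inclusive min and an exclusive max. Passing the values as strings is an assumption that follows the max = "2" usage elsewhere on this page, and in.range is a hypothetical local helper that only illustrates the interval semantics; it is not how the LaBB-CAT server evaluates the pattern.

```r
## words whose frequency is at least 1 and below 10
## (string values are an assumption, following max = "2" elsewhere on this page)
pattern <- list(columns = list(
    list(layers = list(
           frequency = list(min = "1", max = "10")))))

## hypothetical local helper, for illustration only:
## TRUE for labels that fall in the half-open interval [min, max)
in.range <- function(label, min, max) {
    as.numeric(label) >= as.numeric(min) & as.numeric(label) < as.numeric(max)
}

## the label "10" is excluded because max is exclusive
in.range(c("1", "9", "10"), min = "1", max = "10")
```

Because max is exclusive, a max of "2" on a frequency layer selects only labels below 2, which is why it can stand for words with a frequency of 1.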

Examples of valid pattern objects include:

## words starting with 'ps...'
pattern <- list(columns = list(
    list(layers = list(
           orthography = list(pattern = "ps.*")))))

## the word 'the' followed immediately or with one intervening word by
## a hapax legomenon (word with a frequency of 1) that doesn't start with a vowel
pattern <- list(columns = list(
    list(layers = list(
           orthography = list(pattern = "the")),
         adj = 2),
    list(layers = list(
           phonemes = list(not = TRUE, pattern = "[cCEFHiIPqQuUV0123456789~#\\$@].*"),
           frequency = list(max = "2")))))

For ease of use, the function will also accept the following abbreviated forms:

## a single list representing a 'one column' search, 
## and string values, representing regular expression pattern matching
pattern <- list(orthography = "ps.*")

## a list containing the columns (adj defaults to 1, so matching tokens are contiguous)...
pattern <- list(
    list(orthography = "the"),
    list(phonemes = list(not = TRUE, pattern = "[cCEFHiIPqQuUV0123456789~#\\$@].*"),
         frequency = list(max = "2")))
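
To make the relationship between the abbreviated and full forms concrete, here is a rough sketch of how an abbreviated pattern could be expanded into the "columns"/"layers" structure. The expand.pattern function is hypothetical and is not part of nzilbb.labbcat; the package's own conversion may differ. It assumes that every element of an abbreviated column names a layer, so "adj" is left at its default.

```r
## hypothetical helper, not part of nzilbb.labbcat: expand an abbreviated
## pattern into the full columns/layers structure
expand.pattern <- function(pattern) {
    ## already in full form: nothing to do
    if (!is.null(pattern[["columns"]])) return(pattern)
    ## a single named list is a one-column search, so wrap it in a list of columns
    if (!is.null(names(pattern))) pattern <- list(pattern)
    columns <- lapply(pattern, function(column) {
        layers <- lapply(column, function(layer) {
            ## a bare string stands for a regular expression on that layer
            if (is.character(layer)) list(pattern = layer) else layer
        })
        list(layers = layers)
    })
    list(columns = columns)
}

## one-column abbreviated form
full <- expand.pattern(list(orthography = "ps.*"))
str(full)
```

The result for the one-column example has the same structure as the full 'ps...' pattern shown earlier.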

participant.ids

An optional list of IDs of the participants whose utterances are to be searched. If not supplied, all utterances in the corpus will be searched.

transcript.types

An optional list of transcript types to limit the results to. If NULL, all transcript types will be searched.

main.participant

TRUE to search only main-participant utterances, FALSE to search all utterances.

aligned

TRUE to include only words that are aligned (i.e. have anchor confidence ≥ 50), FALSE to include un-aligned words as well.

matches.per.transcript

Optional maximum number of matches per transcript to return. NULL means all matches.

words.context

Number of words of context to include in the 'Before.Match' and 'After.Match' columns of the results.

max.matches

The maximum number of matches to return, or NULL to return all.

overlap.threshold

The percentage overlap with other utterances before simultaneous speech is excluded, or NULL to include overlapping speech.

page.length

In order to prevent timeouts when there are a large number of matches or the network connection is slow, rather than retrieving matches in one big request, they are retrieved using many smaller requests. This parameter controls the number of results retrieved per request.
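
As a back-of-the-envelope sketch of the trade-off, the number of requests grows as the page length shrinks. The requests.needed function below is hypothetical and assumes each request returns up to page.length matches, as the description above implies.

```r
## hypothetical helper: number of requests needed to retrieve match.count
## matches when each request returns at most page.length of them
requests.needed <- function(match.count, page.length = 1000) {
    ceiling(match.count / page.length)
}

requests.needed(2500)        # default page length: 3 requests
requests.needed(2500, 500)   # smaller pages: 5 requests
```

A smaller page.length means more round trips but less data per request, which may help on slow connections.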

no.progress

TRUE to suppress the visual progress bar. Otherwise, the progress bar will be shown when interactive().

Value

A data frame identifying matches, containing the following columns:

  • SearchName A name based on the pattern -- the same for all rows

  • MatchId A unique ID for the matching target token

  • Transcript Name of the transcript in which the match was found

  • Participant Name of the speaker

  • Corpus The corpus of the transcript

  • Line The start offset of the utterance/line

  • LineEnd The end offset of the utterance/line

  • Before.Match Transcript text immediately before the match

  • Text Transcript text of the match

  • After.Match Transcript text immediately after the match

  • Number Row number

  • URL URL of the first matching word token

  • Target.word Text of the target word token

  • Target.word.start Start offset of the target word token

  • Target.word.end End offset of the target word token

  • Target.segment Label of the target segment (only present if the segment layer is included in the pattern)

  • Target.segment.start Start offset of the target segment (only present if the segment layer is included in the pattern)

  • Target.segment.end End offset of the target segment (only present if the segment layer is included in the pattern)
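
As a small illustration of working with these columns, the sketch below derives a duration for each target word from its start and end offsets. The data frame is a toy stand-in whose values are invented for illustration, not real search results; only the column names come from the list above, and Target.word.duration is a column added here.

```r
## toy stand-in for a getMatches() result; all values are invented for illustration
results <- data.frame(
    MatchId = c("match-1", "match-2"),   # placeholders, not real MatchId values
    Target.word = c("the", "the"),
    Target.word.start = c(1.25, 7.50),
    Target.word.end = c(1.40, 7.75))

## duration of each target word token, in the same units as the offsets
results$Target.word.duration <- results$Target.word.end - results$Target.word.start
```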

See Also

getParticipantIds

Examples

# NOT RUN {
## define the LaBB-CAT URL
labbcat.url <- "https://labbcat.canterbury.ac.nz/demo/"

## create a pattern object to match against
pattern <- list(columns = list(
    list(layers = list(
           orthography = list(pattern = "the")),
         adj = 2),
    list(layers = list(
           phonemes = list(not=TRUE, pattern = "[cCEFHiIPqQuUV0123456789~#\\$@].*"),
           frequency = list(max = "2")))))

## get the tokens matching the pattern, excluding overlapping speech
results <- getMatches(labbcat.url, pattern, overlap.threshold = 5)

## results$MatchId can be used to access results
# }