udpipe: Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in TIF format

Description

Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in TIF format

Usage

udpipe(x, object, parallel.cores = 1L, parallel.chunksize, ...)

Arguments

either

a character vector: The character vector contains the text you want to tokenize, lemmatise, tag and perform dependency parsing. The names of the character vector indicate the document identifier.
a data.frame with columns doc_id and text: The text column contains the text you want to tokenize, lemmatise, tag and perform dependency parsing. The doc_id column indicate the document identifier.
a list of tokens: If you have already a tokenised list of tokens and you want to enrich it by lemmatising, tagging and performing dependency parsing. The names of the list indicate the document identifier.

All text data should be in UTF-8 encoding

object

either an object of class udpipe_model as returned by udpipe_load_model, the path to the file on disk containing the udpipe model or the language as defined by udpipe_download_model. If the language is provided, it will download the model using udpipe_download_model.

parallel.cores

integer indicating the number of parallel cores to use to speed up the annotation. Defaults to 1 (use only 1 single thread). If more than 1 is specified, it uses parallel::mclapply (unix) or parallel::clusterApply (windows) to run annotation in parallel. In order to do this on Windows it runs first parallel::makeCluster to set up a local socket cluster, on unix it just uses forking to parallelise the annotation. Only set this if you have more than 1 CPU at disposal and you have large amount of data to annotate as setting up a parallel backend also takes some time plus annotations will run in chunks set by parallel.chunksize and for each parallel chunk the udpipe model will be loaded which takes also some time. If parallel.cores is bigger than 1 and object is of class udpipe_model, it will load the corresponding file from the model again in each parallel chunk.

parallel.chunksize

integer with the size of the chunks of text to be annotated in parallel. If not provided, defaults to the size of x divided by parallel.cores. Only used in case parallel.cores is bigger than 1.

...

other elements to pass on to udpipe_annotate and udpipe_download_model

Value

a data.frame with one row per doc_id and term_id containing all the tokens in the data, the lemma, the part of speech tags, the morphological features and the dependency relationship along the tokens. The data.frame has the following fields:

doc_id: The document identifier.
paragraph_id: The paragraph identifier which is unique within each document.
sentence_id: The sentence identifier which is unique within each document.
sentence: The text of the sentence of the sentence_id.
start: Integer index indicating in the original text where the token starts. Missing in case of tokens part of multi-word tokens which are not in the text.
end: Integer index indicating in the original text where the token ends. Missing in case of tokens part of multi-word tokens which are not in the text.
term_id: A row identifier which is unique within the doc_id identifier.
token_id: Token index, integer starting at 1 for each new sentence. May be a range for multiword tokens or a decimal number for empty nodes.
token: The token.
lemma: The lemma of the token.
upos: The universal parts of speech tag of the token. See http://universaldependencies.org/format.html
xpos: The treebank-specific parts of speech tag of the token. See http://universaldependencies.org/format.html
feats: The morphological features of the token, separated by |. See http://universaldependencies.org/format.html
head_token_id: Indicating what is the token_id of the head of the token, indicating to which other token in the sentence it is related. See http://universaldependencies.org/format.html
dep_rel: The type of relation the token has with the head_token_id. See http://universaldependencies.org/format.html
deps: Enhanced dependency graph in the form of a list of head-deprel pairs. See http://universaldependencies.org/format.html
misc: SpacesBefore/SpacesAfter/SpacesInToken spaces before/after/inside the token. Used to reconstruct the original text. See http://ufal.mff.cuni.cz/udpipe/users-manual

The columns paragraph_id, sentence_id, term_id, start, end are integers, the other fields are character data in UTF-8 encoding.

References

https://ufal.mff.cuni.cz/udpipe, https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2364, http://universaldependencies.org/format.html

Examples

Run this code

# NOT RUN {
model    <- udpipe_download_model(language = "dutch-lassysmall")
if(!model$download_failed){
ud_dutch <- udpipe_load_model(model)

## Tokenise, Tag and Dependency Parsing Annotation. Output is in CONLL-U format.
txt <- c("Dus. Godvermehoeren met pus in alle puisten, 
  zei die schele van Van Bukburg en hij had nog gelijk ook. 
  Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont, 
  maar dat hoefden we geenseens niet te zingen. 
  Je kunt zeggen wat je wil van al die gesluierde poezenpas maar d'r kwam wel 
  een vleeswarenwinkel onder te voorschijn van heb je me daar nou.
  
  En zo gaat het maar door.",
  "Wat die ransaap van een academici nou weer in z'n botte pan heb gehaald mag 
  Joost in m'n schoen gooien, maar feit staat boven water dat het een gore 
  vieze vuile ransaap is.")
names(txt) <- c("document_identifier_1", "we-like-ilya-leonard-pfeiffer")

##
## TIF tagging: tag if x is a character vector, a data frame or a token sequence
##
x <- udpipe(txt, object = ud_dutch)
x <- udpipe(data.frame(doc_id = names(txt), text = txt, stringsAsFactors = FALSE), 
            object = ud_dutch)
x <- udpipe(strsplit(txt, "[[:space:][:punct:][:digit:]]+"), 
            object = ud_dutch)
       
## You can also directly pass on the language in the call to udpipe
x <- udpipe("Dit werkt ook.", object = "dutch-lassysmall")
x <- udpipe(txt, object = "dutch-lassysmall")
x <- udpipe(data.frame(doc_id = names(txt), text = txt, stringsAsFactors = FALSE), 
            object = "dutch-lassysmall")
x <- udpipe(strsplit(txt, "[[:space:][:punct:][:digit:]]+"), 
            object = "dutch-lassysmall")
}
            
## cleanup for CRAN only - you probably want to keep your model if you have downloaded it
if(file.exists(model$file_model)) file.remove(model$file_model)
# }

Run the code above in your browser using DataLab