udpipe - R package for Tokenization, Tagging, Lemmatization and Dependency Parsing Based on UDPipe

This repository contains an R package which is an Rcpp wrapper around the UDPipe C++ library (http://ufal.mff.cuni.cz/udpipe, https://github.com/ufal/udpipe).

  • UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part of natural language processing.
  • The techniques used are explained in detail in the paper: "Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe", available at https://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf. In that paper, you'll also find accuracies on different languages and processing speed (measured in words per second).

General

The udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:

  • Give R users simple access in order to easily tokenize, tag, lemmatize or perform dependency parsing on text in any language
  • Provide easy access to pre-trained annotation models
  • Allow R users to easily construct their own annotation model based on data in CONLL-U format as provided in more than 100 treebanks available at http://universaldependencies.org
  • Don't rely on Python or Java so that R users can easily install this package without configuration hassle
  • No external R package dependencies except the strictly necessary ones (Rcpp and data.table, no tidyverse)

Installation & License

The package is available under the Mozilla Public License Version 2.0. Installation can be done as follows. Please visit the package documentation at https://bnosac.github.io/udpipe/en and look at the R package vignettes for further details.

install.packages("udpipe")
vignette("udpipe-tryitout", package = "udpipe")
vignette("udpipe-annotation", package = "udpipe")
vignette("udpipe-universe", package = "udpipe")
vignette("udpipe-usecase-postagging-lemmatisation", package = "udpipe")
# An overview of keyword extraction techniques: https://bnosac.github.io/udpipe/docs/doc7.html
vignette("udpipe-usecase-topicmodelling", package = "udpipe")
vignette("udpipe-parallel", package = "udpipe")
vignette("udpipe-train", package = "udpipe")

For installing the development version of this package: remotes::install_github("bnosac/udpipe", build_vignettes = TRUE)

Example

Currently the package allows you to do tokenization, tagging, lemmatization and dependency parsing with one convenient function called udpipe.

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
udmodel

    language                                                                             file_model
dutch-alpino C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-alpino-ud-2.5-191206.udpipe

x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
            object = udmodel)
x
 doc_id paragraph_id sentence_id start end term_id token_id     token     lemma  upos                                        xpos                               feats head_token_id      dep_rel            misc
   doc1            1           1     1   2       1        1        Ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             2        nsubj            <NA>
   doc1            1           1     4   7       2        2      ging      gaan  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             0         root            <NA>
   doc1            1           1     9  10       3        3        op        op   ADP                                     VZ|init                                <NA>             4         case            <NA>
   doc1            1           1    12  15       4        4      reis      reis  NOUN                  N|soort|ev|basis|zijd|stan              Gender=Com|Number=Sing             2          obl            <NA>
   doc1            1           1    17  18       5        5        en        en CCONJ                                    VG|neven                                <NA>             7           cc            <NA>
   doc1            1           1    20  21       6        6        ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             7        nsubj            <NA>
   doc1            1           1    23  25       7        7       nam     nemen  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             2         conj            <NA>
   doc1            1           1    27  29       8        8       mee       mee   ADP                                      VZ|fin                                <NA>             7 compound:prt   SpaceAfter=No
   doc1            1           1    30  30       9        9         :         : PUNCT                                         LET                                <NA>             7        punct            <NA>
...
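
Because the annotation is a regular data.frame with one row per token, ordinary R subsetting applies to it. As a minimal sketch that avoids a model download, the snippet below uses brussels_reviews_anno, the pre-annotated dataset bundled with the package, instead of the x object above. That dataset carries Penn Treebank-style tags in xpos (for example NN for nouns); on udpipe output such as x, one would typically filter on upos == "NOUN" instead. The choice of nouns and the top-10 cut-off are illustrative assumptions.

```r
library(udpipe)

# Pre-annotated AirBnB reviews shipped with the package (one row per token)
data(brussels_reviews_anno)

# Keep only the nouns (xpos tag "NN") from the Dutch reviews
nouns <- subset(brussels_reviews_anno, language == "nl" & xpos %in% "NN")

# Frequency table of noun lemmas, sorted by decreasing frequency
noun_freq <- txt_freq(nouns$lemma)
head(noun_freq, 10)
```

The same pattern works for any column of the annotation, for instance counting dep_rel values to see which dependency relations dominate a corpus.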

Pre-trained models

Pre-trained models built on Universal Dependencies treebanks are made available for more than 65 languages based on 101 treebanks, namely:

afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb.

These models are easily available to users of the package through udpipe_download_model.
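
As a hedged sketch of how one specific treebank model from the list above might be fetched and used: the model name english-ewt, the temporary download directory, and the example sentence are assumptions chosen for illustration. The download needs internet access, so the result is checked before loading; the download_failed and download_message fields are assumed to be present in the installed package version.

```r
library(udpipe)

# Download one specific treebank model into a temporary directory
dl <- udpipe_download_model(language = "english-ewt", model_dir = tempdir())

if (!isTRUE(dl$download_failed)) {
  # Load the model from disk and annotate a sentence
  model <- udpipe_load_model(file = dl$file_model)
  anno  <- udpipe_annotate(model, x = "The quick brown fox jumps over the lazy dog.")
  # Convert the annotation to a data.frame with one row per token
  anno  <- as.data.frame(anno)
  print(anno[, c("token", "lemma", "upos", "dep_rel")])
} else {
  message("Model download failed: ", dl$download_message)
}
```

Loading the model once and reusing the loaded object avoids reading the model file again for every call, which matters when annotating many documents.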

How good are these models?

Train your own models based on CONLL-U data

The package also allows you to build your own annotation model. For this, you need to provide data in CONLL-U format. These are provided for many languages at https://universaldependencies.org, mostly under the CC-BY-SA license. How this is done is detailed in the package vignette.

vignette("udpipe-train", package = "udpipe")
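
For orientation, a toy training run might look like the sketch below. It assumes the small CoNLL-U sample file that is understood to ship with the package under dummydata/, and it uses deliberately tiny settings (a single epoch and iteration, no parser) so that it finishes quickly. The resulting model is not useful for real annotation, and the option values are illustrative rather than recommended.

```r
library(udpipe)

# Small CoNLL-U sample; assumed to ship with the package under dummydata/
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")

# Train a tiny model: one epoch/iteration only, parser skipped to keep it fast
m <- udpipe_train(
  file = file.path(tempdir(), "toymodel.udpipe"),
  files_conllu_training = file_conllu,
  annotation_tokenizer = list(dimension = 16, epochs = 1,
                              batch_size = 100, dropout = 0.7),
  annotation_tagger = list(iterations = 1, models = 1, provide_xpostag = 1,
                           provide_lemma = 0, provide_feats = 0),
  annotation_parser = "none"
)

# The trained model file can be loaded like a downloaded one
model <- udpipe_load_model(m$file_model)
```

For a real model, the training vignette and udpipe_annotation_params are the places to look for sensible settings, and udpipe_accuracy can evaluate the result on holdout data.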

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

Version: 0.8.11
License: MPL-2.0
Monthly downloads: 6,460
Last published: January 6th, 2023

Functions in udpipe (0.8.11)

as_phrasemachine: Convert Parts of Speech tags to one-letter tags which can be used to identify phrases based on regular expressions
brussels_reviews: Reviews of AirBnB customers on Brussels address locations available at www.insideairbnb.com
as_cooccurrence: Convert a matrix to a co-occurrence data.frame
as_fasttext: Combine labels and text as used in fasttext
as.data.frame.udpipe_connlu: Convert the result of udpipe_annotate to a tidy data frame
as_conllu: Convert a data.frame to CONLL-U format
brussels_reviews_anno: Reviews of the AirBnB customers which are tokenised, POS tagged and lemmatised
brussels_listings: Brussels AirBnB address locations available at www.insideairbnb.com
as.matrix.cooccurrence: Convert the result of cooccurrence to a sparse matrix
cbind_morphological: Add morphological features to an annotated dataset
cooccurrence: Create a cooccurrence data.frame
brussels_reviews_w2v_embeddings_lemma_nl: An example matrix of word embeddings
document_term_matrix: Create a document/term matrix
cbind_dependencies: Add the dependency parsing information to an annotated dataset
dtm_bind: Combine 2 document term matrices either by rows or by columns
dtm_remove_tfidf: Remove terms from a Document-Term-Matrix and documents with no terms based on the term frequency inverse document frequency
dtm_chisq: Compare term usage across 2 document groups using the Chi-square Test for Count Data
dtm_reverse: Inverse operation of the document_term_matrix function
document_term_frequencies: Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document
dtm_align: Reorder a Document-Term-Matrix alongside a vector or data.frame
dtm_colsums: Column sums and Row sums for document term matrices
dtm_cor: Pearson Correlation for Sparse Matrices
dtm_remove_lowfreq: Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms
dtm_conform: Make sure a document term matrix has exactly the specified rows and columns
keywords_phrases: Extract phrases - a sequence of terms which follow each other based on a sequence of Parts of Speech tags
dtm_remove_terms: Remove terms from a Document-Term-Matrix and keep only documents which have at least some terms
paste.data.frame: Concatenate text of each group of data together
dtm_remove_sparseterms: Remove terms with high sparsity from a Document-Term-Matrix
predict.LDA_VEM: Predict method for an object of class LDA_VEM or class LDA_Gibbs
syntaxrelation-class: Experimental and undocumented querying of syntax relationships
txt_collapse: Collapse a character vector while removing missing data
document_term_frequencies_statistics: Add Term Frequency, Inverse Document Frequency and Okapi BM25 statistics to the output of document_term_frequencies
txt_contains: Check if text contains a certain pattern
txt_previous: Get the n-th previous element of a vector
txt_highlight: Highlight words in a character vector
txt_grepl: Look up multiple patterns and indicate their presence in text
keywords_rake: Keyword identification using Rapid Automatic Keyword Extraction (RAKE)
txt_previousgram: Based on a vector with a word sequence, get n-grams (looking backward)
txt_tagsequence: Identify a contiguous sequence of tags as being 1 entity
unique_identifier: Create a unique identifier for each combination of fields in a data frame
txt_context: Based on a vector with a word sequence, get n-grams (looking forward + backward)
txt_show: Boilerplate function to cat only 1 element of a character vector
udpipe_read_conllu: Read in a CONLL-U file as a data.frame
udpipe_annotate: Tokenising, Lemmatising, Tagging and Dependency Parsing Annotation of raw text
udpipe_train: Train a UDPipe model
dtm_sample: Random samples and permutations from a Document-Term-Matrix
udpipe: Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in TIF format
txt_freq: Frequency statistics of elements in a vector
udpipe_accuracy: Evaluate the accuracy of your UDPipe model on holdout data
syntaxpatterns-class: Experimental and undocumented querying of syntax patterns
strsplit.data.frame: Obtain a tokenised data frame by splitting text alongside a regular expression
txt_count: Count the number of times a pattern occurs in text
udpipe_annotation_params: List with training options set by the UDPipe community when building models based on the Universal Dependencies data
txt_next: Get the n-th next element of a vector
txt_nextgram: Based on a vector with a word sequence, get n-grams (looking forward)
unlist_tokens: Create a data.frame from a list of tokens
dtm_tfidf: Term Frequency - Inverse Document Frequency calculation
txt_recode: Recode text to other categories
udpipe_load_model: Load a UDPipe model
txt_recode_ngram: Recode words with compound multi-word expressions
udpipe_download_model: Download a UDPipe model provided by the UDPipe community for a specific language of choice
txt_overlap: Get the overlap between 2 vectors
dtm_svd_similarity: Semantic Similarity to a Singular Value Decomposition
keywords_collocation: Extract collocations - a sequence of terms which follow each other
txt_paste: Concatenate strings with options how to handle missing data
txt_sample: Boilerplate function to sample one element from a vector
txt_sentiment: Perform dictionary-based sentiment analysis on a tokenised data frame
as_word2vec: Convert a matrix of word vectors to word2vec format
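
To give a feel for how several of the functions listed above fit together, here is a minimal sketch that builds a document-term matrix and extracts RAKE keywords from the bundled brussels_reviews_anno data. The restriction to Dutch reviews, the Penn Treebank-style tags NN and JJ, and the frequency thresholds are illustrative assumptions, not recommended settings.

```r
library(udpipe)

# Pre-annotated reviews shipped with the package; keep the Dutch ones
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")

# Count how often each noun lemma occurs per review,
# then turn those counts into a sparse document-term matrix
dtf <- document_term_frequencies(x[x$xpos %in% "NN", c("doc_id", "lemma")])
dtm <- document_term_matrix(dtf)

# Drop terms occurring fewer than 5 times (and documents left without terms)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)
dim(dtm)

# RAKE keywords built from nouns and adjectives, grouped per review
keywords <- keywords_rake(x, term = "lemma", group = "doc_id",
                          relevant = x$xpos %in% c("NN", "JJ"))
head(subset(keywords, freq > 3))
```

The resulting sparse matrix can be passed on to the other dtm_* helpers (for example dtm_tfidf or dtm_cor) or to a topic modelling package.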