udpipe - R package for Tokenization, Tagging, Lemmatization and Dependency Parsing Based on UDPipe

This repository contains an R package which is an Rcpp wrapper around the UDPipe C++ library (http://ufal.mff.cuni.cz/udpipe, https://github.com/ufal/udpipe).

  • UDPipe provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which are essential steps in natural language processing.
  • The techniques used are explained in detail in the paper "Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe", available at http://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf. The paper also reports accuracies for different languages and the processing speed (measured in words per second).

General

The udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:

  • Give R users a simple way to tokenize, tag, lemmatize or perform dependency parsing on text in any language
  • Provide easy access to pre-trained annotation models
  • Allow R users to easily construct their own annotation model based on data in CONLL-U format, as provided in more than 60 treebanks available at http://universaldependencies.org/#ud-treebanks
  • Don't rely on Python or Java, so that R users can install this package without configuration hassle
  • No external R package dependencies except the strictly necessary ones (Rcpp and data.table, no tidyverse)

Installation & License

The package is available under the Mozilla Public License Version 2.0. Installation can be done as follows. Please visit the package documentation at https://bnosac.github.io/udpipe/en and look at the R package vignettes for further details.

install.packages("udpipe")
vignette("udpipe-tryitout", package = "udpipe")
vignette("udpipe-annotation", package = "udpipe")
vignette("udpipe-usecase-postagging-lemmatisation", package = "udpipe")
# An overview of keyword extraction techniques: https://bnosac.github.io/udpipe/docs/doc7.html
vignette("udpipe-usecase-topicmodelling", package = "udpipe")
vignette("udpipe-parallel", package = "udpipe")
vignette("udpipe-train", package = "udpipe")

For installing the development version of this package: devtools::install_github("bnosac/udpipe", build_vignettes = TRUE)

Example

Currently the package allows you to do tokenization, tagging, lemmatization and dependency parsing with one convenient function called udpipe.

library(udpipe)
udmodel <- udpipe_download_model(language = "dutch")
udmodel

    language                                                                             file_model
dutch-alpino C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/udpipe/dutch-alpino-ud-2.4-190531.udpipe

x <- udpipe(x = "Ik ging op reis en ik nam mee: mijn laptop, mijn zonnebril en goed humeur.",
            object = udmodel)
x
 doc_id paragraph_id sentence_id start end term_id token_id     token     lemma  upos                                        xpos                               feats head_token_id      dep_rel            misc
   doc1            1           1     1   2       1        1        Ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             2        nsubj            <NA>
   doc1            1           1     4   7       2        2      ging      gaan  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             0         root            <NA>
   doc1            1           1     9  10       3        3        op        op   ADP                                     VZ|init                                <NA>             4         case            <NA>
   doc1            1           1    12  15       4        4      reis      reis  NOUN                  N|soort|ev|basis|zijd|stan              Gender=Com|Number=Sing             2          obl            <NA>
   doc1            1           1    17  18       5        5        en        en CCONJ                                    VG|neven                                <NA>             7           cc            <NA>
   doc1            1           1    20  21       6        6        ik        ik  PRON                VNW|pers|pron|nomin|vol|1|ev      Case=Nom|Person=1|PronType=Prs             7        nsubj            <NA>
   doc1            1           1    23  25       7        7       nam     nemen  VERB                               WW|pv|verl|ev Number=Sing|Tense=Past|VerbForm=Fin             2         conj            <NA>
   doc1            1           1    27  29       8        8       mee       mee   ADP                                      VZ|fin                                <NA>             7 compound:prt   SpaceAfter=No
   doc1            1           1    30  30       9        9         :         : PUNCT                                         LET                                <NA>             7        punct            <NA>
...
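The annotation is returned as a plain data.frame, so the usual base R tools apply. As a minimal sketch, the snippet below rebuilds by hand the nine rows printed above (so it runs without downloading a model; storing the id columns as character is an assumption about the real output) and uses head_token_id to look up the head of each token.

```r
# Rows copied by hand from the first nine tokens of the output shown above.
anno <- data.frame(
  token_id      = as.character(1:9),
  token         = c("Ik", "ging", "op", "reis", "en", "ik", "nam", "mee", ":"),
  lemma         = c("ik", "gaan", "op", "reis", "en", "ik", "nemen", "mee", ":"),
  upos          = c("PRON", "VERB", "ADP", "NOUN", "CCONJ", "PRON", "VERB", "ADP", "PUNCT"),
  head_token_id = c("2", "0", "4", "2", "7", "7", "2", "7", "7"),
  dep_rel       = c("nsubj", "root", "case", "obl", "cc", "nsubj", "conj", "compound:prt", "punct"),
  stringsAsFactors = FALSE
)

# Attach the head token to every row; head_token_id "0" marks the root,
# which has no head and therefore ends up as NA.
anno$head_token <- anno$token[match(anno$head_token_id, anno$token_id)]

# The root of the sentence and the lemmas of all verbs.
root        <- anno$token[anno$head_token_id == "0"]
verb_lemmas <- unique(anno$lemma[anno$upos == "VERB"])

# How often each universal part-of-speech tag occurs.
pos_counts <- table(anno$upos)

print(anno[, c("token", "dep_rel", "head_token")])
```

This lookup only works within a single sentence, because token ids restart in every sentence. For a full annotation, the match would have to be done per doc_id and sentence_id.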

Pre-trained models

Pre-trained models built on Universal Dependencies treebanks are available for more than 64 languages, based on 97 treebanks, namely:

afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb.

Users of the package can download any of these models with udpipe_download_model.
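As a hedged sketch of how a specific treebank model from the list above might be fetched and reused: the model name "dutch-lassysmall" comes from that list, while the temporary download directory, the example sentences and the document ids are illustrative choices. The download needs an internet connection.

```r
library(udpipe)

# Download one specific treebank model by name into a temporary directory
# (tempdir() is an illustrative choice; the default is the working directory).
m <- udpipe_download_model(language = "dutch-lassysmall", model_dir = tempdir())

# Load the model file once and reuse the loaded model for many texts.
ud_nl <- udpipe_load_model(file = m$file_model)

# Annotate a named character vector; the names are used as document ids.
txt  <- c(doc_a = "Ik ging op reis.", doc_b = "Mijn laptop is nieuw.")
anno <- as.data.frame(udpipe_annotate(ud_nl, x = txt, doc_id = names(txt)))

head(anno[, c("doc_id", "token", "lemma", "upos")])
```

Loading the model explicitly avoids re-reading the model file on every call, which matters when annotating many documents in a loop.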

How good are these models?

Train your own models based on CONLL-U data

The package also allows you to build your own annotation model. For this, you need to provide data in CONLL-U format. Such data is available for many languages at http://universaldependencies.org/#ud-treebanks, mostly under the CC-BY-SA license. The training procedure is detailed in the package vignette.

vignette("udpipe-train", package = "udpipe")
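For orientation, here is a minimal sketch of a training run in the spirit of that vignette. It assumes the small CONLL-U sample file bundled with the package (dummydata/traindata.conllu) and uses deliberately tiny, illustrative settings so that it finishes quickly. The resulting toy model is not useful for real annotation, and the output path and test sentence are hypothetical.

```r
library(udpipe)

# Tiny CONLL-U sample assumed to ship with the package; real training needs a full treebank.
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")

# Where to write the trained model (illustrative location).
model_file <- file.path(tempdir(), "toymodel.udpipe")

# Train a tokenizer and a small tagger; the parser is skipped with "none".
# Use "default" instead of these lists to get the standard UDPipe settings.
m <- udpipe_train(
  file = model_file,
  files_conllu_training = file_conllu,
  annotation_tokenizer = list(dimension = 16, epochs = 1, batch_size = 100, dropout = 0.7),
  annotation_tagger = list(iterations = 1, models = 1, provide_xpostag = 1,
                           provide_lemma = 0, provide_feats = 0),
  annotation_parser = "none"
)

# Load the trained model and annotate new text; no parser was trained, so skip parsing.
ud_toy <- udpipe_load_model(model_file)
anno   <- as.data.frame(udpipe_annotate(ud_toy, x = "Dit is een test.", parser = "none"))
```

For a model of practical quality, udpipe_accuracy can be used to evaluate the result on holdout data in CONLL-U format.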

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

Version: 0.8.3
License: MPL-2.0
Monthly downloads: 6,460
Last published: July 5th, 2019

Functions in udpipe (0.8.3)

  • dtm_remove_lowfreq: Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms
  • document_term_frequencies_statistics: Add Term Frequency, Inverse Document Frequency and Okapi BM25 statistics to the output of document_term_frequencies
  • document_term_frequencies: Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document
  • dtm_remove_sparseterms: Remove terms with high sparsity from a Document-Term-Matrix
  • keywords_phrases: Extract phrases - a sequence of terms which follow each other based on a sequence of Parts of Speech tags
  • keywords_collocation: Extract collocations - a sequence of terms which follow each other
  • txt_recode: Recode text to other categories
  • keywords_rake: Keyword identification using Rapid Automatic Keyword Extraction (RAKE)
  • txt_recode_ngram: Recode words with compound multi-word expressions
  • paste.data.frame: Concatenate text of each group of data together
  • txt_previousgram: Based on a vector with a word sequence, get n-grams (looking backward)
  • predict.LDA_VEM: Predict method for an object of class LDA_VEM or class LDA_Gibbs
  • txt_next: Get the n-th next element of a vector
  • txt_previous: Get the n-th previous element of a vector
  • strsplit.data.frame: Obtain a tokenised data frame by splitting text based on a regular expression
  • txt_nextgram: Based on a vector with a word sequence, get n-grams (looking forward)
  • brussels_reviews: Reviews of AirBnB customers on Brussels address locations available at www.insideairbnb.com
  • udpipe_annotation_params: List with training options set by the UDPipe community when building models based on the Universal Dependencies data
  • udpipe_annotate: Tokenising, Lemmatising, Tagging and Dependency Parsing Annotation of raw text
  • txt_contains: Check if text contains a certain pattern
  • cooccurrence: Create a cooccurrence data.frame
  • dtm_reverse: Inverse operation of the document_term_matrix function
  • dtm_tfidf: Term Frequency - Inverse Document Frequency calculation
  • txt_collapse: Collapse a character vector while removing missing data
  • cbind_morphological: Add morphological features to an annotated dataset
  • udpipe: Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in TIF format
  • document_term_matrix: Create a document/term matrix from a data.frame with 1 row per document/term
  • dtm_colsums: Column sums and Row sums for document term matrices
  • dtm_cor: Pearson Correlation for Sparse Matrices
  • dtm_bind: Combine 2 document term matrices either by rows or by columns
  • dtm_remove_tfidf: Remove terms from a Document-Term-Matrix and documents with no terms based on the term frequency inverse document frequency
  • udpipe_load_model: Load a UDPipe model
  • dtm_remove_terms: Remove terms from a Document-Term-Matrix and keep only documents which have at least some terms
  • udpipe_download_model: Download a UDPipe model provided by the UDPipe community for a specific language of choice
  • udpipe_accuracy: Evaluate the accuracy of your UDPipe model on holdout data
  • unique_identifier: Create a unique identifier for each combination of fields in a data frame
  • txt_highlight: Highlight words in a character vector
  • txt_tagsequence: Identify a contiguous sequence of tags as being 1 entity
  • txt_show: Boilerplate function to cat only 1 element of a character vector
  • txt_freq: Frequency statistics of elements in a vector
  • txt_sample: Boilerplate function to sample one element from a vector
  • udpipe_train: Train a UDPipe model
  • udpipe_read_conllu: Read in a CONLL-U file as a data.frame
  • txt_sentiment: Perform dictionary-based sentiment analysis on a tokenised data frame
  • as.matrix.cooccurrence: Convert the result of cooccurrence to a sparse matrix
  • as_cooccurrence: Convert a matrix to a co-occurrence data.frame
  • as_phrasemachine: Convert Parts of Speech tags to one-letter tags which can be used to identify phrases based on regular expressions
  • as.data.frame.udpipe_connlu: Convert the result of udpipe_annotate to a tidy data frame
  • as_conllu: Convert a data.frame to CONLL-U format
  • as_word2vec: Convert a matrix of word vectors to word2vec format
  • brussels_listings: Brussels AirBnB address locations available at www.insideairbnb.com
  • brussels_reviews_anno: Reviews of the AirBnB customers which are tokenised, POS tagged and lemmatised
  • cbind_dependencies: Add the dependency parsing information to an annotated dataset
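
Several of the functions listed here are meant to be chained on an annotated data frame. The sketch below shows one possible combination, assuming the bundled brussels_reviews_anno dataset with its language, upos and lemma columns. The restriction to Dutch nouns and adjectives and the frequency threshold of 5 are illustrative choices, not recommendations.

```r
library(udpipe)

# Annotated AirBnB reviews shipped with the package (no model download needed).
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl" & upos %in% c("NOUN", "ADJ"))

# Count how often each lemma occurs per document ...
dtf <- document_term_frequencies(x[, c("doc_id", "lemma")])

# ... and cast these counts to a sparse document-term matrix.
dtm <- document_term_matrix(dtf)

# Drop lemmas occurring fewer than 5 times overall (documents left empty are removed too).
dtm_small <- dtm_remove_lowfreq(dtm, minfreq = 5)
dim(dtm_small)

# Frequency table of the part-of-speech tags that were kept.
txt_freq(x$upos)
```

The resulting sparse matrix can then be passed on to, for example, dtm_tfidf or dtm_cor from the same list.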