Natural Language Processing for non-English languages with udpipe
BNOSAC is happy to announce the release of the udpipe R package (https://bnosac.github.io/udpipe/en) which is a Natural Language Processing toolkit that provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization', 'morphological feature tagging' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at http://universaldependencies.org/format.html.
Language models
The package provides direct access to language models trained on more than 50 languages. The following languages are directly available:
afrikaans, ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, serbian, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese
We hope that the package will allow other R users to build natural language applications on top of the resulting parts of speech tags, tokens, morphological features and dependency parsing output. And we hope in particular that applications will arise which are not limited to English only (like the textrank R package or the cleanNLP package to name a few)
Easy installation, great docs
- Note that the package has no external software dependencies (no java nor python) and depends only on 2 R packages (Rcpp and data.table), which makes the package easy to install on any platform. The package is available for download at https://CRAN.R-project.org/package=udpipe and is developed at https://github.com/bnosac/udpipe. A small docusaurus website is made available at https://bnosac.github.io/udpipe/en
- We hope you enjoy using it and we would like to thank Milan Straka for all the efforts done on UDPipe as well as all persons involved in http://universaldependencies.org
Training on Text Mining with R
Are you interested in text mining. Feel free to register for the upcoming course on text mining
- Location & Time: March 22-23 2018 in Leuven, Belgium
- Register at https://lstat.kuleuven.be/training/coursedescriptions/text-mining-with-r
Example
Want to get started with it right away? Example annotating Polish text in UTF-8 encoding, but you can pick any language of choice listed above. Enjoy.
library(udpipe)
model <- udpipe_download_model(language = "polish")
model <- udpipe_load_model(file = model$file_model)
x <- udpipe_annotate(model, x = "Budynek otrzymany od parafii wymaga remontu, a placówka nie otrzymała jeszcze żadnej dotacji.")
x <- as.data.frame(x)
x
LATEST BLOG POSTS
- Natural Language Processing for non-English languages with udpipe
- An overview of open data from Belgium
- CRAN search based on natural language processing
- Text Mining with R - upcoming courses in Belgium
- Is udpipe your new NLP processor for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing
- Machine Learning with R - upcoming course in Belgium
- Computer Vision Algorithms for R users
- Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger
- Scheduling R scripts and processes on Windows and Unix/Linux
- Detect Lines in Digital Images
- An overview of text mining visualisations possibilities with R on the CETA trade agreement
- BelgiumMaps.StatBel: R package with Administrative boundaries of Belgium
- Sentiment analysis and Parts of Speech tagging in Dutch/French/English/German/Spanish/Italian
- Text Mining with R - upcoming training schedule
- Good news from Belgium: Course on Applied spatial modelling with R (April 13-14)
- New RStudio add-in to schedule R scripts
- taskscheduleR: R package to schedule R scripts with the Windows task manager
- Web scraping with R & novel classification algorithms on unbalanced data
- Advanced R programming topics course in Leuven
转自:http://www.bnosac.be/index.php/blog/72-natural-language-processing-for-non-english-languages-with-udpipe