NLP - Using Stanza library

Stanza Text Processing

Stanza is a Python NLP toolkit that supports 60+ human languages. It is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data, and it offers pretrained models on 100 treebanks. Additionally, Stanza provides a stable, officially maintained Python interface to the Java Stanford CoreNLP toolkit [1].

The sample text is a set of English sentences from Chapter 1 ("A SCANDAL IN BOHEMIA") of the book The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle.

Step 1 - Downloading model

Download an English model into the default directory. This step does not need to run every time: it is only required the first time an English model is used, or when the model needs to be updated.
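The download step can be sketched as below. The wrapper name `download_english_model` is hypothetical and exists only for illustration; the call it makes, `stanza.download("en")`, is Stanza's documented way to fetch the default English models. The import is placed inside the function so that defining it does not require the package to be installed.

```python
def download_english_model():
    """Fetch the default English models into Stanza's default directory.

    Hypothetical helper for illustration: call it once, the first time
    an English model is needed, or again to refresh the models.
    """
    import stanza  # imported lazily; requires `pip install stanza`

    stanza.download("en")
```

Calling `download_english_model()` needs network access, so it is usually run once by hand rather than on every execution.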

Step 2 - Creating pipeline
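A minimal sketch of this step, assuming the English models are already downloaded. The function names `build_english_pipeline` and `annotate` are hypothetical; `stanza.Pipeline(lang="en")` loads the default English processors, and calling the pipeline on a string returns an annotated document.

```python
def build_english_pipeline():
    """Create a Stanza pipeline with the default English processors."""
    import stanza  # imported lazily; requires `pip install stanza`

    return stanza.Pipeline(lang="en")


def annotate(nlp, text):
    """Run the pipeline on raw text and return the annotated document."""
    return nlp(text)


# Typical use (loads the models, so it can take a few seconds):
#   nlp = build_english_pipeline()
#   doc = annotate(nlp, text)   # `text` holds the sample sentences
```

Building the pipeline once and reusing it for every text avoids reloading the models on each call.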

Step 3 - Accessing annotations

NLP task: Splitting Sentences. Show the number of sentences in the text.
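Sentence splitting can be read off the annotated document, as in the sketch below. It assumes `doc` is the object returned by a Stanza pipeline, which exposes a `sentences` list whose items carry a `text` attribute; the helper name `split_sentences` is hypothetical.

```python
def split_sentences(doc):
    """Return the text of every sentence found in an annotated document."""
    return [sentence.text for sentence in doc.sentences]


# Typical use with a real pipeline:
#   import stanza
#   doc = stanza.Pipeline(lang="en")(text)
#   sentences = split_sentences(doc)
#   print(len(sentences))          # number of sentences in the text
#   for number, sentence in enumerate(sentences, start=1):
#       print(number, sentence)
```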

NLP task: Part of Speech tagging. Show annotations on the words of the sentences.
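The word-level annotations can be collected as in the following sketch. It assumes `doc` comes from a Stanza pipeline that includes the POS processor (the default English pipeline does), so each word has `text`, `upos` (universal POS tag) and `xpos` (treebank-specific tag) attributes. The helper name `pos_rows` is hypothetical.

```python
def pos_rows(doc):
    """Return (sentence number, word, universal POS, treebank POS) tuples."""
    rows = []
    for number, sentence in enumerate(doc.sentences, start=1):
        for word in sentence.words:
            rows.append((number, word.text, word.upos, word.xpos))
    return rows


# Typical use with a real pipeline:
#   import stanza
#   doc = stanza.Pipeline(lang="en")(text)
#   for number, word, upos, xpos in pos_rows(doc):
#       print(f"{number}\t{word}\t{upos}\t{xpos}")
```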

NLP task: Named Entity Recognition.
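Entities can be listed from the annotated document, as sketched below. The sketch assumes `doc` was produced by a Stanza pipeline that includes the NER processor (the default English pipeline does); such a document exposes an `ents` list whose items have `text` and `type` attributes. The helper name `named_entities` is hypothetical.

```python
def named_entities(doc):
    """Return (entity text, entity type) pairs for the whole document."""
    return [(entity.text, entity.type) for entity in doc.ents]


# Typical use with a real pipeline:
#   import stanza
#   doc = stanza.Pipeline(lang="en")(text)
#   for entity_text, entity_type in named_entities(doc):
#       print(entity_text, entity_type)
```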

Stanford CoreNLP interface

CoreNLP is your one-stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish [2].

NLP task: Constituency Parse Tree.
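A constituency parse can be requested through Stanza's CoreNLP client, as in the sketch below. It assumes Java and CoreNLP are installed locally and that the `CORENLP_HOME` environment variable points at the CoreNLP directory. The helper names, the `timeout` and the `memory` values are illustrative choices. The parse tree returned by the server is a nested structure whose nodes have a `value` and a list of `child` nodes, which `to_bracketed` renders in the usual bracketed notation.

```python
def to_bracketed(node):
    """Render a parse-tree node (with .value and .child) as a bracketed string."""
    if not node.child:
        return node.value  # leaf node: the word itself
    children = " ".join(to_bracketed(child) for child in node.child)
    return f"({node.value} {children})"


def parse_first_sentence(text):
    """Return the bracketed constituency parse of the first sentence in text."""
    from stanza.server import CoreNLPClient  # needs a local CoreNLP install

    with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos", "parse"],
        timeout=30000,  # milliseconds; illustrative value
        memory="4G",    # Java heap size; illustrative value
    ) as client:
        annotation = client.annotate(text)
        return to_bracketed(annotation.sentence[0].parseTree)
```

The client starts a Java server in the background when the `with` block is entered and shuts it down on exit, so parsing several texts inside one block is cheaper than calling `parse_first_sentence` repeatedly.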

References

[1] Stanza home page.
[2] Stanford CoreNLP home page.

