Stanford NLP

Using the Stanford pipeline

In this section, we will discuss the Stanford pipeline in more detail. Although we have used it in several examples in this book, we have not fully explored its capabilities. Having used this pipeline before, you are now in a better position to understand how it can be used. Upon reading this section, you will be able to better assess its capabilities and applicability to your needs.

The edu.stanford.nlp.pipeline package holds the StanfordCoreNLP and annotator classes. The general approach uses the following code sequence, in which a text string is processed. The Properties object holds the annotator names, as shown here:

String text = "The robber took the cash and ran.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
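The annotators value is an ordered, comma-separated list, and later annotators generally build on the output of earlier ones (for example, pos needs tokens and sentences). The following is a minimal sketch, using only java.util.Properties, of how such a property can be set and read back as an ordered list of names. The class AnnotatorConfigSketch and the helper annotatorNames are hypothetical names introduced for illustration; they are not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative only: shows how the "annotators" property is structured.
class AnnotatorConfigSketch {

    // Build the same property list as in the text. setProperty restricts
    // both key and value to strings, unlike the inherited put method.
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators",
                "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        return props;
    }

    // Hypothetical helper: split the comma-separated value into an
    // ordered list of annotator names, trimming surrounding whitespace.
    static List<String> annotatorNames(Properties props) {
        List<String> names = new ArrayList<>();
        String value = props.getProperty("annotators", "");
        for (String name : value.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Print the names in the order they were listed.
        for (String name : annotatorNames(buildProps())) {
            System.out.println(name);
        }
    }
}
```

Using setProperty rather than put is a small design choice: it keeps non-string values out of the Properties object at compile time.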

The Annotation class represents the text to be processed. The constructor, used in the next code segment, takes the string and adds a CoreAnnotations.TextAnnotation entry to the Annotation object. The StanfordCoreNLP class's annotate method applies the annotators specified in the property list to the Annotation object:

Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);

The CoreMap interface is the base interface for all annotatable objects. It uses class objects for keys. The TextAnnotation annotation type is a CoreMap key for the text. A CoreMap key is intended to be used with various types of annotations, such as those produced by the annotators named in the properties list. The type of the value depends on the key type.
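The idea of a map keyed by class objects, where each key class declares the type of its value, can be sketched in plain Java. The following is a simplified stand-in written for illustration, not CoreNLP's actual implementation. The names TypedMapSketch, Key, TextKey, and TokenCountKey are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for a class-keyed, type-safe map.
class TypedMapSketch {

    // A key type declares, through V, the type of value stored under it.
    interface Key<V> { }

    // Hypothetical keys: one for the raw text, one for a token count.
    static final class TextKey implements Key<String> { }
    static final class TokenCountKey implements Key<Integer> { }

    // Class objects are the map keys; insertion order is preserved.
    private final Map<Class<?>, Object> values = new LinkedHashMap<>();

    // The compiler only accepts a value whose type matches the key's V.
    <V> void set(Class<? extends Key<V>> key, V value) {
        values.put(key, value);
    }

    // The cast is safe because set() is the only way to store a value.
    @SuppressWarnings("unchecked")
    <V> V get(Class<? extends Key<V>> key) {
        return (V) values.get(key);
    }

    // The set of key classes currently present in the map.
    Set<Class<?>> keySet() {
        return values.keySet();
    }

    public static void main(String[] args) {
        TypedMapSketch map = new TypedMapSketch();
        map.set(TextKey.class, "The robber took the cash and ran.");
        map.set(TokenCountKey.class, 8); // illustrative value

        // No casts are needed: the key class fixes the return type.
        String text = map.get(TextKey.class);
        Integer count = map.get(TokenCountKey.class);
        System.out.println(text);
        System.out.println(count);
        for (Class<?> c : map.keySet()) {
            System.out.println("Key: " + c.getSimpleName());
        }
    }
}
```

With this design, a call such as map.set(TextKey.class, 42) is rejected by the compiler, which is the main benefit of keying on class objects rather than on strings.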

The hierarchy of classes and interfaces is depicted in the following diagram. It is a simplified version of the relationship between classes and interfaces as they relate to the pipeline. The horizontal lines represent interface implementations and the vertical lines represent inheritance between classes.

To verify the effect of the annotate method, we will use the following code sequence. The keySet method returns a set of all of the annotation keys currently held by the Annotation object. These keys are displayed before and after the annotate method is applied:

System.out.println("Before annotate method executed ");
Set<Class<?>> annotationSet = annotation.keySet();
for (Class<?> c : annotationSet) {
    System.out.println("\tClass: " + c.getName());
}
pipeline.annotate(annotation);
System.out.println("After annotate method executed ");
annotationSet = annotation.keySet();
for (Class<?> c : annotationSet) {
    System.out.println("\tClass: " + c.getName());
}

The following output shows that creating the Annotation object resulted in the TextAnnotation key being added to the annotation. After the annotate method is executed, several additional annotations have been added:

Before annotate method executed
    Class: edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation
After annotate method executed
    Class: edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation
    Class: edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
    Class: edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
    Class: edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation

The CoreLabel class implements the CoreMap interface. It represents a single word with annotation information attached to it. The information attached depends on the properties set when the pipeline is created. However, positional information will always be available, such as the word's beginning and ending position or the whitespace before and after it.
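To make the notion of positional information concrete, here is a minimal sketch of a whitespace tokenizer that records, for each word, its character offsets and the whitespace on either side. It is a deliberately naive illustration: it does not separate punctuation and is not how CoreNLP tokenizes text. The class TokenOffsetsSketch and its Token type are hypothetical; in CoreNLP, comparable information is exposed by CoreLabel accessors such as beginPosition and endPosition.

```java
import java.util.ArrayList;
import java.util.List;

// Naive whitespace tokenizer that keeps positional information.
class TokenOffsetsSketch {

    // Hypothetical token type: the word plus where it sits in the text.
    static final class Token {
        final String word;
        final int begin;     // offset of the first character (inclusive)
        final int end;       // offset just past the last character (exclusive)
        final String before; // whitespace immediately preceding the word
        final String after;  // whitespace immediately following the word

        Token(String word, int begin, int end, String before, String after) {
            this.word = word;
            this.begin = begin;
            this.end = end;
            this.before = before;
            this.after = after;
        }
    }

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            // Consume the whitespace that precedes the next word.
            int wsStart = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String before = text.substring(wsStart, i);
            if (i == n) {
                break; // only trailing whitespace was left
            }
            // Consume the word itself.
            int begin = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int end = i;
            // Look ahead at the following whitespace without consuming it,
            // so it also becomes the "before" of the next token.
            int j = end;
            while (j < n && Character.isWhitespace(text.charAt(j))) {
                j++;
            }
            tokens.add(new Token(text.substring(begin, end), begin, end,
                    before, text.substring(end, j)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("The robber took the cash and ran.")) {
            System.out.println(t.word + " [" + t.begin + ", " + t.end + ")");
        }
    }
}
```

Because the offsets and surrounding whitespace are kept, the original text can be reconstructed from the tokens, which is one reason this kind of information is worth attaching to every token.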

The get method of either CoreMap or CoreLabel returns information specific to its argument. The method is generic rather than overloaded: the class object passed as the key determines the type of the value returned. For example, here is the declaration of the SentencesAnnotation class. It implements CoreAnnotation<List<CoreMap>>:

public static class CoreAnnotations.SentencesAnnotation extends Object implements CoreAnnotation<List<CoreMap>>

When SentencesAnnotation.class is passed to the get method, as in the following statement, the method returns a List<CoreMap> instance:

List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);

In a similar manner, the TokensAnnotation class implements CoreAnnotation<List<CoreLabel>> as shown here:

public static class CoreAnnotations.TokensAnnotation extends Object implements CoreAnnotation<List<CoreLabel>>

When TokensAnnotation.class is passed to the get method, it returns a list of CoreLabel objects that can be iterated over in a for-each statement:

for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
    // Process each token here
}

In previous chapters, we have used the SentencesAnnotation class to access the sentences in an annotation, as shown here:

List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);

The CoreLabel class has been used to access individual words in a sentence as demonstrated here:

for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        String pos = token.get(PartOfSpeechAnnotation.class);
    }
}

Annotator options can be found at http://nlp.stanford.edu/software/corenlp.shtml. The following code example illustrates how to set an annotator option, in this case the POS model. The pos.model property is set to the desired model using the Properties class's put method:

props.put("pos.model", "C:/.../Models/english-caseless-left3words-distsim.tagger");
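Because the pipeline is configured through a java.util.Properties object, options like this need not be hard-coded. As a rough sketch, the same keys could be kept in text in the standard properties format and parsed with Properties.load. The class PropertiesFileSketch, the parse helper, and the path models/example.tagger are hypothetical and used only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: parse pipeline options from properties-format text.
class PropertiesFileSketch {

    // Parse "key = value" lines into a Properties object.
    static Properties parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // "models/example.tagger" is a hypothetical path, not a real model.
        String text = String.join("\n",
                "annotators = tokenize, ssplit, pos",
                "pos.model = models/example.tagger");
        Properties props = parse(text);
        System.out.println(props.getProperty("annotators"));
        System.out.println(props.getProperty("pos.model"));
    }
}
```

Note that the properties format treats a backslash as an escape character, so Windows-style paths are usually easier to write with forward slashes, as in the example from the text.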

A summary of the annotators is found in the following table. The first column is the string used in the properties list. The second column lists only the basic annotation class, and the third column specifies how it is typically used:

Reference:

Natural Language Processing with Java
https://github.com/Einext/nlp-examples