2016-11-15
0

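Creating another train.txt to train a sentiment model for a different domain

I found that the data in train.txt used to train the sentiment model is in PTB format. It looks like this: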

(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .)))) 

The actual sentence is

Yet the act is still charming here. 
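
To make the format concrete, here is a minimal sketch in plain Java (no CoreNLP dependency) that reads one such line and recovers the sentence-level label and the words. The class `SentimentLine` and its methods are hypothetical helpers written for this illustration, not part of any library. It assumes every node has the form `(label ...)` and that the labels follow the 0 to 4 scale of the Stanford Sentiment Treebank (0 very negative, 2 neutral, 4 very positive).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of CoreNLP): reads one sentiment-labeled PTB line. */
final class SentimentLine {

    /** Returns the leaf words of a bracketed tree such as "(3 (2 Yet) (2 .))", left to right. */
    static List<String> leaves(String tree) {
        List<String> words = new ArrayList<>();
        // Pad the parentheses so that a whitespace split yields clean tokens.
        String[] tokens = tree.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals("(")) {
                i++; // the token right after "(" is the node label, so skip it
            } else if (!tokens[i].equals(")")) {
                words.add(tokens[i]); // everything else is a word
            }
        }
        return words;
    }

    /** Returns the label of the root node, i.e. the sentiment class of the whole sentence. */
    static int rootLabel(String tree) {
        String[] tokens = tree.replace("(", " ( ").trim().split("\\s+");
        return Integer.parseInt(tokens[1]); // tokens[0] is "(", tokens[1] is the root label
    }

    public static void main(String[] args) {
        String line = "(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) "
                + "(4 charming))) (2 here)) (2 .))))";
        System.out.println(rootLabel(line));                // 3
        System.out.println(String.join(" ", leaves(line))); // Yet the act is still charming here .
    }
}
```

Note that the period comes back as its own token, which is why the tokenized sentence ends in `here .` rather than `here.`.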

But after parsing it, I get a different structure:

(ROOT (S (CC Yet) (NP (DT the) (NN act)) (VP (VBZ is) (ADJP (RB still) (JJ charming)) (ADVP (RB here))) (. .))) 

Here is my code:

import edu.stanford.nlp.ling.CoreAnnotations; 
import edu.stanford.nlp.pipeline.Annotation; 
import edu.stanford.nlp.pipeline.StanfordCoreNLP; 
import edu.stanford.nlp.trees.Tree; 
import edu.stanford.nlp.trees.TreeCoreAnnotations; 
import edu.stanford.nlp.util.CoreMap; 

import java.util.List; 
import java.util.Properties; 

public class ParseExample { 

    public static void main(String[] args) { 
        // create a StanfordCoreNLP pipeline with tokenization, sentence splitting, and parsing 
        Properties props = new Properties(); 
        props.setProperty("annotators", "tokenize, ssplit, parse"); 
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props); 

        // the text to parse 
        String text = "Yet the act is still charming here ."; 

        // create an empty Annotation just with the given text 
        Annotation annotation = new Annotation(text); 

        // run all annotators on this text 
        pipeline.annotate(annotation); 

        // these are all the sentences in this document; 
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types 
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class); 

        for (CoreMap sentence : sentences) { 
            // get the constituency parse tree of the current sentence 
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class); 
            System.out.println(tree); 
            tree.pennPrint(System.out); 
            // Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class); 
            // int sentiment = RNNCoreAnnotations.getPredictedClass(tree); 
        } 
    } 
} 

Two problems came up when I tried to create train.txt from my own sentences.

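1. My tree differs from the trees in train.txt. I know the numbers in the latter are sentiment polarity labels, but the tree structure itself also seems to be different. I want to get a binarized parse tree, which might look like this: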

((Yet) (((the) (act)) ((((is) ((still) (charming))) (here)) (.)))) 

Once I have the sentiment labels, I can fill them in to build my own train.txt.
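
As a rough sketch of what "filling in" could look like, the plain-Java snippet below replaces each node label in a bracketed tree with a sentiment label looked up by the phrase that node covers. Everything here is an assumption made for illustration: the class `SentimentRelabeler` is hypothetical, the phrase-to-label map stands in for whatever the annotators return, and the input is assumed to be already binarized with nodes of the form `(label ...)`. It does not collapse unary nodes such as a ROOT node above S, and a phrase that occurs twice in one sentence receives the same label both times.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of CoreNLP): swaps node labels for sentiment labels. */
final class SentimentRelabeler {
    private final String[] tokens;
    private final Map<String, Integer> labels;
    private int pos = 0;

    private SentimentRelabeler(String tree, Map<String, Integer> labels) {
        // Pad the parentheses so that a whitespace split yields clean tokens.
        this.tokens = tree.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        this.labels = labels;
    }

    /** Returns the tree with every node label replaced by the label of the phrase it covers. */
    static String relabel(String tree, Map<String, Integer> labels) {
        return new SentimentRelabeler(tree, labels).readNode()[0];
    }

    /** Reads one node and returns {relabeled subtree text, phrase covered by the subtree}. */
    private String[] readNode() {
        pos += 2; // consume "(" and the old label
        List<String> rendered = new ArrayList<>();
        List<String> words = new ArrayList<>();
        while (!tokens[pos].equals(")")) {
            if (tokens[pos].equals("(")) {
                String[] child = readNode();
                rendered.add(child[0]);
                words.add(child[1]);
            } else {
                rendered.add(tokens[pos]);
                words.add(tokens[pos]);
                pos++;
            }
        }
        pos++; // consume ")"
        String phrase = String.join(" ", words);
        Integer label = labels.get(phrase);
        if (label == null) {
            throw new IllegalArgumentException("No sentiment label for phrase: " + phrase);
        }
        return new String[] {"(" + label + " " + String.join(" ", rendered) + ")", phrase};
    }

    public static void main(String[] args) {
        Map<String, Integer> labels = new HashMap<>(); // stand-in for human annotations
        labels.put("still", 2);
        labels.put("charming", 4);
        labels.put("still charming", 3);
        System.out.println(relabel("(ADJP (RB still) (JJ charming))", labels));
        // (3 (2 still) (4 charming))
    }
}
```

Failing loudly on a missing phrase is a deliberate choice in this sketch, so that an unannotated node cannot silently end up in the training file with a default label.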

2. How do I get the phrase covered by each node of the binarized parse tree? In this example, I should get:

Yet 
the 
act 
the act 
is 
still 
charming 
still charming 
is still charming 
here 
is still charming here 
. 
is still charming here . 
the act is still charming here . 
Yet the act is still charming here. 
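
The list above corresponds to a post-order walk of the binarized tree: each node's phrase is emitted after the phrases of its children. A minimal plain-Java sketch of that walk over a bracketed tree string follows. The class `PhraseLister` is a hypothetical helper for illustration only, and it assumes every node is written as `(label ...)`, as in train.txt.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of CoreNLP): lists the phrase of every tree node. */
final class PhraseLister {
    private final String[] tokens;
    private final List<String> phrases = new ArrayList<>();
    private int pos = 0;

    private PhraseLister(String tree) {
        // Pad the parentheses so that a whitespace split yields clean tokens.
        tokens = tree.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    /** Returns the phrase of every node, children before parents (post-order). */
    static List<String> phrases(String tree) {
        PhraseLister lister = new PhraseLister(tree);
        lister.readNode();
        return lister.phrases;
    }

    /** Reads one "(label ...)" node, records its phrase after its children, and returns it. */
    private String readNode() {
        pos += 2; // consume "(" and the node label
        List<String> parts = new ArrayList<>();
        while (!tokens[pos].equals(")")) {
            if (tokens[pos].equals("(")) {
                parts.add(readNode());    // nested node: recurse
            } else {
                parts.add(tokens[pos++]); // plain word
            }
        }
        pos++; // consume ")"
        String phrase = String.join(" ", parts);
        phrases.add(phrase);
        return phrase;
    }

    public static void main(String[] args) {
        String tree = "(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) "
                + "(4 charming))) (2 here)) (2 .))))";
        for (String phrase : phrases(tree)) {
            System.out.println(phrase);
        }
    }
}
```

If the input contains unary chains (for example a ROOT node directly above S), the same phrase is listed once per node in the chain, so duplicates may need to be removed before sending the phrases to annotators.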

Once I have them, I can pay human annotators to label them.

I have Googled both problems a lot but could not solve them, so I am posting here. Any helpful answer would be appreciated!

Answers

2

Add this to the properties to get binary trees:

props.setProperty("parse.binaryTrees", "true"); 

The binary tree of the sentence can then be accessed like this:

Tree tree = sentence.get(TreeCoreAnnotations.BinarizedTreeAnnotation.class); 

Here is some sample code I wrote:

import edu.stanford.nlp.ling.CoreAnnotations; 
import edu.stanford.nlp.ling.Word; 
import edu.stanford.nlp.pipeline.Annotation; 
import edu.stanford.nlp.pipeline.StanfordCoreNLP; 
import edu.stanford.nlp.trees.*; 

import java.util.ArrayList; 
import java.util.Properties; 

public class SubTreesExample { 

    public static void printSubTrees(Tree inputTree, String spacing) { 
     if (inputTree.isLeaf()) { 
      return; 
     } 
     ArrayList<Word> words = new ArrayList<Word>(); 
     for (Tree leaf : inputTree.getLeaves()) { 
      words.addAll(leaf.yieldWords()); 
     } 
     System.out.print(spacing+inputTree.label()+"\t"); 
     for (Word w : words) { 
      System.out.print(w.word()+ " "); 
     } 
     System.out.println(); 
     for (Tree subTree : inputTree.children()) { 
      printSubTrees(subTree, spacing + " "); 
     } 
    } 

    public static void main(String[] args) { 
     Properties props = new Properties(); 
     props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse"); 
     props.setProperty("parse.binaryTrees", "true"); 
     StanfordCoreNLP pipeline = new StanfordCoreNLP(props); 
     String text = "Yet the act is still charming here."; 
     Annotation annotation = new Annotation(text); 
     pipeline.annotate(annotation); 
     Tree sentenceTree = annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0).get(
       TreeCoreAnnotations.BinarizedTreeAnnotation.class); 
     System.out.println("Penn tree:"); 
     sentenceTree.pennPrint(System.out); 
     System.out.println(); 
     System.out.println("Phrases:"); 
     printSubTrees(sentenceTree, ""); 

    } 
} 
+0

Great! If I want to train a Chinese sentiment model, do the sentences in train.txt still need to be binarized parse trees? @StanfordNLPHelp – ryh