2017-07-25
0

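Input format for the OpenNLP MaxEnt implementation?

I am trying to use the OpenNLP implementation of the maximum entropy classifier, but the documentation seems quite lacking. Although the library is apparently designed to be easy to use, I cannot find a single example or a specification of the input file format (i.e. the training set).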

Does anyone know where to find this, or a minimal training example?

Answers

3

OpenNLP is quite flexible about the input format. To use the MaxEnt classifier in OpenNLP, you need to go through a few steps.

Here is sample code with comments:

package example; 

import java.io.File; 
import java.io.IOException; 
import java.nio.charset.Charset; 
import java.util.Arrays; 
import java.util.HashMap; 
import java.util.Map; 

import opennlp.tools.ml.maxent.GISTrainer; 
import opennlp.tools.ml.model.Event; 
import opennlp.tools.ml.model.MaxentModel; 
import opennlp.tools.tokenize.WhitespaceTokenizer; 
import opennlp.tools.util.FilterObjectStream; 
import opennlp.tools.util.MarkableFileInputStreamFactory; 
import opennlp.tools.util.ObjectStream; 
import opennlp.tools.util.PlainTextByLineStream; 
import opennlp.tools.util.TrainingParameters; 

public class ReadData { 


    public static void main(String[] args) throws Exception{ 

     // The training data file. Each line has the form
     //   <features separated by whitespace> <outcome>
     // Change the file name to fit your needs.
     File f=new File("football.dat"); 

     // The trainer needs an ObjectStream of events.
     // First create an InputStreamFactory: given a file it can open a fresh InputStream,
     // which is required for resetting the stream.
     MarkableFileInputStreamFactory factory=new MarkableFileInputStreamFactory(f); 
     // Read the file line by line. Note: you can write your own stream to handle binary files
     // or records that span more than one line.
     ObjectStream<String> stream=new PlainTextByLineStream(factory, Charset.defaultCharset()); 
     // Convert the stream of strings into a stream of events.
     // This custom FilterObjectStream takes a line, breaks it into tokens, uses all tokens
     // except the last as the features (context) and the last token as the outcome class.
     ObjectStream<Event> eventStream=new FilterObjectStream<String, Event>(stream) { 
      @Override 
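      public Event read() throws IOException { 
       String line; 
       String[] parts; 
       // Skip blank lines: with no tokens there is no outcome to read, 
       // and Arrays.copyOf would be called with a negative length. 
       do { 
        line=samples.read(); 
        if (line==null) return null; 
        parts=WhitespaceTokenizer.INSTANCE.tokenize(line); 
       } while (parts.length==0); 

       String[] context=Arrays.copyOf(parts, parts.length-1); 

       // Debug output: outcome followed by its context features. 
       System.out.println(parts[parts.length-1]+" "+Arrays.toString(context)); 
       return new Event(parts[parts.length-1], context); 
      } 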
     }; 


     TrainingParameters parameters=new TrainingParameters(); 
     // By default OpenNLP uses a cutoff of 5 (a feature has to occur 5 times before it is used) 
     // use 1 for my small dataset 
     parameters.put(GISTrainer.CUTOFF_PARAM, 1); 

     GISTrainer trainer=new GISTrainer(); 
     // The report map is supposed to record where default values were assigned. 
     Map<String,String> reportMap=new HashMap<>(); 
     // Don't forget to initialize the trainer before calling train()! 
     trainer.init(parameters, reportMap); 
     MaxentModel model=trainer.train(eventStream); 

     // Now we have a model -- you should test on a test set, but 
     // this is a toy example... so I am just resetting the eventstream. 
     eventStream.reset(); 
     Event evt=null; 
     while ((evt=eventStream.read())!=null){ 
      System.out.print(Arrays.toString(evt.getContext())+": "); 
      // Evaluate the context from the event using our model. 
      // you would want to calculate summary statistics.. 
      double[] p=model.eval(evt.getContext()); 
      System.out.print(model.getBestOutcome(p)+" "); 
      if (model.getBestOutcome(p).equals(evt.getOutcome())){ 
       System.out.println("CORRECT"); 
      }else{ 
       System.out.println("INCORRECT");     
      } 
     } 

     // Close the stream to release the underlying file handle. 
     eventStream.close(); 

    } 

} 
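
The evaluation loop above only prints CORRECT or INCORRECT per event, and its comments mention that you would want summary statistics. A minimal sketch of one such statistic, plain accuracy, follows. It uses only the Java standard library; the class name `AccuracySketch` and the method `accuracy` are made up for illustration and are not part of OpenNLP. In the loop above you would collect `evt.getOutcome()` and `model.getBestOutcome(p)` into the two lists.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class AccuracySketch {

    // Returns the fraction of positions where the predicted outcome
    // equals the gold outcome. An empty input yields 0.0.
    static double accuracy(List<String> gold, List<String> predicted) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("gold and predicted must have the same length");
        }
        if (gold.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        // Toy labels in the style of the football example.
        List<String> gold = Arrays.asList("arsenal", "man_united", "tie", "tie");
        List<String> predicted = Arrays.asList("arsenal", "man_united", "tie", "arsenal");
        System.out.println("accuracy = " + accuracy(gold, predicted));
    }
}
```

Keep in mind that accuracy measured on the training data, as in this toy example, says little about how the model behaves on unseen data.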

football.dat:

home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal 
home=man_united Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous man_united 
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie 
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous tie 
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal 
home=man_united Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united 
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous man_united 
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal 
home=arsenal Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous arsenal 
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie 
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united 
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal 
home=arsenal Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united 
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal 
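
To make the line format concrete, here is a small sketch of how one line of this file is split into context features and an outcome, mirroring what the custom `FilterObjectStream` in the answer does. It uses only the Java standard library (`String.split` instead of OpenNLP's `WhitespaceTokenizer`), and the names `EventLineSketch`, `ParsedLine` and `parse` are hypothetical. Note that in this code a token such as `home=arsenal` is just an opaque string; the `=` has no special meaning.

```java
import java.util.Arrays;

// Hypothetical helper that mimics the line format used in the answer.
class EventLineSketch {

    // One parsed training line: the outcome label plus its context features.
    static final class ParsedLine {
        final String outcome;
        final String[] context;

        ParsedLine(String outcome, String[] context) {
            this.outcome = outcome;
            this.context = context;
        }
    }

    // Splits a line on whitespace. The last token is the outcome,
    // all preceding tokens are the context. Blank lines yield null.
    static ParsedLine parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] parts = trimmed.split("\\s+");
        String outcome = parts[parts.length - 1];
        String[] context = Arrays.copyOf(parts, parts.length - 1);
        return new ParsedLine(outcome, context);
    }

    public static void main(String[] args) {
        ParsedLine p = parse("home=arsenal Henry=true arsenal_won_previous arsenal ");
        System.out.println(p.outcome + " <- " + Arrays.toString(p.context));
    }
}
```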

Hope this helps.

+0

I had already solved the problem, but thanks for the comprehensive answer!

+0

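Do you also know how to pass both text features and numeric features? That is, if I want to pass, for example, a real-valued vector as features, how do I tell the system to interpret certain values as numbers?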

+0

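(Sorry, it's been a while...) I'm not sure OpenNLP handles numeric features. Have you considered using logistic regression for classification with numeric values? – HowYaDoing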
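
One possible workaround for the numeric-feature question, which the thread itself does not confirm, is to discretize each number into a named bin and pass the result as an ordinary string feature in the same format as the data file above. The sketch below is plain Java; the class `BinningSketch`, the method `bin`, the feature name `goals_per_game` and the thresholds are all assumptions chosen for illustration.

```java
// Hypothetical helper, not part of OpenNLP: turns a number into a string feature.
class BinningSketch {

    // Maps a numeric value to a feature string such as "goals_per_game=mid".
    // upperBounds must be sorted ascending; labels needs one more entry than
    // upperBounds, the last label covering everything at or above the final bound.
    static String bin(String name, double value, double[] upperBounds, String[] labels) {
        if (labels.length != upperBounds.length + 1) {
            throw new IllegalArgumentException("need exactly one more label than bounds");
        }
        for (int i = 0; i < upperBounds.length; i++) {
            if (value < upperBounds[i]) {
                return name + "=" + labels[i];
            }
        }
        return name + "=" + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        double[] bounds = {1.0, 2.0};
        String[] labels = {"low", "mid", "high"};
        // The resulting token can be written into a training line like any other feature.
        System.out.println(bin("goals_per_game", 1.7, bounds, labels));
    }
}
```

Binning discards the exact magnitude of the value, so the choice of thresholds matters and is worth checking against held-out data.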