2017-10-12 273 views

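Add training data to an existing model (bin file)

I am trying to add extra training data to my nl-personTest.bin file with OpenNLP. The problem is that when I run my code to add the extra training data, it deletes the data that was already there and keeps only my new data.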

How can I add extra training data instead of replacing what is there?

I used the following code (taken from "Open NLP NER is not properly trained"):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNames {

    public static void main(String[] args) {
        System.out.println(train("nl", "person", "namen.txt", "nl-ner-personTest.bin"));
    }

    public static String train(String lang, String entity, InputStreamFactory inputStream, FileOutputStream modelStream) {
        Charset charset = StandardCharsets.UTF_8;
        TokenNameFinderModel model;
        // Train a brand-new model from the samples; the stream is closed automatically.
        try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(inputStream, charset))) {
            model = NameFinderME.train(lang, entity, sampleStream,
                    TrainingParameters.defaultParams(), new TokenNameFinderFactory());
        } catch (IOException io) {
            io.printStackTrace();
            return "Something goes wrong with training module.";
        }
        // Serialize the new model; whatever modelStream points at is overwritten.
        try (BufferedOutputStream modelOut = new BufferedOutputStream(modelStream)) {
            model.serialize(modelOut);
        } catch (IOException io) {
            io.printStackTrace();
            return "Something goes wrong with training module.";
        }
        return "Model trained successfully.";
    }

    public static String train(String lang, String entity, String taggedCorpusFile,
            String modelFile) {
        // Open a fresh stream on every call so the data can be read more than once.
        InputStreamFactory inputStream = () -> new FileInputStream(taggedCorpusFile);
        try (FileOutputStream modelStream = new FileOutputStream(modelFile)) {
            return train(lang, entity, inputStream, modelStream);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "Something goes wrong with training module.";
    }
}

Does anyone have an idea how to solve this?

According to the documentation, I need at least 15K sentences if I want an accurate training set.

Answer

I don't think OpenNLP supports extending an existing binary NLP model.

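If you have all of the training data available, collect it and train on everything in one go. You can use a SequenceInputStream for that. I modified your example to use a different InputStreamFactory: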

public String train(String lang, String entity, InputStreamFactory inputStream, FileOutputStream modelStream) { 

    // .... 
    try { 
     ObjectStream<String> lineStream = new PlainTextByLineStream(trainingDataInputStreamFactory(Arrays.asList(
       new File("trainingdata1.txt"), 
       new File("trainingdata2.txt"), 
       new File("trainingdata3.txt") 
     )), charset); 

     // ... 
    } 

    // ... 
} 

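private InputStreamFactory trainingDataInputStreamFactory(List<File> trainingFiles) {
    return new InputStreamFactory() {
        @Override
        public InputStream createInputStream() throws IOException {
            // Open a fresh stream per file on every call, so the data can be re-read.
            List<InputStream> inputStreams = trainingFiles.stream()
                    .map(f -> {
                        try {
                            return new FileInputStream(f);
                        } catch (FileNotFoundException e) {
                            // A missing file is logged and skipped, not treated as fatal.
                            e.printStackTrace();
                            return null;
                        }
                    })
                    .filter(Objects::nonNull)
                    .collect(Collectors.toList());

            // Chain the streams so they read as one file, in list order.
            // Each file should end with a line break, or two sentences will merge.
            return new SequenceInputStream(new Vector<>(inputStreams).elements());
        }
    };
}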
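
One thing to watch with plain concatenation: if a training file does not end with a line break, its last sentence runs into the first sentence of the next file. Below is a minimal sketch of one way around that, using only the JDK. `ConcatenatedTrainingData` and its `open` method are names I made up for illustration and are not part of OpenNLP. It also fails fast when a file is missing instead of skipping it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): presents several files as one stream. */
final class ConcatenatedTrainingData {

    private ConcatenatedTrainingData() {}

    /**
     * Opens every file in list order and joins them into a single InputStream.
     * A line break is appended after each file so the last sentence of one file
     * is never glued to the first sentence of the next.
     * Throws an IOException if any file cannot be opened.
     */
    static InputStream open(List<Path> trainingFiles) throws IOException {
        List<InputStream> parts = new ArrayList<>();
        try {
            for (Path file : trainingFiles) {
                parts.add(Files.newInputStream(file));
                parts.add(new ByteArrayInputStream("\n".getBytes(StandardCharsets.UTF_8)));
            }
        } catch (IOException e) {
            // Close whatever was opened before the failure, then rethrow.
            for (InputStream part : parts) {
                part.close();
            }
            throw e;
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }
}
```

To plug this into the training code, something like `InputStreamFactory factory = () -> ConcatenatedTrainingData.open(files);` should work, since InputStreamFactory has a single method. Note that a file which already ends with a line break will be followed by an empty line; as far as I know the name finder format reads an empty line as a document boundary, which seems acceptable between separate files, but check that against your data.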

Thanks, @Schrieveslaach – Patrick

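@Patrick, just for your information: I am developing a toolset that helps you create NLP models from annotated corpora. Please have a look [here](https://git.noc.fh-aachen.de/marc.schreiber/Towards-Effective-NLP-Application-Development), and let me know if you have any questions. ;-) – Schrieveslaach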

Thanks, I'll take a look at it, @Schrieveslaach – Patrick