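I am trying to train the Stanford NER classifier to recognize specific things in a database of text. I created a new .prop file and a training file, and I do get results, but they are the same default results I would get if I ran the classifier without any training. What can I do to fix this? Stanford NER won't use my training file and uses its default settings instead.
Here is my code: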
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.StringUtils;
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

public class NLP_train {
    public static void main(String[] args) throws IOException {
        Properties props = StringUtils.propFileToProperties("C:/Users/Admin/Desktop/trainingfile.prop");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // read some text in the text variable
        File inputFile = new File("C:/Users/Admin/Desktop/target.txt");
        // create an empty Annotation just with the given text
        Annotation document = new Annotation(IOUtils.slurpFileNoExceptions(inputFile));
        // run all Annotators on this text
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println(String.format("Print: word: [%s] pos: [%s] ne: [%s]", word, pos, ne));
            }
        }
    }
}
Here is my .prop file:
trainFile = C:/Users/Admin/Desktop/trainingfile.tsv
serializeTo = C:/Users/Admin/Desktop/ner-model.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
useDisjunctive = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
# the next 4 deal with word shape features
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
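One thing that may be worth checking here: the keys in this file (trainFile, serializeTo, and the feature flags) are the ones a CRFClassifier training run reads, whereas a StanfordCoreNLP pipeline chooses its annotators and NER model through keys such as annotators and ner.model. Below is a minimal sketch, using only java.util.Properties, that tells the two kinds of file apart. The class and method names (PropFileCheck, looksLikeTrainingConfig, pointsPipelineAtModel) are made up for illustration and are not part of Stanford NLP.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropFileCheck {
    // True if the properties carry the training keys seen in the .prop file above.
    static boolean looksLikeTrainingConfig(Properties p) {
        return p.containsKey("trainFile") && p.containsKey("serializeTo");
    }

    // True if the properties name both the annotators and an NER model,
    // which is how a pipeline would be pointed at a custom model.
    static boolean pointsPipelineAtModel(Properties p) {
        return p.containsKey("annotators") && p.containsKey("ner.model");
    }

    // Parses properties from text so the sketch needs no files on disk.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties training = parse(
            "trainFile = C:/Users/Admin/Desktop/trainingfile.tsv\n"
            + "serializeTo = C:/Users/Admin/Desktop/ner-model.ser.gz\n"
            + "map = word=0,answer=1\n");
        System.out.println("training config: " + looksLikeTrainingConfig(training));
        System.out.println("pipeline points at a model: " + pointsPipelineAtModel(training));
    }
}
```

If the second check comes back false for the file passed to new StanfordCoreNLP(props), the pipeline has not been told about any custom model, which would be consistent with seeing only default NER output.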
And an excerpt of my training file:
the	0
Type	RADAR
347G	RADAR
``	0
Rice	0
Bowl	0
''	0
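Since the .prop file declares map = word=0,answer=1, each non-blank line of the training file should have the token in column 0 and the label in column 1. A rough sketch for sanity-checking that layout is shown below, assuming the columns are tab-separated. TrainingFileCheck and malformedLines are hypothetical names used only for this example, and the sample lines in main are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrainingFileCheck {
    // Returns the 1-based numbers of lines that are not "token<TAB>label".
    // Blank lines are skipped rather than reported.
    static List<Integer> malformedLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue;
            }
            // Limit -1 keeps trailing empty columns so "Bowl<TAB>" is caught.
            String[] cols = line.split("\t", -1);
            if (cols.length != 2 || cols[0].isEmpty() || cols[1].isEmpty()) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "the\t0",
            "Type\tRADAR",
            "347G RADAR",   // space instead of tab: reported as malformed
            "Rice\t0");
        System.out.println("malformed lines: " + malformedLines(sample));
    }
}
```

An empty result only says the file has the two-column shape; it says nothing about whether the labels themselves are what the trainer expects.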