
Spark Stanford parser out of memory

I am using StanfordCoreNLP 2.4.1 on Spark 1.5 to parse Chinese sentences, but I ran into a Java heap OOM exception. The code is shown below:

import edu.stanford.nlp.parser.lexparser.LexicalizedParser

val modelpath = "edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz"
val lp = LexicalizedParser.loadModel(modelpath)

// Build (id, Array[(word, tag)]) records from tab-separated input lines;
// word:tag pairs within a record are separated by the control character '\1'.
val dataWords = data.map { x =>
  val tokens = x.split("\t")
  val id = tokens(0)
  val word_seg = tokens(2)
  val comm_words = word_seg.split("\1")
    .filter(_.split(":").length == 2)
    .map(y => (y.split(":")(0), y.split(":")(1)))
  (id, comm_words)
}.filter(_._2.nonEmpty)
// Split each record's word sequence into sentences at punctuation marks
// (tag "34"), dropping the punctuation tokens themselves.
val dataSenSlice = dataWords.map { x =>
  val id = x._1
  val comm_words = x._2
  // Indices of punctuation tokens, padded with the first and last positions.
  val punctuationIndex = Array(0) ++
    comm_words.zipWithIndex.filter(_._1._2 == "34").map(_._2) ++
    Array(comm_words.length - 1)
  // Adjacent index pairs delimit candidate sentences.
  val senIndex = (punctuationIndex zip punctuationIndex.tail).filter(z => z._1 != z._2)
  val senSlice = senIndex.map { z =>
    val begin = if (z._1 > 0) z._1 + 1 else z._1
    val end = if (z._2 == comm_words.length - 1) z._2 + 1 else z._2
    val words = comm_words.slice(begin, end).filter(_._2 != "34")
    if (words.nonEmpty) words.map(_._1).mkString(" ").trim else ""
  }.filter(l => l.nonEmpty && l.length < 20) // keep only non-empty, short sentences
  (id, senSlice)
}.filter(_._2.nonEmpty)
// Parse each sentence of a record and join the parse results with '\1'.
val dataPoint = dataSenSlice.map { x =>
  val id = x._1
  val senSlice = x._2
  val senParse = senSlice.map { y =>
    StanfordNLPParser.senParse(lp, y) // Java code wrapping the sentence parser
  }
  id + "\t" + senParse.mkString("\1")
}
dataPoint.saveAsTextFile(PARSED_MERGED_POI)

The sentences I feed to the parser are segmented words joined by spaces.
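
For reference, a minimal standalone sketch of what one parse call looks like under that convention (StanfordNLPParser.senParse is our own Java wrapper, so this sketch calls LexicalizedParser directly; the sentence is a made-up example, and the exact API may differ slightly between parser versions):

import edu.stanford.nlp.ling.Sentence
import edu.stanford.nlp.parser.lexparser.LexicalizedParser

// Load the Chinese grammar once.
val lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz")

// A pre-segmented sentence: words joined by single spaces (made-up example).
val segmented = "我 爱 北京"

// Wrap the tokens and parse; the parser returns an edu.stanford.nlp.trees.Tree.
val tree = lp.apply(Sentence.toCoreLabelList(segmented.split(" "): _*))
println(tree.pennString())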

The exception I ran into is:

17/08/09 10:28:15 WARN TaskSetManager: Lost task 1062.0 in stage 0.0 (TID 1219, rz-data-hdp-dn15004.rz.******.com): java.lang.OutOfMemoryError: GC overhead limit exceeded 
at java.util.regex.Pattern.union(Pattern.java:5149) 
at java.util.regex.Pattern.clazz(Pattern.java:2513) 
at java.util.regex.Pattern.sequence(Pattern.java:2030) 
at java.util.regex.Pattern.expr(Pattern.java:1964) 
at java.util.regex.Pattern.compile(Pattern.java:1665) 
at java.util.regex.Pattern.<init>(Pattern.java:1337) 
at java.util.regex.Pattern.compile(Pattern.java:1022) 
at java.util.regex.Pattern.matches(Pattern.java:1128) 
at java.lang.String.matches(String.java:2063) 
at edu.stanford.nlp.parser.lexparser.ChineseUnknownWordModel.score(ChineseUnknownWordModel.java:97) 
at edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel.score(BaseUnknownWordModel.java:124) 
at edu.stanford.nlp.parser.lexparser.ChineseLexicon.score(ChineseLexicon.java:54) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1602) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1634) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.extractBestParse(ExhaustivePCFGParser.java:1635) 

我想知道如果我用正确的方式做句子解析,或者一些其他的东西是错误的。

Answer


Suggestions:

  1. Increase the number of partitions, for example (see the sketch after this list):

     data.repartition(500)

This reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them, and it always shuffles all data over the network. With more partitions, each task holds fewer records in memory at once, which reduces heap pressure on each executor.

  2. Increase executor and driver memory, for example by adding these 'spark-submit' parameters:

     --executor-memory 8G
     --driver-memory 4G
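
For illustration, a rough sketch of how both suggestions could be applied to the code above (the partition count and memory sizes are just the example values from this answer; setting memory through SparkConf is an alternative to the spark-submit flags and only takes effect before the SparkContext is created):

import org.apache.spark.{SparkConf, SparkContext}

// Programmatic equivalent of the spark-submit flags (example values).
// Note: in client mode, driver memory in particular must still come from
// spark-submit, since the driver JVM is already running by this point.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)

// Hypothetical input path; stands in for the question's `data` RDD.
val data = sc.textFile("hdfs:///path/to/input")

// Spread the records over more partitions before the heavy parsing stage,
// so each task parses fewer sentences at a time (full network shuffle).
val repartitioned = data.repartition(500)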
    
    

Problem solved, thank you very much! – guan