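I am giving my users an option to include stop words when filtering a body of text for n-gram frequency. When I compute n-gram frequencies with Lucene, the output contains underscores where the stop words were.
Ordinarily, stop words are removed using the following:
snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());
stopWords is set to the full list of words to include in or remove from the n-grams. this.getnGramLength() simply returns the current n-gram length, up to a maximum of three.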
If I include stop words when filtering the text "satellite is definitely falling to Earth" for trigrams, the output is:
No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1
However, if I do not include stop words for trigrams, the output is this:
No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1
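Why am I seeing underscores? I would have expected to see just the unigrams, "satellite falling", "falling earth" and "satellite falling earth". The word "definitely" is in the stop word set I am using.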
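One observation that may help narrow this down: the second listing is exactly what you would get if every stop word were replaced by a "_" placeholder before the n-grams were built, rather than being dropped. That would be consistent with the shingle step inserting a filler token wherever the stop filter left a position gap, though I have not confirmed that this is what happens internally. Below is a minimal plain-Java sketch of that model. It does not use Lucene at all, and the class and method names (ShingleGapSketch, maskStopWords, countNGrams) are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model of the observed output; this is NOT Lucene code.
public class ShingleGapSketch {
    // Placeholder that stands in for a removed stop word (assumption of this model).
    static final String FILLER = "_";

    // Replace each stop word with the placeholder instead of dropping it,
    // so the position it occupied is still visible to the n-gram step.
    static List<String> maskStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> masked = new ArrayList<String>();
        for (String token : tokens) {
            masked.add(stopWords.contains(token) ? FILLER : token);
        }
        return masked;
    }

    // Count every contiguous n-gram of length 1..maxLength.
    static Map<String, Integer> countNGrams(List<String> tokens, int maxLength) {
        Map<String, Integer> freq = new LinkedHashMap<String, Integer>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int len = 1; len <= maxLength && start + len <= tokens.size(); len++) {
                String key = String.join(" ", tokens.subList(start, start + len));
                freq.merge(key, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "satellite", "is", "definitely", "falling", "to", "earth");
        // Stop words inferred from the two listings above.
        Set<String> stopWords = new HashSet<String>(
                Arrays.asList("is", "definitely", "to"));

        Map<String, Integer> freq = countNGrams(maskStopWords(tokens, stopWords), 3);
        for (Map.Entry<String, Integer> entry : freq.entrySet()) {
            System.out.println("Key=" + entry.getKey() + ", Freq=" + entry.getValue());
        }
    }
}
```

With the masked token sequence "satellite _ _ falling _ earth", this model yields the same 13 keys as the second listing, including "_" with a frequency of 3. The order of the keys differs, since it depends on how the frequency map is iterated.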
I could just filter the underscores out of the results, but...
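For reference, the post-filtering workaround I have in mind would look roughly like the sketch below. It is plain Java applied to the frequency map after Lucene has run. The names FillerKeyFilter, containsFiller and dropFillerKeys are hypothetical, and it assumes that "_" never occurs as a real token in the text.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Post-processing workaround: discard any n-gram key that contains the "_" placeholder.
public class FillerKeyFilter {
    // Assumption: "_" only ever appears as a placeholder, never as a real token.
    static final String FILLER = "_";

    // True if any space-separated token of the key is the placeholder.
    static boolean containsFiller(String key) {
        return Arrays.asList(key.split(" ")).contains(FILLER);
    }

    // Return a copy of the frequency map without placeholder-bearing keys.
    static Map<String, Integer> dropFillerKeys(Map<String, Integer> freq) {
        Map<String, Integer> cleaned = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, Integer> entry : freq.entrySet()) {
            if (!containsFiller(entry.getKey())) {
                cleaned.put(entry.getKey(), entry.getValue());
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // A few entries copied from the second listing above.
        Map<String, Integer> freq = new LinkedHashMap<String, Integer>();
        freq.put("satellite", 1);
        freq.put("falling _", 1);
        freq.put("satellite _ _", 1);
        freq.put("_", 3);
        freq.put("falling", 1);
        freq.put("earth", 1);

        System.out.println(dropFillerKeys(freq).keySet());
    }
}
```

Applied to the second listing, this leaves only "satellite", "falling" and "earth". It does not produce "satellite falling" or "falling earth", which is why filtering alone does not give me the output I expected.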