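Combining nltk.RegexpParser grammars

As my next step in learning more about NLP, I am trying to implement a simple heuristic that improves on plain n-gram results.

According to the Stanford collocations PDF linked below, passing "candidate phrases" through a part-of-speech filter that only lets through the patterns likely to be "phrases" yields better results than simply using the most frequently occurring bigrams. Source: Collocations, pages 143-144: https://nlp.stanford.edu/fsnlp/promo/colloc.pdf

The table on page 144 lists 7 tag patterns. In order, their NLTK POS tag equivalents are: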

JJ NN

NN NN

JJ JJ NN

JJ NN NN

NN JJ NN

NN NN NN

NN IN NN
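The filter described above can be sketched without NLTK, as a plain lookup of tag sequences over bigrams and trigrams. This is only an illustration under some assumptions: the tokens are tagged by hand, the helper names `normalize` and `candidate_phrases` are made up for this sketch, and collapsing NNS/NNP into NN is a simplification of mine, not something the PDF prescribes. Frequency counting is left out.

```python
# The seven tag patterns listed above, written as tuples of POS tags.
PATTERNS = {
    ("JJ", "NN"), ("NN", "NN"),
    ("JJ", "JJ", "NN"), ("JJ", "NN", "NN"), ("NN", "JJ", "NN"),
    ("NN", "NN", "NN"), ("NN", "IN", "NN"),
}

def normalize(tag):
    """Collapse noun and adjective variants (assumption: NNS -> NN, JJR -> JJ)."""
    if tag.startswith("NN"):
        return "NN"
    if tag.startswith("JJ"):
        return "JJ"
    return tag

def candidate_phrases(tagged_tokens):
    """Return the bigrams and trigrams whose tag sequence is in PATTERNS."""
    results = []
    for n in (2, 3):
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            tags = tuple(normalize(tag) for _, tag in window)
            if tags in PATTERNS:
                results.append(" ".join(word for word, _ in window))
    return results

# Hand-tagged sample (an assumption, not real tagger output).
tagged = [("the", "DT"), ("linear", "JJ"), ("function", "NN"), ("has", "VBZ"),
          ("two", "CD"), ("degrees", "NNS"), ("of", "IN"), ("freedom", "NN")]
print(candidate_phrases(tagged))  # ['linear function', 'degrees of freedom']
```

In a real pipeline the tagged tokens would come from `nltk.pos_tag`, and the surviving candidates would then be ranked by frequency.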

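In the code below, I get the desired results when I apply each of the grammars independently. However, when I try to combine the same grammars, I do not get the expected results.

In my code, you can see that I uncomment one sentence, uncomment one grammar, run it, and check the result.

I should be able to combine all of the sentences, run them through the combined grammar (only 3 of them in the code below), and get the desired results.

My question is: how do I correctly combine the grammars?

I assumed that combining grammars works like an "OR": find this pattern, or that pattern...

Thanks in advance.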

import nltk 

# The following sentences are correctly grouped with <JJ>*<NN>+. 
# Should see: 'linear function', 'regression coefficient', 'Gaussian random variable' and 
# 'cumulative distribution function' 
SampleSentence = "In mathematics, the term linear function refers to two distinct, although related, notions" 
#SampleSentence = "The regression coefficient is the slope of the line of the regression equation." 
#SampleSentence = "In probability theory, Gaussian random variable is a very common continuous probability distribution." 
#SampleSentence = "In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x." 

# The following sentences are correctly grouped with <NN.?>*<V.*>*<NN> 
# Should see 'mean squared error' and 'class probability function'. 
#SampleSentence = "In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, the difference between the estimator and what is estimated." 
#SampleSentence = "The class probability function is interesting" 

# The sentence below is correctly grouped with <NN.?>*<IN>*<NN.?>*. 
# should see 'degrees of freedom'. 
#SampleSentence = "In statistics, the degrees of freedom is the number of values in the final calculation of a statistic that are free to vary." 

SampleSentence = SampleSentence.lower() 

print("\nFull sentence: ", SampleSentence, "\n") 

tokens = nltk.word_tokenize(SampleSentence) 
textTokens = nltk.Text(tokens)  

# Determine the POS tags. 
POStagList = nltk.pos_tag(textTokens)  

# The following grammars work well *independently* 
grammar = "NP: {<JJ>*<NN>+}" 
#grammar = "NP: {<NN.?>*<V.*>*<NN>}"  
#grammar = "NP: {<NN.?>*<IN>*<NN.?>*}" 


# Merge several grammars above into a single one below. 
# Note that all 3 correct grammars above are included below. 

''' 
grammar = """ 
      NP: 
       {<JJ>*<NN>+} 
       {<NN.?>*<V.*>*<NN>} 
       {<NN.?>*<IN>*<NN.?>*} 
     """ 
''' 

cp = nltk.RegexpParser(grammar) 

result = cp.parse(POStagList) 

for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'): 
    print("NP Subtree:", subtree)
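One possible explanation for the merged grammar behaving differently: as far as I understand `nltk.RegexpParser`, the rules listed under one label are applied one after another rather than as alternatives, and a later rule only sees tokens that an earlier rule has not already chunked. The sketch below tries to show that with hand-written tags (an assumption, so that no tagger model is needed); `np_chunks` is a helper name made up for this example.

```python
import nltk

# Hand-written tags (an assumption) so the example does not depend on a tagger.
tagged = [("the", "DT"), ("mean", "NN"), ("squared", "VBD"), ("error", "NN")]

def np_chunks(grammar, tagged_tokens):
    """Return the NP chunks produced by the grammar as plain strings."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

# The second grammar from the question, on its own.
single = "NP: {<NN.?>*<V.*>*<NN>}"

# The same rule stacked after the first grammar from the question.
stacked = """
NP:
    {<JJ>*<NN>+}
    {<NN.?>*<V.*>*<NN>}
"""

print(np_chunks(single, tagged))   # ['mean squared error']
print(np_chunks(stacked, tagged))  # ['mean', 'error']
```

In the stacked case the first rule has already wrapped "mean" and "error" in their own chunks, so the second rule has nothing left to join. Reordering the rules, or folding them into one pattern, changes the outcome.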

Source

2017-10-08 RandomTask

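Help me understand a bit more: you don't want to write 3 separate lines such as grammar = """ NP: {<JJ>*<NN>+} {<NN.?>*<V.*>*<NN>} {<NN.?>*<IN>*<NN.?>*} """, and instead you need a single-line regex pattern that can accommodate all 3 patterns?

Hi Rahul. I would like to combine the 3 regex patterns in some way so that together they produce the same results they produce individually. I have no preference as to how it is written, whether on 1, 2, 3 or more lines. I will try the code below over the next few days. Thanks. – RandomTask

Sure, go ahead! I have tried it on multiple scenarios and it worked. Try it and come back with any other questions.

If my comment describes what you are looking for, then the answer is below: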

grammar = """ 
      NP: 
       {<JJ>*<NN.?>*<V.|IN>*<NN.?>*}"""
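To check which tag sequences a merged pattern like this covers without running a tagger, one option is to approximate the tag pattern with Python's `re` module. This is a rough sketch, not NLTK's own code: `tag_pattern_to_regex` and `covers` are hypothetical helpers, and the tag sequences are written by hand. It relies on `.` inside a tag pattern standing for a single tag character.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Roughly translate an NLTK-style tag pattern into a Python regex."""
    pattern = re.sub(r"\s+", "", tag_pattern)
    # Wrap each <...> in a group so quantifiers apply to the whole tag.
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # "." stands for any single character of a tag, never a bracket.
    pattern = pattern.replace(".", "[^{}<>]")
    return re.compile(pattern)

def covers(tag_pattern, tags):
    """True if the pattern matches the whole tag sequence."""
    tag_string = "".join("<%s>" % tag for tag in tags)
    return tag_pattern_to_regex(tag_pattern).fullmatch(tag_string) is not None

merged = "<JJ>*<NN.?>*<V.|IN>*<NN.?>*"
widened = "<JJ>*<NN.?>*<V.*|IN>*<NN.?>*"  # V.* instead of V.

print(covers(merged, ["JJ", "NN"]))          # True
print(covers(merged, ["NNS", "IN", "NN"]))   # True
print(covers(merged, ["NN", "VB", "NN"]))    # True
print(covers(merged, ["NN", "VBD", "NN"]))   # False
print(covers(widened, ["NN", "VBD", "NN"]))  # True
```

So if the tagger labels a word such as "squared" as VBD or VBN, `<V.>` would not cover it, because it only matches two-character tags like VB. Writing `<V.*|IN>` is one way to widen it, at the cost of admitting more verb forms into the chunk.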

Source

2017-10-17 09:07:39
