2016-03-05

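Discount value in Stupid Backoff

I am following the NLP tutorial here (6'58''), the part about the Stupid Backoff smoothing algorithm. Both the video tutorial and the implementation of bi-gram level stupid-backoff use a discount value of 0.4.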

Tutorial Slide

The bigram-level backoff implementation:

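import math  # at module level

def score(self, sentence):
    """Return the log Stupid Backoff score of a tokenized sentence (bigram level)."""
    score = 0.0
    previous = sentence[0]
    for token in sentence[1:]:
        bicount = self.bigramCounts[(previous, token)]
        bi_unicount = self.unigramCounts[previous]
        unicount = self.unigramCounts[token]
        if bicount > 0:
            # Seen bigram: log(count(previous, token) / count(previous))
            score += math.log(bicount)
            score -= math.log(bi_unicount)
        else:
            # Unseen bigram: back off to the add-one unigram estimate
            score += math.log(0.4)  # discount here
            score += math.log(unicount + 1)
            score -= math.log(self.total + self.vocab_size)
        previous = token
    return score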
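
To make the method above easy to experiment with, here is a minimal self-contained sketch of the same bigram scoring on a toy corpus. The class name `BigramStupidBackoff`, the `discount` parameter, and the corpus are illustrative assumptions, not part of the tutorial code.

```python
import math
from collections import Counter


class BigramStupidBackoff:
    """Toy bigram model (hypothetical wrapper around the scoring logic above)."""

    def __init__(self, corpus, discount=0.4):
        self.discount = discount
        self.unigramCounts = Counter()
        self.bigramCounts = Counter()
        for sentence in corpus:
            self.unigramCounts.update(sentence)
            self.bigramCounts.update(zip(sentence, sentence[1:]))
        self.total = sum(self.unigramCounts.values())
        self.vocab_size = len(self.unigramCounts)

    def score(self, sentence):
        score = 0.0
        for previous, token in zip(sentence, sentence[1:]):
            bicount = self.bigramCounts[(previous, token)]
            if bicount > 0:
                # Seen bigram: relative frequency count(prev, tok) / count(prev).
                score += math.log(bicount) - math.log(self.unigramCounts[previous])
            else:
                # Unseen bigram: discounted add-one unigram estimate.
                score += math.log(self.discount)
                score += math.log(self.unigramCounts[token] + 1)
                score -= math.log(self.total + self.vocab_size)
        return score


corpus = [["<s>", "i", "like", "tea", "</s>"],
          ["<s>", "i", "like", "coffee", "</s>"]]
model = BigramStupidBackoff(corpus)
seen = model.score(["<s>", "i", "like", "tea", "</s>"])    # every bigram was observed
unseen = model.score(["<s>", "tea", "like", "i", "</s>"])  # every bigram backs off
```

Every backed-off bigram pays the extra `log(discount)` term, so the scrambled sentence scores lower than the observed one.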

But then, in the trigram-level implementation, the discount value is 1:

import math  # at module level

def score(self, sentence):
    """Return the log backoff score of a tokenized sentence (trigram level)."""
    score = 0.0
    fst = sentence[0]
    snd = sentence[1]
    for token in sentence[2:]:
        tricount = self.trigramCounts[(fst, snd, token)]
        tri_bicount = self.bigramCounts[(fst, snd)]
        bicount = self.bigramCounts[(snd, token)]
        bi_unicount = self.unigramCounts[snd]
        unicount = self.unigramCounts[token]
        if tricount > 0:
            # Seen trigram: log(count(fst, snd, token) / count(fst, snd))
            score += math.log(tricount)
            score -= math.log(tri_bicount)
        elif bicount > 0:
            # Back off to the bigram estimate
            score += math.log(bicount)  # no discount here
            score -= math.log(bi_unicount)
        else:
            # Back off to the add-one unigram estimate
            score += math.log(unicount + 1)  # no discount here
            score -= math.log(self.total + self.vocab_size)
        fst, snd = snd, token
    return score
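
For comparison, here is a hedged sketch of a trigram scorer that applies the discount once per backoff step, so that `discount=1.0` reproduces the undiscounted behaviour of the code above. The function names `build_counts` and `trigram_score` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def build_counts(corpus):
    """Count unigrams, bigrams and trigrams in a list of tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for s in corpus:
        uni.update(s)
        bi.update(zip(s, s[1:]))
        tri.update(zip(s, s[1:], s[2:]))
    return uni, bi, tri


def trigram_score(sentence, uni, bi, tri, discount=0.4):
    """Log score with one log(discount) penalty per backoff step."""
    total = sum(uni.values())
    vocab_size = len(uni)
    log_d = math.log(discount)
    score = 0.0
    for fst, snd, token in zip(sentence, sentence[1:], sentence[2:]):
        if tri[(fst, snd, token)] > 0:
            score += math.log(tri[(fst, snd, token)]) - math.log(bi[(fst, snd)])
        elif bi[(snd, token)] > 0:
            score += log_d  # one backoff step: trigram -> bigram
            score += math.log(bi[(snd, token)]) - math.log(uni[snd])
        else:
            score += 2 * log_d  # two backoff steps: trigram -> bigram -> unigram
            score += math.log(uni[token] + 1) - math.log(total + vocab_size)
    return score


corpus = [["<s>", "i", "like", "tea", "</s>"],
          ["<s>", "i", "like", "coffee", "</s>"]]
uni, bi, tri = build_counts(corpus)
test = ["<s>", "i", "like", "coffee", "tea", "</s>"]
with_04 = trigram_score(test, uni, bi, tri, discount=0.4)
with_1 = trigram_score(test, uni, bi, tri, discount=1.0)
```

In this toy sentence one trigram backs off once and another backs off twice, so the two scores differ by exactly three `log(0.4)` terms. The discount only rescales scores; it is not fitted to the data.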

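When I run the project with the trigram-level discount set to 0.4 and to 1, the scores come out in this order:

tri-gram with discount = 0.4 < bi-gram with discount = 0.4 < tri-gram with discount = 1

It is easy to see why: with discount = 0.4, the final else branch of the trigram version becomes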

else:
    score += math.log(0.4)  # first backoff step: ln(0.4) ≈ -0.9163
    score += math.log(0.4)  # second backoff step: ln(0.4) ≈ -0.9163
    score += math.log(unicount + 1)
    score -= math.log(self.total + self.vocab_size)
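
A quick arithmetic check of that penalty, as a sketch: Python's `math.log` is the natural logarithm, so each `math.log(0.4)` term contributes about -0.9163, while -0.3979 is the base-10 value. The variable names are illustrative.

```python
import math

per_step = math.log(0.4)    # natural log: penalty for one backoff step
two_steps = 2 * per_step    # same as math.log(0.4 ** 2), i.e. math.log(0.16)
base10 = math.log10(0.4)    # base-10 value, for comparison only

print(round(per_step, 4), round(two_steps, 4), round(base10, 4))
# prints: -0.9163 -1.8326 -0.3979
```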

So I am really confused: where does the value 0.4 come from?

The 0.4 in Stupid Backoff? – user3639557

@user3639557 Yes, but I don't know why it is 0.4, or why they don't use this discount in the trigram example. – user3448806

It is quite arbitrary, which is why they call it Stupid Backoff. Read the paper cited in the answer below. – user3639557
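
For reference, the Stupid Backoff score is usually written recursively as below. This follows the formulation in Brants et al. (2007); that this is the paper the comment refers to is an assumption. Here $f(\cdot)$ is a raw n-gram count, $N$ is the total number of training tokens, and $\alpha$ is the backoff factor, which that paper sets heuristically to 0.4.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

Because the recursion multiplies by $\alpha$ at every backoff step, a trigram that falls all the way back to the unigram picks up $\alpha^2$, which is $2\log\alpha$ in log space. The scores $S$ are not normalized probabilities.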

Answers