2014-09-21 90 views

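Basic concept: Naive Bayes algorithm for classification

I think I more or less understand Naive Bayes, but I have a few questions about implementing it for a simple binary text classification task.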

Assume that a document D_i is some subset of the vocabulary x_1, x_2, ..., x_n.

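There are two classes c_i that any document can fall into, and I want to compute P(c_i|D) for some input document D, which is proportional to P(D|c_i)P(c_i).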
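
For reference, here is how that proportionality is usually expanded under the naive (conditional independence) assumption. This is the standard textbook form rather than anything specific to this thread; the product is taken over the words x_j that occur in D.

```latex
P(c_i \mid D) \;\propto\; P(c_i)\, P(D \mid c_i) \;=\; P(c_i) \prod_{x_j \in D} P(x_j \mid c_i)
```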

I have three questions:

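  1. Is P(c_i) #docs in c_i / #total docs, or #words in c_i / #total words?
  2. Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
  3. Suppose an x_j does not exist in the training set. Do I give it a probability of 1 so that it doesn't alter the calculation?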

For example, let's say I have this training set:

training = [("hello world", "good"),
            ("bye world", "bad")]

So the classes would have:

good_class = {"hello": 1, "world": 1} 
bad_class = {"bye": 1, "world": 1}
all = {"hello": 1, "world": 2, "bye":1} 

So now, if I want to calculate the probability of a test string being good:

test1 = ["hello", "again"] 
p_good = sum(good_class.values())/sum(all.values()) 
p_hello_good = good_class["hello"]/all["hello"] 
p_again_good = 1 # because "again" doesn't exist in our training set 

p_test1_good = p_good * p_hello_good * p_again_good 

Answer


Since this question is too broad, I can only answer in a limited way:

1: Is P(c_i) #docs in c_i / #total docs, or #words in c_i / #total words?

P(c_i) = #docs in c_i / #total docs
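
A minimal sketch of the document-count prior, assuming the training data is a list of (text, label) pairs like the one in the question. The helper name `class_priors` is made up for illustration.

```python
from collections import Counter

def class_priors(training):
    """Estimate P(c) as (#docs labeled c) / (#total docs)."""
    label_counts = Counter(label for _text, label in training)
    total_docs = len(training)
    return {label: count / total_docs for label, count in label_counts.items()}

# Toy training set from the question.
training = [("hello world", "good"),
            ("bye world", "bad")]
priors = class_priors(training)
```

With one document per class, each class gets a prior of 1/2.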

2: Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
No. As @larsmans pointed out:

It is the number of occurrences of the word in the documents of that class,
divided by the total number of words in that class across the whole dataset.
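
As a rough sketch, the unsmoothed estimate P(x_j|c_i) = (#times x_j appears in class c_i) / (#total words in class c_i) could be computed like this. Whitespace tokenization and the helper names `word_counts_by_class` and `likelihood` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def word_counts_by_class(training):
    """Count how often each word occurs in the documents of each class."""
    counts = defaultdict(Counter)
    for text, label in training:
        counts[label].update(text.split())  # naive whitespace tokenization
    return counts

def likelihood(word, label, counts):
    """Unsmoothed P(word | label); label must be a class seen in training."""
    total_words_in_class = sum(counts[label].values())
    return counts[label][word] / total_words_in_class

training = [("hello world", "good"),
            ("bye world", "bad")]
counts = word_counts_by_class(training)
p_hello_good = likelihood("hello", "good", counts)
p_again_good = likelihood("again", "good", counts)
```

Note that a word never seen in a class gets probability 0 here, which wipes out the whole product. That is the problem smoothing addresses.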

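3: Suppose an x_j does not exist in the training set. Do I give it a probability of 1 so that it doesn't alter the calculation?

For that we have Laplace correction, or additive smoothing. It is applied as
p(x_j|c_i) = (#times x_j appears in c_i + 1) / (#total words in c_i + |V|),
where |V| is the vocabulary size. This neutralizes the effect of features that do not occur.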
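
Putting the pieces together, here is a hedged sketch of add-one (Laplace) smoothing applied to the question's test string. It assumes whitespace tokenization and takes |V| to be the number of distinct words in the training set. The function names `train`, `smoothed_likelihood` and `score` are illustrative, not from any library.

```python
from collections import Counter, defaultdict

def train(training):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in training:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counter in word_counts.values() for word in counter}
    return doc_counts, word_counts, vocab

def smoothed_likelihood(word, label, word_counts, vocab_size):
    """Add-one smoothed P(word | label)."""
    total_words_in_class = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total_words_in_class + vocab_size)

def score(words, label, doc_counts, word_counts, vocab):
    """Unnormalized P(label | words) = P(label) * product of P(word | label)."""
    p = doc_counts[label] / sum(doc_counts.values())
    for word in words:
        p *= smoothed_likelihood(word, label, word_counts, len(vocab))
    return p

training = [("hello world", "good"),
            ("bye world", "bad")]
doc_counts, word_counts, vocab = train(training)

test1 = ["hello", "again"]
p_test1_good = score(test1, "good", doc_counts, word_counts, vocab)
p_test1_bad = score(test1, "bad", doc_counts, word_counts, vocab)
```

Here |V| = 3 and each class has 2 words, so the score for "good" works out to 1/2 * 2/5 * 1/5 = 0.04 and the score for "bad" to 1/2 * 1/5 * 1/5 = 0.02. In this toy example the unseen word "again" contributes the same factor of 1/5 to both classes, so it does not change which class wins. For longer documents, summing log-probabilities instead of multiplying avoids floating-point underflow.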

No, P(xⱼ|cᵢ) is the frequency of xⱼ in class cᵢ, divided by the total number of terms in all documents of that class. – 2014-09-21 14:06:22


@larsmans Sorry, I didn't notice.... – Devavrata 2014-09-22 17:20:29