2016-04-28

I am working on a project with the NLTK toolkit. With the hardware I have, I can only run a classifier on small datasets, so I split the data into smaller chunks, ran a classifier object on each chunk, and stored each of these individual objects in a pickle file. How do I merge NaiveBayesClassifier objects in NLTK?

Now, for testing, I need the whole thing as a single object to get better results. So my question is: how can I merge these objects into one?

import pickle

objs = []

while True:
    try:
        f = open(picklename, "rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break

This does not work. It raises the error TypeError: 'NaiveBayesClassifier' object is not iterable.
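As a side note, the TypeError itself comes from extend(): pickle.load returns a single object per call, and extend() then tries to iterate over it. Using append() collects the objects (though that still leaves them as separate classifiers). A minimal sketch, with ints standing in for the equally non-iterable classifier objects and temporary files standing in for the pickle files:

```python
import os
import pickle
import tempfile

# Write a few pickle files, one object per file, as in the question.
# Ints stand in for the trained classifiers here -- like a
# NaiveBayesClassifier, an int is not iterable.
paths = []
for i in range(3):
    fd, path = tempfile.mkstemp(suffix=".pickle")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(i, f)
    paths.append(path)

# pickle.load returns ONE object per call, so extend() tries to
# iterate it and fails; append() is what collects the objects.
objs = []
for path in paths:
    with open(path, "rb") as f:
        objs.append(pickle.load(f))
    os.remove(path)

print(objs)  # three separate objects -- collected, but still unmerged
```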

The NaiveBayesClassifier code:

classifier = nltk.NaiveBayesClassifier.train(training_set) 

What does the code for 'NaiveBayesClassifier' look like? – Omid


@Omid It is from the toolkit. I have edited my question to show the classifier. – Arkham

Answer


I don't know the exact format of your data, but you cannot simply merge the classifiers. A Naive Bayes classifier stores probability distributions computed from the training data, and there is no way to merge those probability distributions without access to the original data.

If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html you can see that the classifier stores:

self._label_probdist = label_probdist 
self._feature_probdist = feature_probdist 

These are computed in the train method using relative frequency counts, e.g. P(L_1) = (# of L_1 labels in the training set) / (# of labels in the training set). To combine two training sets, you would need (# of L_1 in train1 + train2) / (# of labels in train1 + train2).
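As a toy illustration of that relative-frequency arithmetic (using collections.Counter as a stand-in for NLTK's FreqDist, which subclasses it; the counts are made up):

```python
from collections import Counter

# Label counts from two separately processed chunks of training data
chunk1 = Counter({"pos": 3, "neg": 1})   # 4 labels seen in chunk 1
chunk2 = Counter({"pos": 2, "neg": 4})   # 6 labels seen in chunk 2

combined = chunk1 + chunk2               # counts add element-wise

# Relative frequency of "pos" over ALL the data: (3 + 2) / (4 + 6)
p_pos = combined["pos"] / sum(combined.values())
print(p_pos)  # 0.5
```

This is exactly why the raw counts must be kept: two already-normalized probabilities cannot be averaged correctly without knowing the sizes of the sets they came from.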

However, Naive Bayes is not very hard to implement from scratch, especially if you follow the train source code in the link above. Here is an outline, based on the NaiveBayes source code:

  1. Store 'FreqDist' objects for the labels and features of each subset of the data.

    from collections import defaultdict
    from nltk.probability import FreqDist

    label_freqdist = FreqDist()
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    fnames = set()
    
    # Count up how many times each feature value occurred, given
    # the label and featurename.
    for featureset, label in labeled_featuresets:
        label_freqdist[label] += 1
        for fname, fval in featureset.items():
            # Increment freq(fval|label, fname)
            feature_freqdist[label, fname][fval] += 1
            # Record that fname can take the value fval.
            feature_values[fname].add(fval)
            # Keep a list of all feature names.
            fnames.add(fname)

    # If a feature didn't have a value given for an instance, then
    # we assume that it gets the implicit value 'None.' This loop
    # counts up the number of 'missing' feature values for each
    # (label,fname) pair, and increments the count of the fval
    # 'None' by that amount.
    for label in label_freqdist:
        num_samples = label_freqdist[label]
        for fname in fnames:
            count = feature_freqdist[label, fname].N()
            # Only add a None key when necessary, i.e. if there are
            # any samples with feature 'fname' missing.
            if num_samples - count > 0:
                feature_freqdist[label, fname][None] += num_samples - count
                feature_values[fname].add(None)
    # Use pickle to store label_freqdist, feature_freqdist, feature_values
    
  2. Combine those using their built-in addition (FreqDist subclasses Counter). This will give you the relative frequencies across all of the data.

    all_label_freqdist = FreqDist() 
    all_feature_freqdist = defaultdict(FreqDist) 
    all_feature_values = defaultdict(set) 
    
    for file in train_labels: 
        f = open(file,"rb") 
        all_label_freqdist += pickle.load(f) 
        f.close() 
    
    # Combine the default dicts for features similarly 
    
  3. Use an 'estimator' to create the probability distributions.

    from nltk.probability import ELEProbDist
    from nltk.classify import NaiveBayesClassifier

    # Pass the class itself: it is used as a callable on each FreqDist
    # below, just as in NaiveBayesClassifier.train (calling
    # ELEProbDist() with no freqdist argument would raise an error).
    estimator = ELEProbDist
    
    label_probdist = estimator(all_label_freqdist) 
    
    # Create the P(fval|label, fname) distribution 
    feature_probdist = {} 
    for ((label, fname), freqdist) in all_feature_freqdist.items(): 
        probdist = estimator(freqdist, bins=len(all_feature_values[fname])) 
        feature_probdist[label, fname] = probdist 
    
    classifier = NaiveBayesClassifier(label_probdist, feature_probdist) 
    

This classifier will combine the counts across all of your data and produce what you need.
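The three numbered steps above can be sketched end-to-end with standard-library stand-ins (Counter in place of FreqDist, the ELE smoothing formula written out by hand, and made-up chunk data), just to show the shape of the pipeline:

```python
import os
import pickle
import tempfile
from collections import Counter

# Two hypothetical chunks of labeled training data.
chunks = [
    ["pos", "pos", "neg"],          # labels seen in chunk 1
    ["pos", "neg", "neg", "neg"],   # labels seen in chunk 2
]

# Step 1: count each chunk separately and pickle the counts.
paths = []
for labels in chunks:
    fd, path = tempfile.mkstemp(suffix=".pickle")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(Counter(labels), f)
    paths.append(path)

# Step 2: load and sum the per-chunk counts.
all_label_freqdist = Counter()
for path in paths:
    with open(path, "rb") as f:
        all_label_freqdist += pickle.load(f)
    os.remove(path)

# Step 3: turn the combined counts into smoothed probabilities.
# ELE (expected likelihood estimation, as in ELEProbDist) adds 0.5
# to each count: P(label) = (c + 0.5) / (N + 0.5 * bins).
n = sum(all_label_freqdist.values())
bins = len(all_label_freqdist)
label_probs = {
    label: (c + 0.5) / (n + 0.5 * bins)
    for label, c in all_label_freqdist.items()
}
print(label_probs)  # pos: 3.5/8, neg: 4.5/8
```

The same pattern extends to the per-(label, fname) feature counts; only the bookkeeping (the nested defaultdicts) is more involved.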