2016-04-28

I am working on a project with the NLTK toolkit. With the hardware I have, I can only run a classifier on small datasets, so I split the data into smaller chunks, ran a classifier object on each chunk, and stored each of these individual objects in a pickle file. How do I merge NaiveBayesClassifier objects in NLTK?

Now, for testing, I need the whole thing as a single object to get better results. So my question is: how can I merge these objects into one?

import pickle

objs = []

while True:
    try:
        f = open(picklename, "rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break

This does not work. It raises the error TypeError: 'NaiveBayesClassifier' object is not iterable.
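As a side note, the TypeError itself comes from extend(): pickle.load returns a single object per call, and extend() then tries to iterate over it. Using append() collects the objects (though that still leaves them as separate classifiers). A minimal sketch, with ints standing in for the equally non-iterable classifier objects and temporary files standing in for the pickle files:

```python
import os
import pickle
import tempfile

# Write a few pickle files, one object per file, as in the question.
# Ints stand in for the trained classifiers here -- like a
# NaiveBayesClassifier, an int is not iterable.
paths = []
for i in range(3):
    fd, path = tempfile.mkstemp(suffix=".pickle")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(i, f)
    paths.append(path)

# pickle.load returns ONE object per call, so extend() tries to
# iterate it and fails; append() is what collects the objects.
objs = []
for path in paths:
    with open(path, "rb") as f:
        objs.append(pickle.load(f))
    os.remove(path)

print(objs)  # three separate objects -- collected, but still unmerged
```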

The NaiveBayesClassifier code:

classifier = nltk.NaiveBayesClassifier.train(training_set) 

What does the code for 'NaiveBayesClassifier' look like? – Omid


@Omid It is from the toolkit. I have edited my question to show the classifier. – Arkham

Answer


I don't know the exact format of your data, but you cannot simply merge the classifiers. A Naive Bayes classifier stores probability distributions computed from the training data, and there is no way to merge those probability distributions without access to the original data.

If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html you can see that the classifier stores:

self._label_probdist = label_probdist 
self._feature_probdist = feature_probdist 

These are computed in the train method using relative frequency counts, e.g. P(L_1) = (# of L_1 labels in the training set) / (# of labels in the training set). To combine two training sets, you would need (# of L_1 in train1 + train2) / (# of labels in train1 + train2).
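As a toy illustration of that relative-frequency arithmetic (using collections.Counter as a stand-in for NLTK's FreqDist, which subclasses it; the counts are made up):

```python
from collections import Counter

# Label counts from two separately processed chunks of training data
chunk1 = Counter({"pos": 3, "neg": 1})   # 4 labels seen in chunk 1
chunk2 = Counter({"pos": 2, "neg": 4})   # 6 labels seen in chunk 2

combined = chunk1 + chunk2               # counts add element-wise

# Relative frequency of "pos" over ALL the data: (3 + 2) / (4 + 6)
p_pos = combined["pos"] / sum(combined.values())
print(p_pos)  # 0.5
```

This is exactly why the raw counts must be kept: two already-normalized probabilities cannot be averaged correctly without knowing the sizes of the sets they came from.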

However, Naive Bayes is not very hard to implement from scratch, especially if you follow the train source code in the link above. Here is an outline, based on the NaiveBayes source code:

  1. Store 'FreqDist' objects for the labels and features of each subset of the data.

    from collections import defaultdict
    from nltk.probability import FreqDist

    label_freqdist = FreqDist()
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    fnames = set()
    
    # Count up how many times each feature value occurred, given
    # the label and featurename.
    for featureset, label in labeled_featuresets:
        label_freqdist[label] += 1
        for fname, fval in featureset.items():
            # Increment freq(fval|label, fname)
            feature_freqdist[label, fname][fval] += 1
            # Record that fname can take the value fval.
            feature_values[fname].add(fval)
            # Keep a list of all feature names.
            fnames.add(fname)

    # If a feature didn't have a value given for an instance, then
    # we assume that it gets the implicit value 'None.' This loop
    # counts up the number of 'missing' feature values for each
    # (label,fname) pair, and increments the count of the fval
    # 'None' by that amount.
    for label in label_freqdist:
        num_samples = label_freqdist[label]
        for fname in fnames:
            count = feature_freqdist[label, fname].N()
            # Only add a None key when necessary, i.e. if there are
            # any samples with feature 'fname' missing.
            if num_samples - count > 0:
                feature_freqdist[label, fname][None] += num_samples - count
                feature_values[fname].add(None)
    # Use pickle to store label_freqdist, feature_freqdist, feature_values
    
  2. Combine those using their built-in addition (FreqDist subclasses Counter). This will give you the relative frequencies across all of the data.

    all_label_freqdist = FreqDist() 
    all_feature_freqdist = defaultdict(FreqDist) 
    all_feature_values = defaultdict(set) 
    
    for file in train_labels: 
        f = open(file,"rb") 
        all_label_freqdist += pickle.load(f) 
        f.close() 
    
    # Combine the default dicts for features similarly 
    
  3. Use an 'estimator' to create the probability distributions.

    from nltk.probability import ELEProbDist
    from nltk.classify import NaiveBayesClassifier

    # Pass the class itself: it is used as a callable on each FreqDist
    # below, just as in NaiveBayesClassifier.train (calling
    # ELEProbDist() with no freqdist argument would raise an error).
    estimator = ELEProbDist
    
    label_probdist = estimator(all_label_freqdist) 
    
    # Create the P(fval|label, fname) distribution 
    feature_probdist = {} 
    for ((label, fname), freqdist) in all_feature_freqdist.items(): 
        probdist = estimator(freqdist, bins=len(all_feature_values[fname])) 
        feature_probdist[label, fname] = probdist 
    
    classifier = NaiveBayesClassifier(label_probdist, feature_probdist) 
    

This classifier will combine the counts across all of your data and produce what you need.
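The three numbered steps above can be sketched end-to-end with standard-library stand-ins (Counter in place of FreqDist, the ELE smoothing formula written out by hand, and made-up chunk data), just to show the shape of the pipeline:

```python
import os
import pickle
import tempfile
from collections import Counter

# Two hypothetical chunks of labeled training data.
chunks = [
    ["pos", "pos", "neg"],          # labels seen in chunk 1
    ["pos", "neg", "neg", "neg"],   # labels seen in chunk 2
]

# Step 1: count each chunk separately and pickle the counts.
paths = []
for labels in chunks:
    fd, path = tempfile.mkstemp(suffix=".pickle")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(Counter(labels), f)
    paths.append(path)

# Step 2: load and sum the per-chunk counts.
all_label_freqdist = Counter()
for path in paths:
    with open(path, "rb") as f:
        all_label_freqdist += pickle.load(f)
    os.remove(path)

# Step 3: turn the combined counts into smoothed probabilities.
# ELE (expected likelihood estimation, as in ELEProbDist) adds 0.5
# to each count: P(label) = (c + 0.5) / (N + 0.5 * bins).
n = sum(all_label_freqdist.values())
bins = len(all_label_freqdist)
label_probs = {
    label: (c + 0.5) / (n + 0.5 * bins)
    for label, c in all_label_freqdist.items()
}
print(label_probs)  # pos: 3.5/8, neg: 4.5/8
```

The same pattern extends to the per-(label, fname) feature counts; only the bookkeeping (the nested defaultdicts) is more involved.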