Python multiprocessing - text processing

I want to create a multiprocessing version of the text classification code that I found here (along with other cool stuff). I have attached the full code below.

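I have tried a few things. My first attempt used a lambda function, but that failed with a complaint that it could not be serialized (pickled), so I tried a stripped-down version of the original code: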

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

p = Pool(2)
negfeats = []
posfeats = []

for f in negids:
    words = movie_reviews.words(fileids=[f])
    negfeats = p.map(featx, words)  # not the same form as below - using for debugging

print len(negfeats)

Unfortunately, even this doesn't work. I get the following traceback:

File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map 
    return self.map_async(func, iterable, chunksize).get() 
File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get 
    raise self._value 
ZeroDivisionError: float division

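Any idea what I might be doing wrong? Should I be using pool.apply_async instead (that by itself doesn't seem to solve the problem, but perhaps I am barking up the wrong tree)?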

import collections 
import nltk.classify.util, nltk.metrics 
from nltk.classify import NaiveBayesClassifier 
from nltk.corpus import movie_reviews 

def evaluate_classifier(featx): 
    negids = movie_reviews.fileids('neg') 
    posids = movie_reviews.fileids('pos') 

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids] 
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids] 

    negcutoff = len(negfeats)*3/4 
    poscutoff = len(posfeats)*3/4 

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff] 
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:] 

    classifier = NaiveBayesClassifier.train(trainfeats) 
    refsets = collections.defaultdict(set) 
    testsets = collections.defaultdict(set) 

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats) 
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']) 
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']) 
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']) 
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']) 
    classifier.show_most_informative_features()
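
On the lambda attempt mentioned at the top: Pool.map has to pickle the function it sends to the worker processes, and a lambda cannot be pickled by the standard pickle module, whereas a function defined with def at module level can. A minimal sketch in Python 3 syntax follows; the can_pickle helper is a name made up for this illustration, and featx here is the simple bag-of-words version rather than whatever feature function is actually in use.

```python
import pickle


def featx(words):
    # Simple bag-of-words features, defined at module level so that
    # pickle can refer to the function by name.
    return dict((word, True) for word in words)


def can_pickle(obj):
    # Hypothetical helper for this illustration: report whether obj
    # survives pickle.dumps, which is what Pool.map needs for its function.
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    # A lambda has no importable name, so pickling it fails.
    print(can_pickle(lambda words: dict((w, True) for w in words)))
    # A named module-level function is pickled by reference.
    print(can_pickle(featx))
```

If the feature function needs extra arguments, a named module-level wrapper is one way to keep it usable with Pool.map.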

2010-06-20 malangi

Regarding your stripped-down version: are you using a different featx function from the one used in http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/?

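The exception most likely happens inside featx, and multiprocessing just re-raises it. It doesn't include the original traceback, though, which makes it rather unhelpful.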
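
One way to recover the lost traceback is to catch the exception inside the worker and re-raise it with the formatted traceback in the message. The sketch below is written for Python 3 and rests on assumptions: featx is an invented stand-in that divides by the number of words (so an empty document fails), and featx_reporting is a hypothetical wrapper name. Recent Python 3 releases already attach the worker's traceback to the re-raised exception, so this mainly matters on old versions such as the 2.6 shown in the question.

```python
import traceback
from multiprocessing import Pool


def featx(words):
    # Invented stand-in for the real feature extractor: it fails on an
    # empty document so that there is an error to report.
    return {"avg_len": sum(len(w) for w in words) / float(len(words))}


def featx_reporting(words):
    # Hypothetical wrapper: run featx in the worker and, on failure,
    # re-raise with the worker-side traceback embedded in the message.
    try:
        return featx(words)
    except Exception:
        raise RuntimeError(
            "featx failed on %r:\n%s" % (words, traceback.format_exc())
        )


if __name__ == "__main__":
    docs = [["good", "film"], []]  # the empty document triggers the error
    pool = Pool(2)
    try:
        pool.map(featx_reporting, docs)
    except RuntimeError as err:
        print(err)  # shows the failing input and the original traceback
    finally:
        pool.close()
        pool.join()
```

The wrapper has to be a module-level function, like featx itself, so that it can be pickled and sent to the workers.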

Try running it without pool.map() first (i.e. negfeats = [featx(x) for x in words]), or include the contents of featx so that it can be debugged.
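
A serial run of this kind can also report which input breaks the function. The following is only a sketch in Python 3 syntax: find_failures is a hypothetical helper, and featx is an invented stand-in that fails on empty input, not the asker's real function.

```python
def featx(words):
    # Invented stand-in for the real feature extractor; it raises
    # ZeroDivisionError when the document is empty.
    return {"avg_len": sum(len(w) for w in words) / float(len(words))}


def find_failures(func, items):
    # Call func on every item in the current process (no Pool involved)
    # and collect (index, item, exception) for each call that raises.
    failures = []
    for index, item in enumerate(items):
        try:
            func(item)
        except Exception as exc:
            failures.append((index, item, exc))
    return failures


if __name__ == "__main__":
    docs = [["good", "film"], [], ["dull"]]
    for index, item, exc in find_failures(featx, docs):
        print("item %d (%r) raised %r" % (index, item, exc))
```

Once the failing input is known, calling featx on it directly in the interpreter gives the complete traceback.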

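If that still doesn't help, please post the whole script you are working on in your original question (simplified as far as possible), so that others can run it and give a more direct answer. Note that the following snippet, adapted from your stripped-down version, actually works: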

from nltk.corpus import movie_reviews 
from multiprocessing import Pool 

def featx(words): 
    return dict([(word, True) for word in words]) 

if __name__ == "__main__": 
    negids = movie_reviews.fileids('neg') 
    posids = movie_reviews.fileids('pos') 

    p = Pool(2) 
    negfeats =[] 
    posfeats =[] 

    for f in negids:
        words = movie_reviews.words(fileids=[f])
        negfeats = p.map(featx, words)

    print len(negfeats)
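
The snippet above maps featx over the individual words of one file, which is useful for debugging but not the shape the full script needs, where each file yields one (features, label) pair. A possible way to parallelize per file is sketched below in Python 3 syntax. It rests on assumptions: CORPUS and words_for are invented stand-ins so that the example runs without NLTK (in the real script, words_for would wrap movie_reviews.words(fileids=[fileid])), and labeled_feats is a hypothetical helper name.

```python
from multiprocessing import Pool

# Invented stand-in for the movie_reviews corpus: file id -> list of words.
CORPUS = {
    "neg/1.txt": ["bad", "plot"],
    "neg/2.txt": ["dull", "film"],
    "pos/1.txt": ["great", "cast"],
}


def words_for(fileid):
    # Stand-in for movie_reviews.words(fileids=[fileid]).
    return CORPUS[fileid]


def featx(words):
    # Bag-of-words features, as in the snippet above.
    return dict((word, True) for word in words)


def labeled_feats(task):
    # Hypothetical helper: one task is a (fileid, label) pair, so each
    # worker call handles a whole file and returns (features, label).
    fileid, label = task
    return (featx(words_for(fileid)), label)


if __name__ == "__main__":
    negids = [f for f in sorted(CORPUS) if f.startswith("neg/")]
    posids = [f for f in sorted(CORPUS) if f.startswith("pos/")]
    with Pool(2) as pool:
        negfeats = pool.map(labeled_feats, [(f, "neg") for f in negids])
        posfeats = pool.map(labeled_feats, [(f, "pos") for f in posids])
    print(len(negfeats), len(posfeats))
```

Passing file ids rather than word lists keeps the data sent to each worker small; whether this is faster than the serial list comprehension depends on how expensive featx is compared with reading the files.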

2010-06-21 03:28:48

That is what I think the problem was - thanks a lot! – malangi 2010-06-21 16:29:14

"Try running it without pool.map() (e.g. negfeats = [featx(x) for x in words])". Thanks a lot. – Lavanya 2012-01-06 07:39:42

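Are you trying to parallelize classification, training, or both? You can probably parallelize the word counting and scoring fairly easily, but I'm not sure about feature extraction and training. For classification, I'd recommend execnet. I have had good results using it for parallel/distributed part-of-speech tagging.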

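The basic idea with execnet is that you train a single classifier, then send it to each execnet node. Next, you split the files up among the nodes and have each node classify the files it was given. The results are then sent back to the master node. I haven't tried pickling a classifier yet, so I don't know for sure whether this will work, but if a POS tagger can be pickled, I would assume the classifier can be too.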
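
The train-once, ship-to-workers, split-the-files, gather-results pattern described above can be sketched with the standard library alone. To be clear about the substitutions: this uses multiprocessing.Pool in place of execnet, and KeywordClassifier is a toy class invented for the illustration, not an NLTK classifier. The helper names init_worker and classify_doc are hypothetical as well. Python 3 syntax.

```python
import pickle
from multiprocessing import Pool


class KeywordClassifier(object):
    # Toy stand-in for a trained classifier: a document is "pos" if it
    # contains any of the known positive words, otherwise "neg".
    def __init__(self, pos_words):
        self.pos_words = set(pos_words)

    def classify(self, words):
        return "pos" if any(w in self.pos_words for w in words) else "neg"


_classifier = None  # each worker process holds its own unpickled copy


def init_worker(blob):
    # Runs once per worker: unpickle the classifier shipped by the master.
    global _classifier
    _classifier = pickle.loads(blob)


def classify_doc(words):
    # Runs in a worker for each document assigned to it.
    return _classifier.classify(words)


if __name__ == "__main__":
    # "Train" once in the master process, then pickle the result.
    blob = pickle.dumps(KeywordClassifier(["good", "great"]))
    docs = [["a", "good", "film"], ["dull", "plot"], ["great", "cast"]]
    pool = Pool(2, initializer=init_worker, initargs=(blob,))
    try:
        # The documents are split among the workers; the labels come
        # back to the master in the original order.
        labels = pool.map(classify_doc, docs)
    finally:
        pool.close()
        pool.join()
    print(labels)
```

Sending the pickled classifier once per worker through the initializer, rather than once per document, matters when the pickle is large.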

2010-06-20 22:52:31 Jacob

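I have just started experimenting with pickling, but the pickles get quite heavy (100 MB-ish). I'll try to see if I can get multiprocessing working somehow; otherwise execnet looks like the other option. I doubt training can be parallelized (easily), but as you say, the other bits and bobs shouldn't be difficult.. hopefully. BTW, thanks for the stuff on streamhacker, it's a treasure trove! – malangi 2010-06-20 23:06:37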
