2013-07-08

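Python AssertionError in nltk.ConditionalFreqDist

I am getting an error I don't understand when trying to run some Python code. I am working through the excellent NLTK textbook to learn the Natural Language Toolkit. When I tried the code below (Figure 2.1 from the book, adapted to my own data), I got the error shown further down.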

The code I ran:

import os, re, csv, string, operator 
import nltk 
from nltk.corpus import PlaintextCorpusReader 
dir = '/Dropbox/hearings' 

corpus_root = dir 
text = PlaintextCorpusReader(corpus_root, ".*") 

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:3]) 
    for fileid in text.fileids() 
    for w in text.words(fileid) 
    for target in ['budget','appropriat'] 
    if w.lower().startswith(target)) 

cfd.plot() 

The error I received (full traceback):

In [6]: --------------------------------------------------------------------------- 
AssertionError       Traceback (most recent call last) 
<ipython-input-6-abc9ff8cb2f1> in <module>() 
----> 1 execfile(r'/Dropbox/hearings/hearings_ingest.py') # PYTHON-MODE 

/Dropbox/hearings/hearings_ingest.py in <module>() 
    14 cfd = nltk.ConditionalFreqDist(
    15  (target, fileid[:3]) 
---> 16  for fileid in text.fileids() 
    17  for w in text.words(fileid) 
    18  for target in ['budget','appropriat'] 

/Users/ian/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/nltk/probability.pyc in __init__(self, cond_samples) 
    1727   defaultdict.__init__(self, FreqDist) 
    1728   if cond_samples: 
-> 1729    for (cond, sample) in cond_samples: 
    1730     self[cond].inc(sample) 
    1731 

/Dropbox/hearings/hearings_ingest.py in <genexpr>((fileid,)) 
    15  (target, fileid[:3]) 
    16  for fileid in text.fileids() 
---> 17  for w in text.words(fileid) 
    18  for target in ['budget','appropriat'] 
    19  if w.lower().startswith(target)) 

/Users/ian/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/nltk/corpus/reader/util.pyc in iterate_from(self, start_tok) 
    341 
    342   # If we reach this point, then we should know our length. 
--> 343   assert self._len is not None 
    344 
    345  # Use concat for these, so we can use a ConcatenatedCorpusView 

AssertionError: 

In [7]: 

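I included the next IPython prompt to show that this is the complete error. (Reading other questions, I see that "AssertionError:" is often followed by more information; in my error it is blank.)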

I would appreciate any help understanding the error in my code. Thanks!

Answer


I can reproduce the error by creating an empty file, foo, and then calling text.words('foo'):

In [18]: !touch 'foo' 

In [19]: text = corpus.PlaintextCorpusReader('.', "foo") 

In [20]: text.words('foo') 
AssertionError: 

So, to skip empty files, you could do this:

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:3]) 
    for fileid in text.fileids() 
    # skip empty files; fileid is relative to corpus_root, so build the full path
    if os.path.getsize(os.path.join(corpus_root, fileid)) > 0
    for w in text.words(fileid) 
    for target in ['budget', 'appropriat'] 
    if w.lower().startswith(target)) 
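With thousands of files it may also help to see which ones are empty before building the distribution. The following is only a sketch using the standard library: `split_by_size` is a hypothetical helper name, and it assumes the fileids are paths relative to the corpus root, as `text.fileids()` returns them for a `PlaintextCorpusReader`.

```python
import os


def split_by_size(corpus_root, fileids):
    """Return (nonempty, empty) lists of fileids, judged by size on disk.

    Hypothetical helper: assumes each fileid is a path relative to corpus_root.
    """
    nonempty, empty = [], []
    for fileid in fileids:
        # Build the full path so the check works from any working directory.
        path = os.path.join(corpus_root, fileid)
        if os.path.getsize(path) > 0:
            nonempty.append(fileid)
        else:
            empty.append(fileid)
    return nonempty, empty
```

It could then be called as `nonempty, empty = split_by_size(corpus_root, text.fileids())`, looping over `nonempty` in the generator expression and inspecting `empty` to see which files were skipped.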

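Thank you very much! That did the trick. I am working with about 13,000 files and wrongly assumed that all of them had a nonzero file size. I suppose I should have thought of this once I saw that the error came from the assertion that len is not None.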


Right. Although the 'AssertionError' came with no message, the traceback was useful. – unutbu
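To illustrate why the message was blank: a bare `assert` with no second expression raises an `AssertionError` whose text is empty, so the traceback is the only clue. Here is a minimal sketch, in which `iterate_from` is a hypothetical stand-in for the NLTK method and not its real implementation:

```python
import sys
import traceback


def iterate_from(length):
    # Hypothetical stand-in: a bare assert, like the one in the traceback,
    # gives the AssertionError no message.
    assert length is not None
    return length


try:
    iterate_from(None)
except AssertionError as exc:
    message = str(exc)  # empty, because the assert has no message expression
    tb = sys.exc_info()[2]
    # Each entry is (filename, line number, function name, source text);
    # the last entry is the frame where the assertion failed.
    last_frame = traceback.extract_tb(tb)[-1]
    function_name = last_frame[2]
    print("message: %r, raised in: %s" % (message, function_name))
```

Writing the check as `assert length is not None, "length unknown"` would put that text after "AssertionError:" in the output.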