2014-12-03

Same python source code on two different machines produces different behavior

Two machines running Ubuntu 14.04.1. The same source code runs on the same data. One works fine; the other throws a codec decode 0xe2 error. Why is this? (And, more importantly, how do I fix it?)

The offending code seems to be:

def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = ''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'

    return Text(tokenized)

OK... I dropped into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The working machine is perfectly happy with the following:

>>> fh = open('in/train/legal/legal1a_lm_7.txt') 
>>> foo = fh.read() 
>>> fh.close() 
>>> sent_tokenize(foo) 

On the problematic machine, the same steps give the following UnicodeDecodeError traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
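For context, this is the classic Python 2 mixed str/unicode failure: when a library combines a byte string with a unicode string, Python 2 implicitly promotes the bytes using the ascii codec, and 0xe2 (the first byte of a UTF-8 curly quote) is outside ASCII's 0-127 range. A minimal sketch reproducing the same error without NLTK, assuming only the codec behavior:

```python
# Byte 0xe2 opens the UTF-8 encoding of a curly quotation mark; the
# ascii codec (which Python 2 uses for implicit coercion) rejects any
# byte >= 128, producing the same UnicodeDecodeError as the traceback.
raw = b'(\xe2\x80\x9cCisco\xe2\x80\x9d)'
try:
    raw.decode('ascii')  # what Python 2's implicit str->unicode coercion does
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print(err)
```

The position reported in the message differs from the traceback's only because this snippet decodes a shorter string.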

Breaking the input file down line by line (via split('\n')) and running each one through sent_tokenize brings us to the offending line:

If you have purchased these Services directly from Cisco Systems, Inc. (“Cisco”), this document is incorporated into your Master Services Agreement or equivalent services agreement (“MSA”) executed between you and Cisco. 

which is actually:

>>> bar[5] 
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.' 
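Those escape sequences are not garbage: \xe2\x80\x9c and \xe2\x80\x9d are the UTF-8 encodings of the curly quotation marks U+201C and U+201D around "Cisco". A quick check (pure Python, runs on both 2 and 3):

```python
# Decoding the raw bytes as UTF-8 recovers the curly quotes intact.
raw = b'(\xe2\x80\x9cCisco\xe2\x80\x9d)'
text = raw.decode('utf-8')
print(text == u'(\u201cCisco\u201d)')  # True
```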

Update: both machines show a UnicodeDecodeError for:

unicode(bar[5]) 

but only one machine shows an error for:

sent_tokenize(bar[5]) 

Please show us the code that raises the exception, the input data that triggers it, and the full traceback. – 2014-12-03 17:22:30


You still need to include the traceback and sample data. – 2014-12-03 17:31:57


Edited to add the code snippet. The whole project is in Tk, so I'll try to track down the traceback, but it may take some time. After looking at this code, I'm wondering whether changing the strings to unicode (u'' & u'\n') might not be a good idea... – dbl 2014-12-03 17:32:08

Answer


Different NLTK versions!

The version that doesn't barf is running NLTK 2.0.4; the one throwing the exception is on 3.0.0.

NLTK 2.0.4 is perfectly happy with:

sent_tokenize('(\xe2\x80\x9cCisco\xe2\x80\x9d)') 

NLTK 3.0.0 wants unicode (as @tdelaney pointed out in the comments above). So, to get results, you need:

sent_tokenize(u'(\u201cCisco\u201d)') 
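Under NLTK 3.x, then, the durable fix is to decode the file to unicode before tokenizing, rather than feeding raw byte strings around. A minimal sketch of that pattern, pure Python so it runs without NLTK (the file name is illustrative, and the sent_tokenize call itself is omitted):

```python
import io
import os
import tempfile

# Simulate the input file: its raw bytes contain UTF-8 curly quotes.
raw = b'(\xe2\x80\x9cCisco\xe2\x80\x9d)'
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'wb') as fh:
    fh.write(raw)

# io.open decodes on read, so `text` is unicode on both Python 2 and 3;
# this is the form NLTK 3.x's sent_tokenize expects to receive.
with io.open(path, encoding='utf-8') as fh:
    text = fh.read()

print(text == u'(\u201cCisco\u201d)')  # True
```

Replacing the plain `open(...)` / `fh.read()` in the question with `io.open(..., encoding='utf-8')` should make both machines behave the same way.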