2016-08-17 69 views
1

NLTK cannot find Ghostscript?

I am playing around with NLTK, and when I try to use the chunk module:

import nltk as nk
Sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter." 
tokens = nk.word_tokenize(Sentence) 
tagged = nk.pos_tag(tokens) 
entities = nk.chunk.ne_chunk(tagged) 

The code runs fine. When I type

>>> entities

I get the following error message:

Out[2]: Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
Traceback (most recent call last):

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\formatters.py", line 343, in __call__ 
return method() 

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\tree.py", line 726, in _repr_png_ 
subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] + 

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 602, in find_binary 
binary_names, url, verbose)) 

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 596, in find_binary_iter 
url, verbose): 

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 567, in find_file_iter 
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div)) 

LookupError: 

=========================================================================== 
NLTK was unable to find the gs file! 
Use software specific configuration paramaters or set the PATH environment variable. 
=========================================================================== 

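According to this post, the solution is to install Ghostscript, since the chunker is trying to use it to display a parse tree and is looking for one of 3 binaries:

file_names=['gs', 'gswin32c.exe', 'gswin64c.exe']

to use. But even though I installed Ghostscript and can now find the binary with a Windows search, I still get the same error.

What do I need to fix or update?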


Additional path information:

import os; print(os.environ['PATH'])

returns:

C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Program Files (x86)\Parallels\Parallels Tools\Applications;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Oracle\RPAS14.1\RpasServer\bin;C:\Oracle\RPAS14.1\RpasServer\applib;C:\Program Files (x86)\Java\jre7\bin;C:\Program Files (x86)\Java\jre7\bin\client;C:\Program Files (x86)\Java\jre7\lib;C:\Program Files (x86)\Java\jre7\jre\bin\client;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin; 

You might be missing the models. Try running 'import nltk; nltk.download('all', halt_on_error=False)', then rerun your script. – alvas


@alvas That did not solve it. –


Where did you install Ghostscript? Where is the Ghostscript .exe file located? – alvas

Answers

2

In short

Instead of >>> entities, do this:

>>> print(entities.__repr__())

Or:

>>> import os 
>>> from nltk import word_tokenize, pos_tag, ne_chunk 
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs 
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter." 
>>> entities = ne_chunk(pos_tag(word_tokenize(sent))) 
>>> entities 

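In long

The problem lies in trying to display the output of ne_chunk, which is an nltk.tree.Tree object: doing so fires Ghostscript to produce both the string and the drawn representation of the NE-tagged sentence. Ghostscript is required so that the widget can be used to visualize the tree.

Let's walk through this step by step.

First, when you use ne_chunk, you can import it directly at the top level like this: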

from nltk import ne_chunk 

And it is advisable to use explicit namespaces for your imports, i.e.:

from nltk import word_tokenize, pos_tag, ne_chunk 

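And when you use ne_chunk, it comes from https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py

It is unclear what kind of object the pickle loads, but after some inspection we find that there is only one built-in NE chunker that is not rule-based. Since the pickle binary's name contains maxent, we can assume it is a statistical chunker, so it most probably comes from the NEChunkParser object in https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py. There are ACE data API functions there as well, which fits the name of the pickle binary.

Now, whenever you call the ne_chunk function, it actually calls NEChunkParser.parse(), which returns an nltk.tree.Tree object: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118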

class NEChunkParser(ChunkParserI): 
    """ 
    Expected input: list of pos-tagged words 
    """ 
    def __init__(self, train): 
        self._train(train)

    def parse(self, tokens):
        """
        Each token should be a pos-tagged word
        """
        tagged = self._tagger.tag(tokens)
        tree = self._tagged_to_parse(tagged)
        return tree

    def _train(self, corpus):
        # Convert to tagged sequence
        corpus = [self._parse_to_tagged(s) for s in corpus]

        self._tagger = NEChunkParserTagger(train=corpus)

    def _tagged_to_parse(self, tagged_tokens):
        """
        Convert a list of tagged tokens to a chunk-parse tree.
        """
        sent = Tree('S', [])

        for (tok,tag) in tagged_tokens:
            if tag == 'O':
                sent.append(tok)
            elif tag.startswith('B-'):
                sent.append(Tree(tag[2:], [tok]))
            elif tag.startswith('I-'):
                if (sent and isinstance(sent[-1], Tree) and
                    sent[-1].label() == tag[2:]):
                    sent[-1].append(tok)
                else:
                    sent.append(Tree(tag[2:], [tok]))
        return sent
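
To make the B-/I-/O logic of _tagged_to_parse concrete, here is a minimal sketch that mirrors it with plain Python objects instead of nltk.tree.Tree, so it runs without NLTK installed. The names Chunk and iob_to_chunks and the example tags are assumptions made for illustration; they are not part of NLTK.

```python
from collections import namedtuple

# Hypothetical stand-in for a labelled subtree (NLTK uses Tree objects here).
Chunk = namedtuple("Chunk", ["label", "tokens"])


def iob_to_chunks(tagged_tokens):
    """Group (token, iob_tag) pairs into chunks, mirroring _tagged_to_parse.

    'O' tokens are kept as they are, 'B-X' starts a new chunk labelled X,
    and 'I-X' extends the previous chunk if it is labelled X, otherwise it
    starts a new chunk.
    """
    sent = []
    for tok, tag in tagged_tokens:
        if tag == "O":
            sent.append(tok)
        elif tag.startswith("B-"):
            sent.append(Chunk(tag[2:], [tok]))
        elif tag.startswith("I-"):
            if sent and isinstance(sent[-1], Chunk) and sent[-1].label == tag[2:]:
                sent[-1].tokens.append(tok)
            else:
                sent.append(Chunk(tag[2:], [tok]))
    return sent


# Made-up IOB tags for illustration; a trained tagger may assign different ones.
example = [
    (("Betty", "NNP"), "B-PERSON"),
    (("Botter", "NNP"), "I-PERSON"),
    (("bought", "VBD"), "O"),
]
result = iob_to_chunks(example)
print(result)
```

The returned object is only data; nothing in this step needs Ghostscript. The dependency appears later, when the tree is displayed.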

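If we take a look at the nltk.tree.Tree object, that is where the Ghostscript problem appears, when it tries to call the _repr_png_ function: https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702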

def _repr_png_(self): 
    """ 
    Draws and outputs in PNG for ipython. 
    PNG is used instead of PDF, since it can be displayed in the qt console and 
    has wider browser support. 
    """ 
    import os 
    import base64 
    import subprocess 
    import tempfile 
    from nltk.draw.tree import tree_to_treesegment 
    from nltk.draw.util import CanvasFrame 
    from nltk.internals import find_binary 
    _canvas_frame = CanvasFrame() 
    widget = tree_to_treesegment(_canvas_frame.canvas(), self) 
    _canvas_frame.add_widget(widget) 
    x, y, w, h = widget.bbox() 
    # print_to_file uses scrollregion to set the width and height of the pdf. 
    _canvas_frame.canvas()['scrollregion'] = (0, 0, w, h) 
    with tempfile.NamedTemporaryFile() as file: 
        in_path = '{0:}.ps'.format(file.name)
        out_path = '{0:}.png'.format(file.name)
        _canvas_frame.print_to_file(in_path)
        _canvas_frame.destroy_widget(widget)
        subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
                        '-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
                        .format(out_path, in_path).split())
        with open(out_path, 'rb') as sr:
            res = sr.read()
        os.remove(in_path)
        os.remove(out_path)
        return base64.b64encode(res).decode()

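But note that it is strange for the Python interpreter to fire _repr_png_ instead of __repr__ when you type >>> entities at the prompt (see Purpose of Python's __repr__). That cannot be how the native CPython interpreter works when it prints the representation of an object, so we take a look at IPython.core.formatters, and we see that it allows _repr_png_ to be fired at https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L725: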

class PNGFormatter(BaseFormatter): 
    """A PNG formatter. 
    To define the callables that compute the PNG representation of your 
    objects, define a :meth:`_repr_png_` method or use the :meth:`for_type` 
    or :meth:`for_type_by_name` methods to register functions that handle 
    this. 
    The return value of this formatter should be raw PNG data, *not* 
    base64 encoded. 
    """ 
    format_type = Unicode('image/png') 

    print_method = ObjectName('_repr_png_') 

    _return_type = (bytes, unicode_type) 

And we see that when IPython initializes a DisplayFormatter object, it tries to activate all formatters: https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66

def _formatters_default(self): 
    """Activate the default formatters.""" 
    formatter_classes = [ 
        PlainTextFormatter,
        HTMLFormatter,
        MarkdownFormatter,
        SVGFormatter,
        PNGFormatter,
        PDFFormatter,
        JPEGFormatter,
        LatexFormatter,
        JSONFormatter,
        JavascriptFormatter
    ]
    d = {}
    for cls in formatter_classes:
        f = cls(parent=self)
        d[f.format_type] = f
    return d 
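
The dispatch described above can be imitated without IPython. The sketch below is a simplified stand-in, not IPython's actual implementation: FakeTree and rich_display are hypothetical names, and the LookupError only imitates what NLTK raises when Ghostscript is missing. It shows why a rich front end can report an error from _repr_png_ while the plain repr still works.

```python
class FakeTree(object):
    """Hypothetical stand-in for nltk.tree.Tree with two representations."""

    def __repr__(self):
        return "Tree('S', [])"

    def _repr_png_(self):
        # Imitates NLTK failing to locate the Ghostscript binary.
        raise LookupError("NLTK was unable to find the gs file!")


def rich_display(obj, print_methods=("_repr_png_",)):
    """Collect representations the way a rich front end might.

    The plain-text repr is always computed. Each extra print method is
    tried if the object defines it; failures are collected instead of
    raised so the caller can see which representation broke.
    """
    results = {"text/plain": repr(obj)}
    errors = {}
    for name in print_methods:
        method = getattr(obj, name, None)
        if method is None:
            continue
        try:
            results[name] = method()
        except Exception as exc:
            errors[name] = exc
    return results, errors


tree = FakeTree()
results, errors = rich_display(tree)
print(results["text/plain"])                # the plain repr is unaffected
print(type(errors["_repr_png_"]).__name__)  # the PNG path is what failed
```

This matches the symptom in the question: the Out[2] line with the textual tree is printed, and the traceback comes only from the PNG representation.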

Note that outside of IPython, in the native CPython interpreter, only __repr__ is called and not _repr_png_:

>>> from nltk import ne_chunk 
>>> from nltk import word_tokenize, pos_tag, ne_chunk 
>>> Sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter." 
>>> sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter." 
>>> entities = ne_chunk(pos_tag(word_tokenize(sentence))) 
>>> entities 
Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')]) 

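So now the solution:

Solution 1

When printing out the string output of ne_chunk, you can use

>>> print(entities.__repr__())

instead of >>> entities. That way, IPython should explicitly call only __repr__ instead of calling all possible formatters.

Solution 2

If you really need to use _repr_png_ to visualize the Tree object, then we need to figure out how to add the Ghostscript binary to the NLTK environment variables.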

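In your case, it seems that the default nltk.internals is unable to find the binary. More specifically, we are referring to https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599

If we go back to https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726, we see that it is trying to look for

env_vars=['PATH']

And when NLTK initializes its environment variables, it looks in os.environ; see https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L495

Note that find_binary calls find_binary_iter, which calls find_file_iter, which tries to look for the env_vars by fetching os.environ.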
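
The lookup described above boils down to scanning every directory listed in PATH for one of the candidate file names. The following is a rough sketch of that idea, not NLTK's code: find_on_path is a hypothetical helper, and the candidate names are the ones that appear in the traceback.

```python
import os


def find_on_path(names, path_string=None):
    """Return the first existing file among `names` found in a PATH-style string.

    `path_string` defaults to os.environ['PATH']; directories are separated
    by os.pathsep (';' on Windows, ':' elsewhere). Returns None when no
    candidate is found.
    """
    if path_string is None:
        path_string = os.environ.get("PATH", "")
    for directory in path_string.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


gs_names = ["gs", "gswin32c.exe", "gswin64c.exe"]
print(find_on_path(gs_names))
```

If this prints None in the same session where the error occurs, the Ghostscript bin directory is probably not on the PATH that Python sees, even if the binary exists elsewhere on disk.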

So if we add the path to it:

>>> import os 
>>> from nltk import word_tokenize, pos_tag, ne_chunk 
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs 

Now this should work in IPython (Ghostscript can be downloaded from https://www.ghostscript.com/download/gsdnld.html):

>>> import os 
>>> from nltk import word_tokenize, pos_tag, ne_chunk 
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs 
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter." 
>>> entities = ne_chunk(pos_tag(word_tokenize(sent))) 
>>> entities 
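
One caveat about the path_to_gs literal: in an ordinary Python string, a backslash followed by b is the backspace escape, so a Windows path ending in \bin can silently turn into a different string. The sketch below uses a shortened, made-up path fragment to show the difference; a raw string or forward slashes avoid the problem.

```python
plain = "gs9.19\bin"     # "\b" is parsed as a single backspace character
raw = r"gs9.19\bin"      # raw string: the backslash is kept literally
forward = "gs9.19/bin"   # forward slashes are generally accepted on Windows too

print(repr(plain))            # shows the backspace as \x08
print(repr(raw))              # shows an escaped, literal backslash
print(len(plain), len(raw))   # the plain version is one character shorter
```

A directory name containing a backspace character does not exist on disk, so appending it to PATH has no useful effect and the lookup keeps failing.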

It worked after changing the forward slashes to backslashes in the path_to_gs assignment. Thanks –


Could you edit your question to add the output of 'os.environ['PATH']'? It would help others who run into the same problem in the future =) Thanks! – alvas

0

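Download gs.exe and add its path to the Environment Variables.

The path is probably located under

C:\Program Files\

(on my system it looks like "C:\Program Files\gs\gs9.21\bin")

To add it to the environment variables:

Control Panel -> System and Security -> System -> Advanced system settings -> Environment Variables -> (scroll down in System variables and double-click Path) ->

Then add the copied path

(in my case "C:\Program Files\gs\gs9.21\bin")

PS: Do not forget to add a semicolon (;) before pasting the path, and do not delete the existing entries when you put it there, or you may get into trouble and need to restore from a backup :)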
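
The same rule (append with a separator, never overwrite) applies if the variable is changed from Python rather than through the Control Panel. A minimal sketch, assuming a hypothetical helper append_to_path that operates on a PATH-style string; the example PATH value is shortened for illustration.

```python
import os


def append_to_path(path_string, new_dir, sep=os.pathsep):
    """Return path_string with new_dir appended, keeping existing entries.

    Empty entries (for example from a trailing separator) are dropped and
    new_dir is not added a second time if it is already present.
    """
    entries = [p for p in path_string.split(sep) if p]
    if new_dir not in entries:
        entries.append(new_dir)
    return sep.join(entries)


gs_dir = r"C:\Program Files\gs\gs9.21\bin"  # the location reported in this answer
updated = append_to_path(r"C:\WINDOWS\system32;C:\WINDOWS;", gs_dir, sep=";")
print(updated)
```

Assigning the result to os.environ['PATH'] would only affect the current Python process and the programs it launches; the system-wide setting still has to be changed as described above.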