In short:
Instead of >>> entities, do this:
>>> print(repr(entities))
Or:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities
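One detail worth checking when a Windows path like the one above is written as a Python string literal: in an ordinary (non-raw) literal, \b is the escape for a backspace character, so a directory named bin can be silently corrupted. The short sketch below only illustrates that behaviour; the fragment "gs9.19\bin" is borrowed from the path above purely as an example.

```python
# In an ordinary string literal, "\b" is the backspace character (0x08).
plain = "gs9.19\bin"
# In a raw string literal, the backslash is kept as a real backslash.
raw = r"gs9.19\bin"

print("\x08" in plain)  # True: the backslash and the "b" collapsed into one control character
print("\\" in raw)      # True: the raw string still contains a backslash
print(len(plain), len(raw))
```

Writing the path as a raw string, or with forward slashes or doubled backslashes, avoids the problem.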
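In long:
The problem is that you are trying to display the output of ne_chunk, which is an nltk.tree.Tree object. Displaying it fires ghostscript to produce both the string and the drawing representation of the NE-tagged sentence, and ghostscript is required for the widget that visualizes the tree.
Let's walk through this step by step.
First, when you use ne_chunk, you can import it directly from the top level:
from nltk import ne_chunk
It is advisable to import everything you need from that namespace in one statement, i.e.:
from nltk import word_tokenize, pos_tag, ne_chunk
When you use ne_chunk, it comes from https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py
It is not obvious what kind of object the pickle loads, but after some inspection we find that there is only one built-in NE chunker that is not rule-based. Since the name of the pickle binary says maxent, we can assume it is a statistical chunker, so it most probably comes from the NEChunkParser object in this file: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py
That module also contains ACE data API functions, which is consistent with the name of the pickle binary.
Now, whenever you call the ne_chunk function, it is actually calling the NEChunkParser.parse() function, which returns an nltk.tree.Tree object, and that Tree object is where the ghostscript problem appears: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118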
class NEChunkParser(ChunkParserI):
    """
    Expected input: list of pos-tagged words
    """
    def __init__(self, train):
        self._train(train)

    def parse(self, tokens):
        """
        Each token should be a pos-tagged word
        """
        tagged = self._tagger.tag(tokens)
        tree = self._tagged_to_parse(tagged)
        return tree

    def _train(self, corpus):
        # Convert to tagged sequence
        corpus = [self._parse_to_tagged(s) for s in corpus]
        self._tagger = NEChunkParserTagger(train=corpus)

    def _tagged_to_parse(self, tagged_tokens):
        """
        Convert a list of tagged tokens to a chunk-parse tree.
        """
        sent = Tree('S', [])
        for (tok,tag) in tagged_tokens:
            if tag == 'O':
                sent.append(tok)
            elif tag.startswith('B-'):
                sent.append(Tree(tag[2:], [tok]))
            elif tag.startswith('I-'):
                if (sent and isinstance(sent[-1], Tree) and
                    sent[-1].label() == tag[2:]):
                    sent[-1].append(tok)
                else:
                    sent.append(Tree(tag[2:], [tok]))
        return sent
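To make the B-/I-/O grouping in _tagged_to_parse easier to follow, here is a minimal sketch that runs without NLTK. Node is a hypothetical stand-in for nltk.tree.Tree (just a list with a label), tagged_to_tree mirrors the loop above, and the tags in the sample input are made up for illustration rather than produced by the real chunker.

```python
class Node(list):
    """Hypothetical stand-in for nltk.tree.Tree: a list that carries a label."""

    def __init__(self, label, children=()):
        super().__init__(children)
        self.label = label

    def __repr__(self):
        return "Node(%r, %s)" % (self.label, list.__repr__(self))


def tagged_to_tree(tagged_tokens):
    """Group (token, BIO-tag) pairs into a shallow tree."""
    sent = Node("S")
    for tok, tag in tagged_tokens:
        if tag == "O":
            # Outside any entity: the token hangs directly off the root.
            sent.append(tok)
        elif tag.startswith("B-"):
            # Beginning of an entity: open a new subtree.
            sent.append(Node(tag[2:], [tok]))
        elif tag.startswith("I-"):
            # Inside an entity: extend the previous subtree if the labels match.
            last = sent[-1] if sent else None
            if isinstance(last, Node) and last.label == tag[2:]:
                last.append(tok)
            else:
                sent.append(Node(tag[2:], [tok]))
    return sent


# Made-up tags, only to exercise all three branches.
tagged = [
    (("Betty", "NNP"), "B-PERSON"),
    (("Botter", "NNP"), "I-PERSON"),
    (("bought", "VBD"), "O"),
]
tree = tagged_to_tree(tagged)
print(tree)
```

The result is an ordinary nested structure; nothing in this step needs ghostscript, which only comes into play when the tree is drawn.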
The nltk.tree.Tree object needs ghostscript when its _repr_png_ function is called: https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702
def _repr_png_(self):
    """
    Draws and outputs in PNG for ipython.
    PNG is used instead of PDF, since it can be displayed in the qt console and
    has wider browser support.
    """
    import os
    import base64
    import subprocess
    import tempfile
    from nltk.draw.tree import tree_to_treesegment
    from nltk.draw.util import CanvasFrame
    from nltk.internals import find_binary
    _canvas_frame = CanvasFrame()
    widget = tree_to_treesegment(_canvas_frame.canvas(), self)
    _canvas_frame.add_widget(widget)
    x, y, w, h = widget.bbox()
    # print_to_file uses scrollregion to set the width and height of the pdf.
    _canvas_frame.canvas()['scrollregion'] = (0, 0, w, h)
    with tempfile.NamedTemporaryFile() as file:
        in_path = '{0:}.ps'.format(file.name)
        out_path = '{0:}.png'.format(file.name)
        _canvas_frame.print_to_file(in_path)
        _canvas_frame.destroy_widget(widget)
        subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
                        '-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
                        .format(out_path, in_path).split())
        with open(out_path, 'rb') as sr:
            res = sr.read()
        os.remove(in_path)
        os.remove(out_path)
        return base64.b64encode(res).decode()
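But note that it is odd for the Python interpreter to fire _repr_png_ instead of __repr__ when you type >>> entities at the prompt (see "Purpose of Python's __repr__"). That cannot be how the native CPython interpreter prints the representation of an object, so we take a look at IPython.core.formatters, and we see that it allows _repr_png_ to be fired at https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L725: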
class PNGFormatter(BaseFormatter):
    """A PNG formatter.
    To define the callables that compute the PNG representation of your
    objects, define a :meth:`_repr_png_` method or use the :meth:`for_type`
    or :meth:`for_type_by_name` methods to register functions that handle
    this.
    The return value of this formatter should be raw PNG data, *not*
    base64 encoded.
    """
    format_type = Unicode('image/png')
    print_method = ObjectName('_repr_png_')
    _return_type = (bytes, unicode_type)
And we see that when IPython initializes a DisplayFormatter object, it tries to activate all formatters: https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66
def _formatters_default(self):
    """Activate the default formatters."""
    formatter_classes = [
        PlainTextFormatter,
        HTMLFormatter,
        MarkdownFormatter,
        SVGFormatter,
        PNGFormatter,
        PDFFormatter,
        JPEGFormatter,
        LatexFormatter,
        JSONFormatter,
        JavascriptFormatter
    ]
    d = {}
    for cls in formatter_classes:
        f = cls(parent=self)
        d[f.format_type] = f
    return d
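The idea behind those formatters can be illustrated with a toy dispatcher. This is a deliberately simplified sketch, not IPython's actual code: FORMAT_METHODS, format_all, Plain and Drawable are all hypothetical names, and only two of the formats listed above are modelled.

```python
# Hypothetical mapping from MIME type to the method name a formatter looks for.
FORMAT_METHODS = {
    "text/plain": "__repr__",
    "image/png": "_repr_png_",
}


def format_all(obj):
    """Call every representation method the object defines, keyed by MIME type."""
    results = {}
    for mime, method_name in FORMAT_METHODS.items():
        method = getattr(obj, method_name, None)
        if callable(method):
            results[mime] = method()
    return results


class Plain:
    def __repr__(self):
        return "Plain()"


class Drawable:
    def __repr__(self):
        return "Drawable()"

    def _repr_png_(self):
        # A real implementation returns image data; NLTK's version shells out to gs here.
        return b"fake-png-bytes"


print(sorted(format_all(Plain())))
print(sorted(format_all(Drawable())))
```

An object that defines _repr_png_ gets that method called as soon as every formatter is consulted, which is why simply echoing a Tree in IPython ends up looking for ghostscript.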
Note that outside of IPython, in the native CPython interpreter, only __repr__ is called, not _repr_png_:
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sentence)))
>>> entities
Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
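So now the solutions:
Solution 1:
To print the string output of ne_chunk, you can use
>>> print(repr(entities))
instead of >>> entities. That way IPython should explicitly call only __repr__ instead of calling all possible formatters.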
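As a rough check of why Solution 1 sidesteps the error, the sketch below uses a hypothetical FakeTree class (not part of NLTK) whose _repr_png_ always fails, mimicking a missing gs binary. Asking for the plain representation explicitly never touches that method.

```python
class FakeTree:
    """Hypothetical stand-in for a Tree whose PNG rendering needs a missing binary."""

    def __init__(self, label):
        self.label = label
        self.png_calls = 0  # counts how often the PNG path was attempted

    def __repr__(self):
        return "FakeTree(%r)" % self.label

    def _repr_png_(self):
        self.png_calls += 1
        raise LookupError("gs binary not found")


tree = FakeTree("S")
text = repr(tree)      # only __repr__ runs here
print(text)
print(tree.png_calls)  # stays at 0 because _repr_png_ was never called
```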
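Solution 2:
If you really need _repr_png_ to visualize the Tree object, then we need to figure out how to make the ghostscript binary visible to the environment variables that NLTK checks.
In your case, it seems that the default nltk.internals lookup is unable to find the binary. More specifically, we are referring to https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599
If we go back to https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726, we see that it looks for the binary using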
env_vars=['PATH']
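When NLTK resolves its environment variables, it looks in os.environ, see https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L495
Note that find_binary calls find_binary_iter, which calls find_file_iter, which tries to look up the env_vars by fetching them from os.environ.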
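For intuition, a PATH-based lookup of this kind can be sketched in a few lines of standard-library Python. This is an assumption-laden simplification, not NLTK's implementation: find_on_path is a hypothetical helper, and the real code in nltk.internals handles more cases.

```python
import os


def find_on_path(binary_names, env=os.environ):
    """Return the first existing file from binary_names found in a PATH directory."""
    for directory in env.get("PATH", "").split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        for name in binary_names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


# Binary names taken from the _repr_png_ source above; the result depends on the machine.
print(find_on_path(["gs", "gswin32c.exe", "gswin64c.exe"]))
```

If this prints None, the directory holding the ghostscript executable is not on PATH for the current process, which is the situation the next step addresses.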
So, if we add the ghostscript directory to the path:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
Now this should work in IPython (ghostscript is available from https://www.ghostscript.com/download/gsdnld.html):
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities
You might be missing the models. Try running 'import nltk; nltk.download('all', halt_on_error=False)', then rerun your script. – alvas
@alvas That did not solve it. –
Where did you install ghostscript? Where is the ghostscript .exe file located? – alvas