2014-11-25 · 41 views

Map line breaks in sentences to another list

In a file I have text with random line breaks, like this:

Spencer J. Volk, president and CEO of this company, was elected a director. 
Mr. Volk, 55 years old, succeeds Duncan Dwight, 
who retired in September. 

I use NLTK's sentence tokenizer to find the sentences, then tag the part of speech of each word in those sentences. After tagging I get output like this (one list of (word, tag) tuples per sentence):

[('Spencer', u'NNP'), ('J.', u'NNP'), ('Volk', u'NNP'), ('president', u'NN'), ('and', u'CC'), ('CEO', u'NN'), ('of', u'IN'), ('this', u'DT'), ('company', u'NN'), ('was', u'VBD'), ('elected', u'VBN'), ('a', u'DT'), ('director', u'NN')] 

[('Mr.', u'NNP'), ('Volk', u'NNP'), ('55', u'CD'), ('years', u'NNS'), ('old', u'JJ'), ('succeeds', u'VBZ'), ('Duncan', u'NNP'), ('Dwight', u'NNP'), ('who', u'WP'), ('retired', u'VBD'), ('in', u'IN'), ('September', u'NNP')] 

But now I want to write the tags to another file with the same line breaks as in the original file I read the text from. For the example above, that would look like this:

NNP NNP NNP NN CC NN IN DT NN VBD VBN DT NN 
NNP NNP CD NNS JJ VBZ NNP NNP 
WP VBD IN NNP 

I can get the tags into this form and everything, but how do I map the line breaks of the original text onto the tag list?

One way to do this would be to split each sentence on \n, find the index of the break, hope that each split piece corresponds to a whole word in the sentence (which may not always be true), and then break the tag list at that index. That's more of a hack and will fail in many cases. What's a more robust way to achieve this?
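(Not from the original post, just a sketch of the counting idea:) one way to make the index approach less fragile is to tokenize each original line, record the token counts, and slice the flat tag sequence by those counts. Here a plain whitespace split stands in for nltk.word_tokenize, and the helper name `tags_per_line` is made up for illustration; the approach assumes tokenizing line by line yields the same token stream as tokenizing sentence by sentence.

```python
def tags_per_line(text, flat_tags, tokenize=str.split):
    # Slice a flat tag sequence into per-line groups by counting
    # how many tokens each original line contributes.
    groups = []
    start = 0
    for line in text.split("\n"):
        n = len(tokenize(line))  # tokens on this original line
        groups.append(flat_tags[start:start + n])
        start += n
    return groups

text = "a b c\nd e\nf"
flat = ["DT", "NN", "NN", "VB", "NN", "RB"]
print(tags_per_line(text, flat))
# [['DT', 'NN', 'NN'], ['VB', 'NN'], ['RB']]
```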

Why did you remove the punctuation? It's very useful. – alvas 2014-11-25 08:07:11

@alvas I didn't. The tokenizer I'm using did. – slider 2014-11-26 00:27:20

Answers


Fun puzzle. First, note that nltk.sent_tokenize() preserves newlines that fall inside a sentence:

sents = nltk.sent_tokenize(text) 
for s in sents: 
    print(repr(s)) 

So to interleave the newlines with the POS tags, you can walk the sentence one token at a time and check for a newline between tokens:

def process_sent(sent): 
    tagged = nltk.pos_tag(nltk.word_tokenize(sent)) 

    for word, tag in tagged: 
        pre, _, post = sent.partition(word) 
        if "\n" in pre: 
            print("\n", end="") 
        print(tag, end=" ") 
        sent = post  # advance to the next word 
    if "\n" in sent:  # trailing newline after the last word 
        print("\n", end="") 

I'm not quite sure why, but nltk.sent_tokenize() discards newlines that occur at sentence boundaries, so we need to look for those too. Fortunately we can use exactly the same algorithm: walk the full text one sentence at a time and check for newlines between sentences.

sents = nltk.sent_tokenize(text) 
for s in sents: 
    pre, _, post = text.partition(s) 
    if "\n" in pre: 
        print("\n", end="") 
    process_sent(s) 
    text = post  # advance to the next sentence -- this munges `text`, so bind another variable if that matters 

if "\n" in text:  # trailing newline after the last sentence 
    print("\n", end="") 

PS. This should do it, except that wherever the output should have several adjacent newlines, only one is printed. If you care about that, replace each if "\n" in pre: print("\n", end="") with a call to:

def nlretain(txt): 
    """Output as many newlines as there are in `txt`""" 
    print("\n"*txt.count("\n"), end="") 
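To see the partition walk in isolation, here is a self-contained version of the same idea, with a hand-made (word, tag) list standing in for nltk.pos_tag output and a string return value instead of printing (the helper name `interleave_newlines` is mine, not NLTK's):

```python
def interleave_newlines(sent, tagged):
    # Walk the (word, tag) pairs through `sent`, emitting a newline
    # wherever one appeared before the corresponding word.
    out = []
    for word, tag in tagged:
        pre, _, post = sent.partition(word)
        if "\n" in pre:
            out.append("\n")
        out.append(tag + " ")
        sent = post  # advance past this word
    if "\n" in sent:  # trailing newline after the last word
        out.append("\n")
    return "".join(out)

sent = "Mr. Volk succeeds\nDuncan Dwight"
tagged = [("Mr.", "NNP"), ("Volk", "NNP"), ("succeeds", "VBZ"),
          ("Duncan", "NNP"), ("Dwight", "NNP")]
print(interleave_newlines(sent, tagged))
# the newline reappears between VBZ and the second NNP
```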

Ignoring the line breaks, using sent_tokenize:

>>> from nltk import word_tokenize, pos_tag, sent_tokenize 
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September.""" 
>>> text = " ".join(i for i in text.split('\n')) 
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)] 
>>> for sent in tagged_text: 
...  poses = " ".join(pos for word, pos in sent) 
...  print poses 
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN . 
NNP NNP , CD NNS JJ , NNS NNP NNP , WP VBN IN NNP . 

Taking the line breaks into account:

>>> from nltk import word_tokenize, pos_tag 
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """ 
>>> 
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in text.split('\n')] 
>>> for sent in tagged_text: 
...  poses = " ".join(pos for word, pos in sent) 
...  print poses 
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN . 
NNP NNP , CD NNS JJ , NNS NNP NNP , 
WP VBN IN NNP . 

You'll notice the tagger produces the same tags even when it isn't handed a proper sentence. That's because the contextual information the POS tagger uses is weaker than a word's default tag, so whether you sent_tokenize first or just split on the non-sentence lines makes little difference.


If you want to sent_tokenize and then re-split the tags on \n as in the original text:

>>> from itertools import chain 
>>> from nltk import sent_tokenize, word_tokenize, pos_tag 
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """ 

>>> sent_lens = [len(word_tokenize(line)) for line in text.split('\n')] 
>>> sent_lens 
[16, 11, 5] 
>>> tagged_text = [[pos for word, pos in pos_tag(word_tokenize(sent))] for sent in sent_tokenize(text)] 
>>> flat_tags = list(chain(*tagged_text)) 
>>> start = 0 
>>> for l in sent_lens: 
...     for pos in flat_tags[start:start+l]: 
...         print pos, 
...     start = start + l 
...     print 
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN . 
NNP NNP , CD NNS JJ , NNS NNP NNP , 
WP VBN IN NNP . 
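The index bookkeeping above can also be written with an iterator, using itertools.islice to consume the flat tag stream in consecutive per-line chunks (a sketch independent of NLTK; `split_by_lengths` is a made-up helper name):

```python
from itertools import islice

def split_by_lengths(seq, lengths):
    # Consume `seq` in consecutive chunks of the given lengths.
    it = iter(seq)
    return [list(islice(it, n)) for n in lengths]

print(split_by_lengths(["NNP", "NNP", "CD", "NNS", "JJ", "VBZ"], [4, 2]))
# [['NNP', 'NNP', 'CD', 'NNS'], ['JJ', 'VBZ']]
```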
Well, in this case the tags turn out no different. But that's not the question. I need to run the tagger on whole sentences; splitting the text and running the tagger on each line is trivial. The crux of the question is how to achieve the second result when you run the tagger with sent_tokenize, as in the first example. – slider 2014-11-26 00:38:40