2014-11-25 · 41 views

Map line breaks in sentences to another list

In a file I have text with random line breaks, like this:

Spencer J. Volk, president and CEO of this company, was elected a director. 
Mr. Volk, 55 years old, succeeds Duncan Dwight, 
who retired in September. 

I use NLTK's sentence tokenizer to find the sentences, then tag the part of speech of each word in those sentences. After tagging I get output like this (one list of (word, tag) tuples per sentence):

[('Spencer', u'NNP'), ('J.', u'NNP'), ('Volk', u'NNP'), ('president', u'NN'), ('and', u'CC'), ('CEO', u'NN'), ('of', u'IN'), ('this', u'DT'), ('company', u'NN'), ('was', u'VBD'), ('elected', u'VBN'), ('a', u'DT'), ('director', u'NN')] 

[('Mr.', u'NNP'), ('Volk', u'NNP'), ('55', u'CD'), ('years', u'NNS'), ('old', u'JJ'), ('succeeds', u'VBZ'), ('Duncan', u'NNP'), ('Dwight', u'NNP'), ('who', u'WP'), ('retired', u'VBD'), ('in', u'IN'), ('September', u'NNP')] 

But now I want to write the tags to another file with the same line breaks as in the original file I read the text from. For the example above, that would look like this:

NNP NNP NNP NN CC NN IN DT NN VBD VBN DT NN 
NNP NNP CD NNS JJ VBZ NNP NNP 
WP VBD IN NNP 

I can get the tags into this form and everything, but how do I map the line breaks of the original text onto the tag list?

One way to do this would be to split each sentence on \n, find the index of the break, hope that each split piece corresponds to a whole word in the sentence (which may not always be true), and then break the tag list at that index. That's more of a hack and will fail in many cases. What's a more robust way to achieve this?
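(Not from the original post, just a sketch of the counting idea:) one way to make the index approach less fragile is to tokenize each original line, record the token counts, and slice the flat tag sequence by those counts. Here a plain whitespace split stands in for nltk.word_tokenize, and the helper name `tags_per_line` is made up for illustration; the approach assumes tokenizing line by line yields the same token stream as tokenizing sentence by sentence.

```python
def tags_per_line(text, flat_tags, tokenize=str.split):
    # Slice a flat tag sequence into per-line groups by counting
    # how many tokens each original line contributes.
    groups = []
    start = 0
    for line in text.split("\n"):
        n = len(tokenize(line))  # tokens on this original line
        groups.append(flat_tags[start:start + n])
        start += n
    return groups

text = "a b c\nd e\nf"
flat = ["DT", "NN", "NN", "VB", "NN", "RB"]
print(tags_per_line(text, flat))
# [['DT', 'NN', 'NN'], ['VB', 'NN'], ['RB']]
```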

Why did you remove the punctuation? It's very useful. – alvas 2014-11-25 08:07:11

@alvas I didn't. The tokenizer I'm using did. – slider 2014-11-26 00:27:20

Answers


Fun puzzle. First, note that nltk.sent_tokenize() preserves newlines that fall inside a sentence:

sents = nltk.sent_tokenize(text) 
for s in sents: 
    print(repr(s)) 

So to interleave the newlines with the POS tags, you can walk the sentence one token at a time and check for a newline between tokens:

def process_sent(sent): 
    tagged = nltk.pos_tag(nltk.word_tokenize(sent)) 

    for word, tag in tagged: 
        pre, _, post = sent.partition(word) 
        if "\n" in pre: 
            print("\n", end="") 
        print(tag, end=" ") 
        sent = post  # advance to the next word 
    if "\n" in sent:  # trailing newline after the last word 
        print("\n", end="") 

I'm not quite sure why, but nltk.sent_tokenize() discards newlines that occur at sentence boundaries, so we need to look for those too. Fortunately we can use exactly the same algorithm: walk the full text one sentence at a time and check for newlines between sentences.

sents = nltk.sent_tokenize(text) 
for s in sents: 
    pre, _, post = text.partition(s) 
    if "\n" in pre: 
        print("\n", end="") 
    process_sent(s) 
    text = post  # advance to the next sentence -- this munges `text`, so bind another variable if that matters 

if "\n" in text:  # trailing newline after the last sentence 
    print("\n", end="") 

PS. This should do it, except that wherever the output should have several adjacent newlines, only one is printed. If you care about that, replace each if "\n" in pre: print("\n", end="") with a call to:

def nlretain(txt): 
    """Output as many newlines as there are in `txt`""" 
    print("\n"*txt.count("\n"), end="") 
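To see the partition walk in isolation, here is a self-contained version of the same idea, with a hand-made (word, tag) list standing in for nltk.pos_tag output and a string return value instead of printing (the helper name `interleave_newlines` is mine, not NLTK's):

```python
def interleave_newlines(sent, tagged):
    # Walk the (word, tag) pairs through `sent`, emitting a newline
    # wherever one appeared before the corresponding word.
    out = []
    for word, tag in tagged:
        pre, _, post = sent.partition(word)
        if "\n" in pre:
            out.append("\n")
        out.append(tag + " ")
        sent = post  # advance past this word
    if "\n" in sent:  # trailing newline after the last word
        out.append("\n")
    return "".join(out)

sent = "Mr. Volk succeeds\nDuncan Dwight"
tagged = [("Mr.", "NNP"), ("Volk", "NNP"), ("succeeds", "VBZ"),
          ("Duncan", "NNP"), ("Dwight", "NNP")]
print(interleave_newlines(sent, tagged))
# the newline reappears between VBZ and the second NNP
```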

Ignoring the line breaks, using sent_tokenize:

>>> from nltk import word_tokenize, pos_tag, sent_tokenize 
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September.""" 
>>> text = " ".join(i for i in text.split('\n')) 
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)] 
>>> for sent in tagged_text: 
...  poses = " ".join(pos for word, pos in sent) 
...  print poses 
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN . 
NNP NNP , CD NNS JJ , NNS NNP NNP , WP VBN IN NNP . 

Taking the line breaks into account:

>>> from nltk import word_tokenize, pos_tag 
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """ 
>>> 
>>> tagged_text = [pos_tag(word_tokenize(sent)) for sent in text.split('\n')] 
>>> for sent in tagged_text: 
...  poses = " ".join(pos for word, pos in sent) 
...  print poses 
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN . 
NNP NNP , CD NNS JJ , NNS NNP NNP , 
WP VBN IN NNP . 

You'll notice the tagger produces the same tags even when it isn't handed a proper sentence. That's because the contextual information the POS tagger uses is weaker than a word's default tag, so whether you sent_tokenize first or just split on the non-sentence lines makes little difference.


If you want to sent_tokenize and then re-split the tags on \n as in the original text:

>>> from itertools import chain 
>>> from nltk import sent_tokenize, word_tokenize, pos_tag 
>>> text = """Spencer J. Volk, president and CEO of this company, was elected a director. 
... Mr. Volk, 55 years old, succeeds Duncan Dwight, 
... who retired in September. """ 

>>> sent_lens = [len(word_tokenize(line)) for line in text.split('\n')] 
>>> sent_lens 
[16, 11, 5] 
>>> tagged_text = [[pos for word, pos in pos_tag(word_tokenize(sent))] for sent in sent_tokenize(text)] 
>>> flat_tags = list(chain(*tagged_text)) 
>>> start = 0 
>>> for l in sent_lens: 
...     for pos in flat_tags[start:start+l]: 
...         print pos, 
...     start = start + l 
...     print 
... 
NN NNP NNP , NN CC NNP IN DT NN , VBD VBN DT NN . 
NNP NNP , CD NNS JJ , NNS NNP NNP , 
WP VBN IN NNP . 
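The index bookkeeping above can also be written with an iterator, using itertools.islice to consume the flat tag stream in consecutive per-line chunks (a sketch independent of NLTK; `split_by_lengths` is a made-up helper name):

```python
from itertools import islice

def split_by_lengths(seq, lengths):
    # Consume `seq` in consecutive chunks of the given lengths.
    it = iter(seq)
    return [list(islice(it, n)) for n in lengths]

print(split_by_lengths(["NNP", "NNP", "CD", "NNS", "JJ", "VBZ"], [4, 2]))
# [['NNP', 'NNP', 'CD', 'NNS'], ['JJ', 'VBZ']]
```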
Well, in this case the tags turn out no different. But that's not the question. I need to run the tagger on whole sentences; splitting the text and running the tagger on each line is trivial. The crux of the question is how to achieve the second result when you run the tagger with sent_tokenize, as in the first example. – slider 2014-11-26 00:38:40