Extracting sentences from a POS-tagged corpus with certain word/tag combinations

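I am working with the Brown corpus, specifically the tagged sentences in the "news" category. I found that "to" is the word with the most ambiguous word tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I am trying to write code that prints one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of their tags; they only need to contain "to" with one of the associated POS tags.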

import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if word == 'to' and tag == "IN":
            print(sent)

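I wanted to try the code above with just one of the POS tags to see whether it works, but it prints every instance. I need it to print only the first sentence in which the word matches the tag, and then stop. I tried this: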

for sent in brown_sents:
    for word, tag in sent:
        if word == 'to' and tag == 'IN':
            print(sent)
        if word != 'to' and tag != 'IN':
            break

This works for this POS tag because it is the first one that occurs with "to", but if I use:

for sent in brown_sents:
    for word, tag in sent:
        if word == 'to' and tag == 'TO-HL':
            print(sent)
        if word != 'to' and tag != 'TO-HL':
            break

It returns nothing. I think I am very close - care to help?

Source

2014-11-20 shannimcg

Added a supplementary answer after you changed your question. Hope it helps. – alvas 2014-11-21 17:00:49

You can keep building on your current code, but it does not take these cases into account:

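What happens if "to" occurs more than once in a sentence, with the same or with different POS tags?
If "to" occurs twice in a sentence with the same POS tag, do you want the sentence printed twice?
What if "to" is the first word of the sentence and is therefore capitalized?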
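To make these three cases concrete, here is a small sketch on a hand-made tagged sentence. The sentence and its tags are invented for illustration and are not taken from the Brown corpus; `tags_of_to` is a hypothetical helper name. It collects the tag of every occurrence of "to", ignoring case:

```python
# Hypothetical tagged sentence, invented for illustration (not from Brown).
toy_sent = [('To', 'TO'), ('go', 'VB'), ('to', 'IN'), ('town', 'NN'),
            ('is', 'BEZ'), ('to', 'TO'), ('travel', 'VB'), ('.', '.')]

def tags_of_to(sent):
    """Return the tag of every 'to' in the sentence, ignoring case."""
    return [tag for word, tag in sent if word.lower() == 'to']

to_tags = tags_of_to(toy_sent)
print(to_tags)
```

A single sentence can therefore contribute several tags, including repeats, and a check for the exact string 'to' would miss the capitalized first word.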

If you want to stick with your code, try this:

from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    # Yield the sentence once for each 'to' that carries the given POS tag.
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print(sent)

for sent in to_pos_sent('IN'):
    print(sent)
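If only the first matching sentence is wanted, a generator like the one above can be consumed with the built-in `next()` instead of a full `for` loop. The sketch below is a variant under stated assumptions: `toy_sents` is an invented stand-in for the Brown news sentences, the function takes the sentences as a parameter, and a `break` is added so that a sentence is not yielded twice.

```python
# Hypothetical stand-in for brown.tagged_sents(categories="news").
toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('leave', 'VB')],
    [('Back', 'RB'), ('to', 'IN'), ('work', 'NN')],
]

def to_pos_sent(sents, pos):
    """Yield each sentence that contains 'to' tagged as pos."""
    for sent in sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent
                break  # move on, so the same sentence is not yielded twice

# next() pulls only the first match; the default None is used if there is none.
first_in = next(to_pos_sent(toy_sents, 'IN'), None)
first_hl = next(to_pos_sent(toy_sents, 'TO-HL'), None)
print(first_in)
print(first_hl)
```

Because the generator is lazy, the remaining sentences are never scanned once the first match has been returned.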

I suggest storing the sentences in a defaultdict(list), so that you can retrieve them at any time.

from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)

to_counts = Counter()

for sent in brown.tagged_sents(categories='news'):
    # Check if 'to' is in the sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find each 'to'.
        for word, pos in sent:
            if word.lower() == 'to':
                # Store the sentence under the POS tag of this 'to'.
                sents_with_to[pos].append(sent)
                to_counts[pos] += 1

for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print(pos, sent)

To access the sentences for a specific POS tag:

for sent in sents_with_to['TO']:
    print(sent)

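You will notice that if "to" occurs twice in a sentence with the same POS tag, the sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try: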

sents_with_to_and_TO = set(" ".join("#".join([word, pos]) for word, pos in sent) for sent in sents_with_to['TO'])
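An alternative that keeps the sentences as (word, tag) pairs rather than flattening them into strings is to deduplicate on a hashable copy of each sentence. This is a minimal sketch on invented data, assuming each stored sentence is a list of (word, tag) tuples; `recorded` is a hypothetical stand-in for sents_with_to['TO'].

```python
# Hypothetical sentences, invented for illustration.
sent_a = [('to', 'TO'), ('be', 'BE'), ('or', 'CC'),
          ('not', '*'), ('to', 'TO'), ('be', 'BE')]
sent_b = [('wants', 'VBZ'), ('to', 'TO'), ('leave', 'VB')]

# sent_a contains 'to'/TO twice, so it would have been recorded twice.
recorded = [sent_a, sent_a, sent_b]

seen = set()
unique_sents = []
for sent in recorded:
    key = tuple(sent)  # lists are unhashable; a tuple of tuples is hashable
    if key not in seen:
        seen.add(key)
        unique_sents.append(sent)

print(len(recorded), len(unique_sents))
```

Unlike a plain set, this keeps the sentences in the order in which they were first seen.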

Source

2014-11-20 22:20:45 alvas

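Thanks @alvas - but isn't there some operator I can add to the end of my existing code so that it just prints the first example it encounters? The code you wrote works, but I know there is some simple addition to my existing code. This is driving me crazy! – shannimcg 2014-11-20 22:28:45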

Updated the answer, hope it helps. – alvas 2014-11-20 22:40:23

As an efficient way to loop, use 'yield' to return one sentence at a time, rather than 'return' to return all the sentences at once. – alvas 2014-11-20 22:44:50

As for why this does not work:

for sent in brown_sents:
    for word, tag in sent:
        if word == 'to' and tag == 'TO-HL':
            print(sent)
        if word != 'to' and tag != 'TO-HL':
            break

Before explaining: your code is not really close to the output you want, because your if statements are not doing what you need them to do.

First you need to understand what the multiple conditions (i.e. the "if" statements) are doing.

# Loop through the sentences
for sent in brown_sents:
    # Loop through each word with its POS
    for word, tag in sent:
        # For each word, check whether it is 'to' tagged as 'TO-HL'
        if word == 'to' and tag == 'TO-HL':
            print(sent)  # If the condition is true, print sent
        # After checking the first if, you continue to check the second if:
        # if word is not 'to' and tag is not 'TO-HL',
        # you break out of the sentence. Note that you are still
        # in the same iteration as the previous condition.
        if word != 'to' and tag != 'TO-HL':
            break

Now, let's start with some basic if-else statements:

>>> from nltk.corpus import brown 
>>> first_sent = brown.tagged_sents()[0] 
>>> first_sent 
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')] 
>>> for word, pos in first_sent: 
...  if word != 'to' and pos != 'TO-HL': 
...    break 
...  else: 
...    print('say hi')
... 
>>>

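In the example above, we loop through every word+POS pair in the sentence, and for EVERY word-POS pair the if condition checks whether the word is not 'to' and the POS is not 'TO-HL'. If that is the case, it breaks, so 'say hi' is never printed.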

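So with these if conditions your code never gets far into a sentence: it breaks as soon as it meets a word that is neither 'to' nor tagged with the POS you are looking for, which is usually the very first word.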

In effect, your if conditions are checking whether every single word is 'to' and whether its POS tag is 'TO-HL'.
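The two conditions can be compared directly on a few word/tag pairs. In this sketch the function names and the sample pairs are invented for illustration. It shows that the `and` condition used for the break is true for any ordinary word such as ('The', 'AT'), and that the logical negation of the first condition would use `or` instead (De Morgan's law). Neither version is a suitable break test, because both end the inner loop before a later "to" in the sentence can be reached.

```python
def is_target(word, tag):
    # The first condition: this pair is the one being searched for.
    return word == 'to' and tag == 'TO-HL'

def breaks_in_original(word, tag):
    # The condition used for the break in the question.
    return word != 'to' and tag != 'TO-HL'

def is_not_target(word, tag):
    # Logical negation of is_target: 'or', not 'and'.
    return word != 'to' or tag != 'TO-HL'

# Sample pairs, invented for illustration.
pairs = [('The', 'AT'), ('to', 'IN'), ('to', 'TO-HL'), ('To', 'TO-HL')]
for word, tag in pairs:
    print(word, tag, is_target(word, tag),
          breaks_in_original(word, tag), is_not_target(word, tag))
```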

What you want to do is check:

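(1) whether "to" is in the sentence, rather than whether every word is "to", and then
(2) whether the "to" in the sentence has the POS tag you are looking for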

So you need an if condition for condition (1):

>>> from nltk.corpus import brown 
>>> three_sents = brown.tagged_sents()[:3] 
>>> for sent in three_sents: 
...  if 'to' in dict(sent): 
...    print(sent)
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

Now you know that if 'to' in dict(sent) checks whether "to" is in the sentence.

Then, to check condition (2):

>>> for sent in three_sents: 
...  if 'to' in dict(sent): 
...    if dict(sent)['to'] == 'TO': 
...      print(sent)
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')] 
>>> for sent in three_sents: 
...  if 'to' in dict(sent): 
...    if dict(sent)['to'] == 'TO-HL': 
...      print(sent)
... 
>>>

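Now you can see that if dict(sent)['to'] == 'TO-HL', placed after if 'to' in dict(sent) has confirmed that the word is present, is the condition that checks the POS restriction.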

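However, you will realize that if a sentence contains two occurrences of 'to', dict(sent)['to'] only captures the POS of the last 'to'. That is why you need the defaultdict(list) suggested in the previous answer.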
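The loss of information is easy to reproduce on a small example. The sentence below is invented for illustration. Note that the `defaultdict(list)` here is keyed by word, which differs from the previous answer (keyed by POS tag); it is only meant to show that a list keeps every tag where a plain `dict` keeps only the last one.

```python
from collections import defaultdict

# Hypothetical tagged sentence with two differently tagged 'to'.
toy_sent = [('to', 'IN'), ('town', 'NN'), ('to', 'TO'), ('shop', 'VB')]

# dict() keeps one value per key, so the later ('to', 'TO') overwrites 'IN'.
last_tag = dict(toy_sent)['to']

# A defaultdict(list) keeps every tag seen for each word.
tags_by_word = defaultdict(list)
for word, tag in toy_sent:
    tags_by_word[word].append(tag)

print(last_tag)
print(tags_by_word['to'])
```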

There is really no clean way to perform these checks, and the most efficient way is the one described in the previous answer, sigh.
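For the original goal of showing one sentence per tag of "to", one possible sketch is to remember only the first sentence seen for each tag. The data is invented for illustration, and `toy_sents` is a hypothetical stand-in for the Brown news sentences.

```python
# Hypothetical stand-in for brown.tagged_sents(categories="news").
toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('leave', 'VB')],
    [('To', 'TO'), ('err', 'VB'), ('is', 'BEZ'), ('human', 'JJ')],
    [('Back', 'RB'), ('to', 'IN'), ('work', 'NN')],
]

first_sent_for_tag = {}
for sent in toy_sents:
    for word, tag in sent:
        if word.lower() == 'to':
            # setdefault stores the sentence only if the tag is new.
            first_sent_for_tag.setdefault(tag, sent)

for tag, sent in sorted(first_sent_for_tag.items()):
    print(tag, sent)
```

With the real corpus the loop would still visit every sentence; stopping early would require knowing the full set of tags in advance.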

Source

2014-11-21 16:44:47 alvas
