2016-12-13 60 views
1

我有以下列表:如何从元组列表中提取模式元组?

data = [('Mr', 'PROPN'), ('.', 'PUNCT'), ('William', 'PROPN'), ('Henry', 'PROPN'), ('Gates', 'PROPN'), (',', 'PUNCT'), ('III', 'NUM'), ('is', 'VERB'), ('Founder', 'PROPN'), ('and', 'CONJ'), ('Technology', 'PROPN'), ('Advisor', 'NOUN'), ('Director', 'NOUN'), ('of', 'ADP'), ('Microsoft', 'PROPN'), ('Corporation', 'PROPN'), ('a', 'DET'), ('cofounder', 'NOUN'), ('served', 'VERB'), ('as', 'ADP'), ('Chairman', 'PROPN'), ('from', 'ADP'), ('our', 'PRON'), ('incorporation', 'NOUN'), ('in', 'ADP'), ('1981', 'NUM'), ('until', 'ADP'), ('2014', 'NUM'), ('He', 'PRON'), ('currently', 'ADV'), ('acts', 'VERB'), ('Technical', 'ADJ'), ('to', 'ADP'), ('Nadella', 'NUM'), ('on', 'ADP'), ('key', 'ADJ'), ('development', 'NOUN'), ('projects', 'NOUN'), ('retired', 'VERB'), ('an', 'DET'), ('employee', 'NOUN'), ('2008', 'NUM'), ('Chief', 'NOUN'), ('Software', 'PROPN'), ('Architect', 'PROPN'), ('2000', 'NUM'), ('2006', 'NUM'), ('when', 'ADV'), ('he', 'PRON'), ('announced', 'VERB'), ('his', 'PRON'), ('two', 'NUM'), ('-', 'PUNCT'), ('year', 'NOUN'), ('plan', 'NOUN'), ('transition', 'VERB'), ('out', 'ADP'), ('day', 'NOUN'), ('full', 'ADJ'), ('time', 'NOUN'), ('role', 'NOUN'), ('Executive', 'PROPN'), ('Officer', 'PROPN'), ('resigned', 'VERB'), ('assumed', 'VERB'), ('the', 'DET'), ('position', 'NOUN'), ('As', 'ADP'), ('co', 'PROPN'), ('chair', 'NOUN'), ('Bill', 'NOUN'), ('&', 'CONJ'), ('Melinda', 'PROPN'), ('Foundation', 'PROPN'), ('shapes', 'NOUN'), ('approves', 'VERB'), ('grant', 'NOUN'), ('making', 'VERB'), ('strategies', 'NOUN'), ('advocates', 'NOUN'), ('for', 'ADP'), ('foundation’s', 'NUM'), ('issues', 'NOUN'), ('helps', 'VERB'), ('set', 'VERB'), ('overall', 'ADJ'), ('direction', 'NOUN'), ('organization', 'NOUN'), ('founder', 'NOUN'), ('’', 'NUM'), ('foresight', 'NOUN'), ('vision', 'NOUN'), ('personal', 'ADJ'), ('computing', 'NOUN'), ('have', 'AUX'), ('been', 'VERB'), ('central', 'ADJ'), ('success', 'NOUN'), ('software', 'NOUN'), ('industry', 'NOUN'), ('has', 'VERB'), ('unparalleled', 'ADJ'), ('knowledge', 'NOUN'), ('Company’s', 'NUM'), ('history', 'NOUN'), ('technologies', 'NOUN'), ('Company', 'NOUN'), ('its', 'PRON'), ('grew', 'VERB'), ('fledgling', 'ADJ'), ('business', 'NOUN'), ('into', 'ADP'), ('world’s', 'NUM'), ('leading', 'VERB'), ('company', 'NOUN'), ('process', 'NOUN'), ('creating', 'VERB'), ('one', 'NUM'), ('most', 'ADV'), ('prolific', 'ADJ'), ('sources', 'NOUN'), ('innovation', 'NOUN'), ('powerful', 'ADJ'), ('brands', 'NOUN'), ('through', 'ADP'), ('motion', 'NOUN'), ('technological', 'ADJ'), ('strategic', 'ADJ'), ('programs', 'NOUN'), ('that', 'DET'), ('are', 'VERB'), ('core', 'NOUN'), ('part', 'NOUN'), ('continues', 'VERB'), ('provide', 'VERB'), ('technical', 'ADJ'), ('input', 'NOUN'), ('evolution', 'NOUN'), ('productivity', 'NOUN'), ('platform', 'NOUN'), ('mobile', 'NOUN'), ('first', 'ADJ'), ('cloud', 'NOUN'), ('world', 'NOUN'), ('His', 'PRON'), ('work', 'NOUN'), ('overseeing', 'VERB'), ('provides', 'VERB'), ('global', 'ADJ'), ('insights', 'NOUN'), ('relevant', 'ADJ'), ('current', 'ADJ'), ('future', 'ADJ'), ('opportunities', 'NOUN'), ('keen', 'ADJ'), ('appreciation', 'NOUN'), ('stakeholder', 'ADJ'), ('interests', 'NOUN')] 

我想提取考虑每个元组的第二个元素三合一的模式。例如,假设我想提取所有具有元组之间'of'具有第二元素'NOUN''PROPN'元组:

[('Director', 'NOUN'), ('of', 'ADP'), ('Microsoft', 'PROPN')] 

因此,我的问题是我如何可以提取不使用正则表达式上面的图案? 。我不想使用正则表达式的原因是,我将开始以更多不同的方式提取元组。例如,元组具有作为第一值'world’s'其次'VERB''NOUN'

[('world’s', 'NUM'), ('leading', 'VERB'), ('company', 'NOUN')] 
+0

为什么没有正则表达式? –

+0

因为有时写正则表达式会让这个模式提取任务更加困难@ElliotRoberts –

+0

如果有多个,应该做什么? –

回答

1

比较快,但可能是不必要的紧凑型解决方案:

from itertools import chain 

# Generator of three-tuples matching requirements: 
# If `data` is large enough that temp `list`s are a problem, might be worth 
# using itertools.islice instead of shallow copy slices 
# or using enumerate with lookaround indexing 
matchtups = (((wd0, tp0), (wd1, tp1), (wd2, tp2)) 
      for (wd0, tp0), (wd1, tp1), (wd2, tp2) in zip(data, data[1:], data[2:]) 
      if wd1 == 'of' and tp0 == 'NOUN' and tp2 == 'PROPN') 

# Flatten out the three-tuple structure: 
results = list(chain.from_iterable(matchtups)) 
+0

谢谢,这是最有用的。使用类似于词法分析器或分析器的东西怎么样?......您认为哪种方法最适合完成此任务? –

+0

另外,如果有更多的模式,比如'[('Director','NOUN'),('','ADP'),('Microsoft','PROPN')]'',怎么办?这种方法是否可靠也能抓住它们? –

+1

@johndoe:我的意思是,如果它有两个集合,它们将全部出现在一个平坦的列表中,一组接一个匹配(生成器表达式生成离散的三元组,但我解压缩它们以匹配您的问题的请求输出) 。 – ShadowRanger

1

你可以遍历它:

data = [('Mr', 'PROPN'), ('.', 'PUNCT'), ('William', 'PROPN'), ('Henry', 'PROPN'), ('Gates', 'PROPN'), (',', 'PUNCT'), ('III', 'NUM'), ('is', 'VERB'), ('Founder', 'PROPN'), ('and', 'CONJ'), ('Technology', 'PROPN'), ('Advisor', 'NOUN'), ('Director', 'NOUN'), ('of', 'ADP'), ('Microsoft', 'PROPN'), ('Corporation', 'PROPN'), ('a', 'DET'), ('cofounder', 'NOUN'), ('served', 'VERB'), ('as', 'ADP'), ('Chairman', 'PROPN'), ('from', 'ADP'), ('our', 'PRON'), ('incorporation', 'NOUN'), ('in', 'ADP'), ('1981', 'NUM'), ('until', 'ADP'), ('2014', 'NUM'), ('He', 'PRON'), ('currently', 'ADV'), ('acts', 'VERB'), ('Technical', 'ADJ'), ('to', 'ADP'), ('Nadella', 'NUM'), ('on', 'ADP'), ('key', 'ADJ'), ('development', 'NOUN'), ('projects', 'NOUN'), ('retired', 'VERB'), ('an', 'DET'), ('employee', 'NOUN'), ('2008', 'NUM'), ('Chief', 'NOUN'), ('Software', 'PROPN'), ('Architect', 'PROPN'), ('2000', 'NUM'), ('2006', 'NUM'), ('when', 'ADV'), ('he', 'PRON'), ('announced', 'VERB'), ('his', 'PRON'), ('two', 'NUM'), ('-', 'PUNCT'), ('year', 'NOUN'), ('plan', 'NOUN'), ('transition', 'VERB'), ('out', 'ADP'), ('day', 'NOUN'), ('full', 'ADJ'), ('time', 'NOUN'), ('role', 'NOUN'), ('Executive', 'PROPN'), ('Officer', 'PROPN'), ('resigned', 'VERB'), ('assumed', 'VERB'), ('the', 'DET'), ('position', 'NOUN'), ('As', 'ADP'), ('co', 'PROPN'), ('chair', 'NOUN'), ('Bill', 'NOUN'), ('&', 'CONJ'), ('Melinda', 'PROPN'), ('Foundation', 'PROPN'), ('shapes', 'NOUN'), ('approves', 'VERB'), ('grant', 'NOUN'), ('making', 'VERB'), ('strategies', 'NOUN'), ('advocates', 'NOUN'), ('for', 'ADP'), ('foundation’s', 'NUM'), ('issues', 'NOUN'), ('helps', 'VERB'), ('set', 'VERB'), ('overall', 'ADJ'), ('direction', 'NOUN'), ('organization', 'NOUN'), ('founder', 'NOUN'), ('’', 'NUM'), ('foresight', 'NOUN'), ('vision', 'NOUN'), ('personal', 'ADJ'), ('computing', 'NOUN'), ('have', 'AUX'), ('been', 'VERB'), ('central', 'ADJ'), ('success', 'NOUN'), ('software', 'NOUN'), ('industry', 'NOUN'), ('has', 'VERB'), ('unparalleled', 'ADJ'), ('knowledge', 'NOUN'), ('Company’s', 'NUM'), ('history', 'NOUN'), ('technologies', 'NOUN'), ('Company', 'NOUN'), ('its', 'PRON'), ('grew', 'VERB'), ('fledgling', 'ADJ'), ('business', 'NOUN'), ('into', 'ADP'), ('world’s', 'NUM'), ('leading', 'VERB'), ('company', 'NOUN'), ('process', 'NOUN'), ('creating', 'VERB'), ('one', 'NUM'), ('most', 'ADV'), ('prolific', 'ADJ'), ('sources', 'NOUN'), ('innovation', 'NOUN'), ('powerful', 'ADJ'), ('brands', 'NOUN'), ('through', 'ADP'), ('motion', 'NOUN'), ('technological', 'ADJ'), ('strategic', 'ADJ'), ('programs', 'NOUN'), ('that', 'DET'), ('are', 'VERB'), ('core', 'NOUN'), ('part', 'NOUN'), ('continues', 'VERB'), ('provide', 'VERB'), ('technical', 'ADJ'), ('input', 'NOUN'), ('evolution', 'NOUN'), ('productivity', 'NOUN'), ('platform', 'NOUN'), ('mobile', 'NOUN'), ('first', 'ADJ'), ('cloud', 'NOUN'), ('world', 'NOUN'), ('His', 'PRON'), ('work', 'NOUN'), ('overseeing', 'VERB'), ('provides', 'VERB'), ('global', 'ADJ'), ('insights', 'NOUN'), ('relevant', 'ADJ'), ('current', 'ADJ'), ('future', 'ADJ'), ('opportunities', 'NOUN'), ('keen', 'ADJ'), ('appreciation', 'NOUN'), ('stakeholder', 'ADJ'), ('interests', 'NOUN')] 
[(x,y) for x,y in data if ('NOUN' == y) or ('PROPN' in y)] 

我把2种方法来评估,如果在一个以上,所以你可以选择。 您也可以使用更强大的语法将其转换为pandas来进行查询。这有助于使用数据框的更复杂的查询。

import pandas as pd 
data = [('Mr', 'PROPN'), ('.', 'PUNCT'), ('William', 'PROPN'), ('Henry', 'PROPN'), ('Gates', 'PROPN'), (',', 'PUNCT'), ('III', 'NUM'), ('is', 'VERB'), ('Founder', 'PROPN'), ('and', 'CONJ'), ('Technology', 'PROPN'), ('Advisor', 'NOUN'), ('Director', 'NOUN'), ('of', 'ADP'), ('Microsoft', 'PROPN'), ('Corporation', 'PROPN'), ('a', 'DET'), ('cofounder', 'NOUN'), ('served', 'VERB'), ('as', 'ADP'), ('Chairman', 'PROPN'), ('from', 'ADP'), ('our', 'PRON'), ('incorporation', 'NOUN'), ('in', 'ADP'), ('1981', 'NUM'), ('until', 'ADP'), ('2014', 'NUM'), ('He', 'PRON'), ('currently', 'ADV'), ('acts', 'VERB'), ('Technical', 'ADJ'), ('to', 'ADP'), ('Nadella', 'NUM'), ('on', 'ADP'), ('key', 'ADJ'), ('development', 'NOUN'), ('projects', 'NOUN'), ('retired', 'VERB'), ('an', 'DET'), ('employee', 'NOUN'), ('2008', 'NUM'), ('Chief', 'NOUN'), ('Software', 'PROPN'), ('Architect', 'PROPN'), ('2000', 'NUM'), ('2006', 'NUM'), ('when', 'ADV'), ('he', 'PRON'), ('announced', 'VERB'), ('his', 'PRON'), ('two', 'NUM'), ('-', 'PUNCT'), ('year', 'NOUN'), ('plan', 'NOUN'), ('transition', 'VERB'), ('out', 'ADP'), ('day', 'NOUN'), ('full', 'ADJ'), ('time', 'NOUN'), ('role', 'NOUN'), ('Executive', 'PROPN'), ('Officer', 'PROPN'), ('resigned', 'VERB'), ('assumed', 'VERB'), ('the', 'DET'), ('position', 'NOUN'), ('As', 'ADP'), ('co', 'PROPN'), ('chair', 'NOUN'), ('Bill', 'NOUN'), ('&', 'CONJ'), ('Melinda', 'PROPN'), ('Foundation', 'PROPN'), ('shapes', 'NOUN'), ('approves', 'VERB'), ('grant', 'NOUN'), ('making', 'VERB'), ('strategies', 'NOUN'), ('advocates', 'NOUN'), ('for', 'ADP'), ('foundation’s', 'NUM'), ('issues', 'NOUN'), ('helps', 'VERB'), ('set', 'VERB'), ('overall', 'ADJ'), ('direction', 'NOUN'), ('organization', 'NOUN'), ('founder', 'NOUN'), ('’', 'NUM'), ('foresight', 'NOUN'), ('vision', 'NOUN'), ('personal', 'ADJ'), ('computing', 'NOUN'), ('have', 'AUX'), ('been', 'VERB'), ('central', 'ADJ'), ('success', 'NOUN'), ('software', 'NOUN'), ('industry', 'NOUN'), ('has', 'VERB'), ('unparalleled', 'ADJ'), ('knowledge', 'NOUN'), ('Company’s', 'NUM'), ('history', 'NOUN'), ('technologies', 'NOUN'), ('Company', 'NOUN'), ('its', 'PRON'), ('grew', 'VERB'), ('fledgling', 'ADJ'), ('business', 'NOUN'), ('into', 'ADP'), ('world’s', 'NUM'), ('leading', 'VERB'), ('company', 'NOUN'), ('process', 'NOUN'), ('creating', 'VERB'), ('one', 'NUM'), ('most', 'ADV'), ('prolific', 'ADJ'), ('sources', 'NOUN'), ('innovation', 'NOUN'), ('powerful', 'ADJ'), ('brands', 'NOUN'), ('through', 'ADP'), ('motion', 'NOUN'), ('technological', 'ADJ'), ('strategic', 'ADJ'), ('programs', 'NOUN'), ('that', 'DET'), ('are', 'VERB'), ('core', 'NOUN'), ('part', 'NOUN'), ('continues', 'VERB'), ('provide', 'VERB'), ('technical', 'ADJ'), ('input', 'NOUN'), ('evolution', 'NOUN'), ('productivity', 'NOUN'), ('platform', 'NOUN'), ('mobile', 'NOUN'), ('first', 'ADJ'), ('cloud', 'NOUN'), ('world', 'NOUN'), ('His', 'PRON'), ('work', 'NOUN'), ('overseeing', 'VERB'), ('provides', 'VERB'), ('global', 'ADJ'), ('insights', 'NOUN'), ('relevant', 'ADJ'), ('current', 'ADJ'), ('future', 'ADJ'), ('opportunities', 'NOUN'), ('keen', 'ADJ'), ('appreciation', 'NOUN'), ('stakeholder', 'ADJ'), ('interests', 'NOUN')] 
data = pd.DataFrame(data, columns=['word','type']) 
data[(data.type=='NOUN') | (data.type=='PROPN')] 

编的评论部分:

你必须了解你的数据一样的东西的能力。

data.groupby(data.type).count() 

     word 
type 
ADJ  20 
ADP  12 
ADV  3 
AUX  1 
CONJ  2 
DET  4 
NOUN  56 
NUM  13 
PRON  6 
PROPN 16 
PUNCT  3 
VERB  22 

您可以在计算完成后将其转换回python数据类型。

list(data[(data.type=='NOUN') | (data.type=='PROPN')].word) 
+1

我不知道这可以用熊猫来完成。你能提供更多关于如何使用熊猫来完成这个任务的例子吗?非常感谢你。 –

1

像这样吗?

thing_list = [] 
for i, x in enumerate(data): 
    if x[0] == "of": 
     if (data[i-1][1] == "NOUN") and (data[i+1][1] == "PROPN"): 
      thing_list.append(data[i-1:i+2]) 
+0

我得到了这个:'[[('Director','NOUN'),('','ADP')]]'这是错误的。 –

+0

哎呀,混淆了一些痕迹。我认为现在已经解决了。 –

1

因为看起来很有意思而在这件事上做了一个破解。如果你愿意忍受的map S,lambda S和filter汤s此似乎工作:

matches = map(
    lambda _: (data[_ - 1], data[_], data[_ + 1]), 
    filter(
     lambda _: data[_ - 1][1] == "NOUN" and data[_ + 1][1] == "PROPN", 
     map(
      lambda _: _[0], 
      filter(
       lambda _: _[1][0] == "of", 
       enumerate(data) 
      ) 
     ) 
    ) 
) 
+1

如果您需要'lambda'来使用'map'或'filter',请不要使用'map'或'filter'。它比等效的listcomps或genexprs慢,不易读,不太明显,并且通常不那么简洁。 – ShadowRanger

1
def trioPattern(trioCols, trioElements): 
    """trioCols = (Use element 0 or 1 of the first pair, 0 or 1 of the second pair, 0 or 1 of the third pair) 
     trioElements = (Phrase of the first element, Phrase of the second element, Phrase of the third element)""" 

    data = [('Mr', 'PROPN'), ('.', 'PUNCT'), ('William', 'PROPN'), ('Henry', 'PROPN'), ('Gates', 'PROPN'), (',', 'PUNCT'), ('III', 'NUM'), ('is', 'VERB'), ('Founder', 'PROPN'), ('and', 'CONJ'), ('Technology', 'PROPN'), ('Advisor', 'NOUN'), ('Director', 'NOUN'), ('of', 'ADP'), ('Microsoft', 'PROPN'), ('Corporation', 'PROPN'), ('a', 'DET'), ('cofounder', 'NOUN'), ('served', 'VERB'), ('as', 'ADP'), ('Chairman', 'PROPN'), ('from', 'ADP'), ('our', 'PRON'), ('incorporation', 'NOUN'), ('in', 'ADP'), ('1981', 'NUM'), ('until', 'ADP'), ('2014', 'NUM'), ('He', 'PRON'), ('currently', 'ADV'), ('acts', 'VERB'), ('Technical', 'ADJ'), ('to', 'ADP'), ('Nadella', 'NUM'), ('on', 'ADP'), ('key', 'ADJ'), ('development', 'NOUN'), ('projects', 'NOUN'), ('retired', 'VERB'), ('an', 'DET'), ('employee', 'NOUN'), ('2008', 'NUM'), ('Chief', 'NOUN'), ('Software', 'PROPN'), ('Architect', 'PROPN'), ('2000', 'NUM'), ('2006', 'NUM'), ('when', 'ADV'), ('he', 'PRON'), ('announced', 'VERB'), ('his', 'PRON'), ('two', 'NUM'), ('-', 'PUNCT'), ('year', 'NOUN'), ('plan', 'NOUN'), ('transition', 'VERB'), ('out', 'ADP'), ('day', 'NOUN'), ('full', 'ADJ'), ('time', 'NOUN'), ('role', 'NOUN'), ('Executive', 'PROPN'), ('Officer', 'PROPN'), ('resigned', 'VERB'), ('assumed', 'VERB'), ('the', 'DET'), ('position', 'NOUN'), ('As', 'ADP'), ('co', 'PROPN'), ('chair', 'NOUN'), ('Bill', 'NOUN'), ('&', 'CONJ'), ('Melinda', 'PROPN'), ('Foundation', 'PROPN'), ('shapes', 'NOUN'), ('approves', 'VERB'), ('grant', 'NOUN'), ('making', 'VERB'), ('strategies', 'NOUN'), ('advocates', 'NOUN'), ('for', 'ADP'), ('foundation’s', 'NUM'), ('issues', 'NOUN'), ('helps', 'VERB'), ('set', 'VERB'), ('overall', 'ADJ'), ('direction', 'NOUN'), ('organization', 'NOUN'), ('founder', 'NOUN'), ('’', 'NUM'), ('foresight', 'NOUN'), ('vision', 'NOUN'), ('personal', 'ADJ'), ('computing', 'NOUN'), ('have', 'AUX'), ('been', 'VERB'), ('central', 'ADJ'), ('success', 'NOUN'), ('software', 'NOUN'), ('industry', 'NOUN'), ('has', 'VERB'), ('unparalleled', 'ADJ'), ('knowledge', 'NOUN'), ('Company’s', 'NUM'), ('history', 'NOUN'), ('technologies', 'NOUN'), ('Company', 'NOUN'), ('its', 'PRON'), ('grew', 'VERB'), ('fledgling', 'ADJ'), ('business', 'NOUN'), ('into', 'ADP'), ('world’s', 'NUM'), ('leading', 'VERB'), ('company', 'NOUN'), ('process', 'NOUN'), ('creating', 'VERB'), ('one', 'NUM'), ('most', 'ADV'), ('prolific', 'ADJ'), ('sources', 'NOUN'), ('innovation', 'NOUN'), ('powerful', 'ADJ'), ('brands', 'NOUN'), ('through', 'ADP'), ('motion', 'NOUN'), ('technological', 'ADJ'), ('strategic', 'ADJ'), ('programs', 'NOUN'), ('that', 'DET'), ('are', 'VERB'), ('core', 'NOUN'), ('part', 'NOUN'), ('continues', 'VERB'), ('provide', 'VERB'), ('technical', 'ADJ'), ('input', 'NOUN'), ('evolution', 'NOUN'), ('productivity', 'NOUN'), ('platform', 'NOUN'), ('mobile', 'NOUN'), ('first', 'ADJ'), ('cloud', 'NOUN'), ('world', 'NOUN'), ('His', 'PRON'), ('work', 'NOUN'), ('overseeing', 'VERB'), ('provides', 'VERB'), ('global', 'ADJ'), ('insights', 'NOUN'), ('relevant', 'ADJ'), ('current', 'ADJ'), ('future', 'ADJ'), ('opportunities', 'NOUN'), ('keen', 'ADJ'), ('appreciation', 'NOUN'), ('stakeholder', 'ADJ'), ('interests', 'NOUN')] 

    #Elements of the triple pattern 
    ColE1, ColE2, ColE3 = trioCols 
    trios = dict([((data[e][ColE1], data[e+1][ColE2], data[e+2][ColE3]), (data[e], data[e+1], data[e+2])) for e in range(0, len(data)-2)]) 

    #Triple pattern phrases 
    E1, E2, E3 = trioElements 
    if trios.has_key((E1, E2, E3)): 
     return trios[(E1, E2, E3)] 
    else: 
     return "Not found" 

例子:

trioPattern((1,0,1) )

(('Director', 'NOUN'), ('of', 'ADP'), ('Microsoft', 'PROPN')) 

trioPattern((0,1,1),( “世界”, “动词”, “名词”)),( “PROPN” “的” “名词”,)

(('world’s', 'NUM'), ('leading', 'VERB'), ('company', 'NOUN'))