Python：使用dict理解/生成器计算列表中的事件

我想编写一些测试来分析python中不同操作的效率，即字典理解和字典生成器的比较。Python：使用dict理解/生成器计算列表中的事件

为了测试这个，我想我会尝试一个简单的例子：使用字典计算列表中的单词数量。

现在我知道你可以使用collections.Counter（根据这里的答案：How can I count the occurrences of a list item in Python?）来做到这一点，但我的目标是测试内存的性能。

一个“长手”的方法是在基本循环中完成。

from pprint import pprint 

# Read in some text to create example data 
with open('text.txt') as f: 
    words = f.read().split() 

dict1 = {} 
for w in words: 
    if not dict1.get(w): 
     dict1[w] = 1 
    else: 
     dict1[w] += 1 
pprint(dict1)

结果：

{'a': 62, 
'aback': 1, 
'able': 1, 
'abolished': 2, 
'about': 6, 
'accept': 1, 
'accepted': 1, 
'accord': 1, 
'according': 1, 
'across': 1, 
...

然后我有点卡住试图做同样的字典解析：

dict2 = { w: 1 if not dict2.get(w) else dict2.get(w) + 1 
      for w in words }

我得到了一个错误：

NameError: global name 'dict2' is not defined

我试着定义前面的字典：

dict2 = {} 
dict2 = { w: 1 if not dict2.get(w) else dict2.get(w) + 1 
      for w in words } 
pprint(dict2)

但当然计数都设置为1：

{'a': 1, 
'aback': 1, 
'able': 1, 
'abolished': 1, 
'about': 1, 
'accept': 1, 
'accepted': 1, 
'accord': 1, 
'according': 1, 
'across': 1, 
...

我也有类似的问题，字典理解：

dict3 = dict((w, 1 if not dict2.get(w) else dict2.get(w) + 1) 
       for w in words)

所以我的问题是：我怎么能最有效地使用字典理解/生成器来计算列表中出现的次数？

更新：@Rawing提出了一种替代方法{word:words.count(word) for word in set(words)}，但这将绕过我试图测试的机制。

来源

2014-11-04 Captain Whippet

'dict2'是空的，如果第一个地方就是你得到那个结果的原因。原因是你在检查'dict2.get（w）'时不要在'dict2'中插入结果。我不知道你是否可以用字典理解来解决这个问题，因为你必须存储计数。 – badc0re 2014-11-04 09:34:23

我认为这样做的方式是'{单词：words.count（单词）在单词（单词）}'，但我怀疑它是有效的。 – 2014-11-04 09:49:52

@ badc0re是的，我想你可能是对的。也许我需要提出一个更好的测试例子。我会看看是否有其他人有任何想法。谢谢你的帮助。 – 2014-11-04 09:50:01

使用字典理解不能有效地（至少在内存方面）做到这一点，因为那样你就必须跟踪另一个字典中的当前计数，即更多的内存消耗。下面是如何使用的字典，理解（不建议在所有:-)）做到这一点：

>>> words = list('asdsadDASDFASCSAASAS') 
>>> dct = {} 
>>> {w: 1 if w not in dct and not dct.update({w: 1}) 
        else dct[w] + 1 
        if not dct.update({w: dct[w] + 1}) else 1 for w in words} 
>>> dct 
{'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}

另一种方式将它们分组使用itertools.groupby排序的话榜第一，然后再算上各家之长组。在这里，快译通，理解可以，如果你想转换为发电机，但是这将需要阅读所有单词记忆第一：

from itertools import groupby 
words.sort() 
dct = {k: sum(1 for _ in g) for k, g in groupby(words)}

注意的是，很多的最快的国家之一是collections.defaultdict：

d = defaultdict(int) 
for w in words: d[w] += 1

时机比较：

>>> from string import ascii_letters, digits 
>>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)} 
10 loops, best of 3: 131 ms per loop 
>>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words) 
10 loops, best of 3: 169 ms per loop 
>>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words} 
1 loops, best of 3: 315 ms per loop 
>>> %%timeit 
... words = list(ascii_letters+digits)*10**4 
... d = defaultdict(int) 
... for w in words: d[w] += 1 
... 
10 loops, best of 3: 57.1 ms per loop 
>>> %%timeit 
words = list(ascii_letters+digits)*10**4 
d = {} 
for w in words: d[w] = d.get(w, 0) + 1 
... 
10 loops, best of 3: 108 ms per loop 

#Increase input size 

>>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)} 
1 loops, best of 3: 1.44 s per loop 
>>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words) 
1 loops, best of 3: 1.7 s per loop 
>>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words} 

1 loops, best of 3: 3.19 s per loop 
>>> %%timeit 
words = list(ascii_letters+digits)*10**5 
d = defaultdict(int) 
for w in words: d[w] += 1 
... 
1 loops, best of 3: 571 ms per loop 
>>> %%timeit 
words = list(ascii_letters+digits)*10**5 
d = {} 
for w in words: d[w] = d.get(w, 0) + 1 
... 
1 loops, best of 3: 1.1 s per loop

来源

2014-11-04 10:30:43

谢谢你 - 这非常有趣。 – 2014-11-05 10:35:51

这是一个用例，理解不适应/高效。

理解是好的，当你可以构建集合在一个单一的操作。这是不是真的这里的情况，因为：

要么因为他们来和变化在字典因此

值你把话或你必须先计算出密钥集（Rawing解决方案），但然后你浏览列表一次获得密钥集，每个键一次

恕我直言，最有效的方式是迭代。

来源

2014-11-04 10:23:42

你可以做我t这样：

>>> words=['this','that','is','if','that','is','if','this','that'] 
>>> {i:words.count(i) for i in words} 
{'this': 2, 'is': 2, 'if': 2, 'that': 3}

来源

2014-11-04 11:21:51

美丽！虽然不知道为什么在这种情况下不能使用理解，即使它是单一操作。 – curlyreggie 2016-01-05 07:31:58

Python：使用dict理解/生成器计算列表中的事件

回答

相关问题