Is this the fastest way to do a groupby in pandas? The following code runs fine; I would just like a sanity check: am I using and timing pandas correctly, and is there a faster way? Thanks.
$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'
def randChar(f, numGrp, N):
    # Build numGrp distinct labels, then sample N of them with replacement
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    # Build numGrp distinct rounded floats, then sample N of them with replacement
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]
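As an aside, both helpers can be vectorized so the final sampling step happens in NumPy rather than in a Python-level list comprehension. A minimal sketch (the names `rand_char_vec` and `rand_float_vec` are mine, not from the post, and these return NumPy arrays instead of lists):

```python
import numpy as np

def rand_char_vec(f, num_grp, n):
    # Build the pool of num_grp labels once, then index it with a
    # single vectorized draw of n positions (with replacement).
    things = np.array([f % x for x in range(num_grp)])
    return things[np.random.choice(num_grp, n)]

def rand_float_vec(num_grp, n):
    # Same idea for the rounded floats: one vectorized round, one draw.
    things = np.round(100 * np.random.random(num_grp), 4)
    return things[np.random.choice(num_grp, n)]
```

Whether this is noticeably faster at N=1e8 would need measuring; the per-element list indexing is what it avoids.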
N=int(1e8)
K=100
DF = pd.DataFrame({
'id1' : randChar("id%03d", K, N), # large groups (char)
'id2' : randChar("id%03d", K, N), # large groups (char)
'id3' : randChar("id%010d", N//K, N), # small groups (char)
'id4' : np.random.choice(K, N), # large groups (int)
'id5' : np.random.choice(K, N), # large groups (int)
'id6' : np.random.choice(N//K, N), # small groups (int)
    'v1' : np.random.choice(5, N),          # int in range [0,5)
    'v2' : np.random.choice(5, N),          # int in range [0,5)
'v3' : randFloat(100,N) # numeric e.g. 23.5749
})
Now time 5 different groupings, each repeated twice to confirm the timings. [I realize that timeit(2) runs the statement twice, but it then reports the total; I am interested in the first-run and second-run times separately.] During these tests Python uses about 10G of RAM, according to htop.
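On getting per-run times without calling `.timeit(1)` twice by hand: `timeit.repeat` returns a list with one entry per run. A minimal sketch (the statement here is a trivial stand-in for the groupby call):

```python
import timeit

# repeat=2, number=1 yields a two-element list, one timing per run,
# rather than the summed total that Timer(...).timeit(2) reports.
times = timeit.repeat("sum(range(1000))", repeat=2, number=1)
assert len(times) == 2  # first-run and second-run times, separately
```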
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})" ,"from __main__ import DF").timeit(1)
5.604133386000285
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})" ,"from __main__ import DF").timeit(1)
5.505057081000359
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})" ,"from __main__ import DF").timeit(1)
14.232032927000091
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})" ,"from __main__ import DF").timeit(1)
14.242601240999647
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
22.87025260900009
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
22.393589012999655
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
2.9725865330001398
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1)
2.9683854739996605
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})" ,"from __main__ import DF").timeit(1)
12.776488024999708
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})" ,"from __main__ import DF").timeit(1)
13.558292575999076
Here is the system info:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 2500.048
BogoMIPS: 5066.38
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
$ free -h
total used free shared buffers cached
Mem: 240G 74G 166G 372K 33M 550M
-/+ buffers/cache: 73G 166G
Swap: 0B 0B 0B
I don't believe this is relevant, but just in case: the randChar function above is a workaround for a memory error in mtrand.RandomState.choice:
How to solve memory error in mtrand.RandomState.choice?
`df.groupby` is already well optimized. What alternatives are you considering? The only one I can think of is setting the 'id' columns as the index and then using `df.groupby(level=id_whatever)`. – 2014-09-02 19:55:35
@PaulH Thanks, I'll try setting the 'id' columns as the index. I'm comparing against R's `data.table` (which I maintain). – 2014-09-02 20:02:38
Oh cool. The other thing I'd mention is that doing this in an IPython Notebook and using the `timeit` magic might preserve some of your sanity. http://nbviewer.ipython.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb#-Some-simple-cell-magics – 2014-09-02 20:06:59
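A minimal sketch of the index-level grouping suggested above, on a toy frame (not the benchmark data):

```python
import pandas as pd

# Tiny illustrative frame; move the grouping column into the index,
# then group on that index level instead of on a column name.
df = pd.DataFrame({'id1': ['a', 'b', 'a', 'b'],
                   'v1': [1, 2, 3, 4]}).set_index('id1')
res = df.groupby(level='id1').agg({'v1': 'sum'})
# res has one row per index label: 'a' -> 4, 'b' -> 6
```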