2014-09-02 47 views
46

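The following code runs fine. Just checking: am I using and timing pandas correctly, and is there a faster way? Thanks. Is this the fastest way to group in pandas?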

$ python3 
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import pandas as pd 
>>> import numpy as np 
>>> import timeit 
>>> pd.__version__ 
'0.14.1' 

def randChar(f, numGrp, N) : 
    things = [f%x for x in range(numGrp)] 
    return [things[x] for x in np.random.choice(numGrp, N)] 

def randFloat(numGrp, N) : 
    things = [round(100*np.random.random(),4) for x in range(numGrp)] 
    return [things[x] for x in np.random.choice(numGrp, N)] 

N=int(1e8) 
K=100 
DF = pd.DataFrame({ 
    'id1' : randChar("id%03d", K, N),  # large groups (char) 
    'id2' : randChar("id%03d", K, N),  # large groups (char) 
    'id3' : randChar("id%010d", N//K, N), # small groups (char) 
    'id4' : np.random.choice(K, N),   # large groups (int) 
    'id5' : np.random.choice(K, N),   # large groups (int) 
    'id6' : np.random.choice(N//K, N),  # small groups (int)    
    'v1' : np.random.choice(5, N),   # int in range [0,4] 
    'v2' : np.random.choice(5, N),   # int in range [0,4] 
    'v3' : randFloat(100,N)    # numeric e.g. 23.5749 
}) 
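
For anyone who wants to try the setup without a machine this large, here is a minimal scaled-down sketch. The names `rand_char`, `N_SMALL` and `small` are made up for illustration, and the size of 10**5 rows is an assumption chosen only so it runs quickly; it is not the 1e8 rows used in the question, so timings from it are not comparable.

```python
import numpy as np
import pandas as pd

def rand_char(fmt, num_grp, n):
    # Same idea as randChar above: build the labels once, then index into them.
    things = [fmt % x for x in range(num_grp)]
    return [things[i] for i in np.random.choice(num_grp, n)]

N_SMALL = 10**5   # hypothetical scaled-down size, not the 1e8 from the question
K = 100

small = pd.DataFrame({
    'id1': rand_char("id%03d", K, N_SMALL),               # large groups (char)
    'id3': rand_char("id%010d", N_SMALL // K, N_SMALL),   # small groups (char)
    'id4': np.random.choice(K, N_SMALL),                  # large groups (int)
    'v1': np.random.choice(5, N_SMALL),                   # ints drawn from 0..4
})

print(small.shape)
print(small['id1'].nunique(), small['id3'].nunique())
print(small['v1'].min(), small['v1'].max())
```

The printed group counts give a quick check that the "large groups" and "small groups" columns have the intended cardinality before scaling N back up.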

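Now time 5 different groupings, repeating each one twice to confirm the timing. [I realise timeit(2) runs the statement twice, but it then reports the total. I'm interested in the times of the first and second runs separately.] During these tests Python uses about 10G of RAM, according to htop.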

>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})"       ,"from __main__ import DF").timeit(1) 
5.604133386000285 
>>> timeit.Timer("DF.groupby(['id1']).agg({'v1':'sum'})"       ,"from __main__ import DF").timeit(1) 
5.505057081000359 

>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})"      ,"from __main__ import DF").timeit(1) 
14.232032927000091 
>>> timeit.Timer("DF.groupby(['id1','id2']).agg({'v1':'sum'})"      ,"from __main__ import DF").timeit(1) 
14.242601240999647 

>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})"    ,"from __main__ import DF").timeit(1) 
22.87025260900009 
>>> timeit.Timer("DF.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})"    ,"from __main__ import DF").timeit(1) 
22.393589012999655 

>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1) 
2.9725865330001398 
>>> timeit.Timer("DF.groupby(['id4']).agg({'v1':'mean', 'v2':'mean', 'v3':'mean'})" ,"from __main__ import DF").timeit(1) 
2.9683854739996605 

>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})" ,"from __main__ import DF").timeit(1) 
12.776488024999708 
>>> timeit.Timer("DF.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})" ,"from __main__ import DF").timeit(1) 
13.558292575999076 
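
One possible way to get the first and second run times from a single call, instead of typing each statement twice, is `timeit.Timer.repeat`, which returns one elapsed time per repetition rather than a total. The sketch below uses a small stand-in `DF` (its size is an assumption so that it runs quickly) and passes a callable instead of a statement string; with the real `DF` only the lambda body would be needed.

```python
import timeit

import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical size) so the sketch runs quickly.
DF = pd.DataFrame({
    'id1': np.random.choice(100, 10**5),
    'v1': np.random.choice(5, 10**5),
})

# repeat=2, number=1: run the grouping once per repetition and return a
# list with one elapsed time per repetition, not a summed total.
timer = timeit.Timer(lambda: DF.groupby(['id1']).agg({'v1': 'sum'}))
times = timer.repeat(repeat=2, number=1)

print(times)  # two floats: first run, second run
```

Passing a callable avoids the `from __main__ import DF` setup string, at the cost of one extra function call per run, which is negligible next to a multi-second groupby.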

Here is the system info:

$ lscpu 
Architecture:   x86_64 
CPU op-mode(s):  32-bit, 64-bit 
Byte Order:   Little Endian 
CPU(s):    32 
On-line CPU(s) list: 0-31 
Thread(s) per core: 2 
Core(s) per socket: 8 
Socket(s):    2 
NUMA node(s):   2 
Vendor ID:    GenuineIntel 
CPU family:   6 
Model:     62 
Stepping:    4 
CPU MHz:    2500.048 
BogoMIPS:    5066.38 
Hypervisor vendor:  Xen 
Virtualization type: full 
L1d cache:    32K 
L1i cache:    32K 
L2 cache:    256K 
L3 cache:    25600K 
NUMA node0 CPU(s):  0-7,16-23 
NUMA node1 CPU(s):  8-15,24-31 

$ free -h 
      total  used  free  shared buffers  cached 
Mem:   240G  74G  166G  372K  33M  550M 
-/+ buffers/cache:  73G  166G 
Swap:   0B   0B   0B 

I don't believe it's relevant, but just in case: the randChar function above is a workaround for a memory error in mtrand.RandomState.choice:

How to solve memory error in mtrand.RandomState.choice?

+4

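`df.groupby` is already pretty well optimized. What alternatives were you considering? The only thing I can think of is setting the 'id' columns as the index and then using `df.groupby(level=id_whatever)`. – 2014-09-02 19:55:35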
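
A minimal sketch of what that suggestion looks like, on a tiny made-up frame (the data and the variable names are illustrative only). It shows that grouping on an index level gives the same result as grouping on the column; whether it is actually any faster on the large `DF` is not established here and would need to be timed.

```python
import pandas as pd

# Tiny illustrative frame, not the benchmark data.
df = pd.DataFrame({
    'id1': ['a', 'b', 'a', 'c', 'b'],
    'v1': [1, 2, 3, 4, 5],
})

# Grouping on the column, as in the question.
by_column = df.groupby('id1')['v1'].sum()

# Grouping on an index level, as suggested in the comment.
indexed = df.set_index('id1')
by_level = indexed.groupby(level='id1')['v1'].sum()

print(by_level)
```

Note that `set_index` itself takes time and memory, so a fair comparison should decide whether that cost counts as part of the grouping.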

+0

@PaulH Thanks, I'll try the 'id' columns as the index. I'm comparing against R's `data.table` (which I maintain). – 2014-09-02 20:02:38

+0

Oh, cool. The other thing I was going to mention is that doing this in an IPython Notebook and using the `timeit` magic would probably preserve some of your sanity. http://nbviewer.ipython.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb#-Some-simple-cell-magics – 2014-09-02 20:06:59

Answers

4

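If you install the IPython shell, you can easily time your code with %timeit. Once it is installed, instead of typing python to start the Python interpreter, you type ipython.

Then you can type your code exactly as you would in the normal interpreter (as you did above).

Then you can type, for example: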

%timeit DF.groupby(['id1']).agg({'v1':'sum'}) 

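This accomplishes the same thing as what you did, but if you use Python a lot, I find it saves a significant amount of typing :).

IPython has a lot of other nice features (such as %paste, which I used to paste in your code and test it, or %run to run a script you have saved in a file), tab completion, etc. http://ipython.org/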