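How do I suppress row numbers when iterating over a pandas DataFrame?

I have a large pandas DataFrame with many rows. How can I stop pandas from emitting the row number (the index) for each row?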

  id1 id2 id3  count
0   a   b   a      1
1   a   b   b      2
2   a   b   c      3

I want to count how many times each row occurs. This is what I am trying:

import pandas as pd 
from collections import Counter 

pdf = pd.DataFrame.from_records(data_tupl) 
cnts = Counter(pdf.itertuples())

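Unfortunately, itertuples() includes the row number in each tuple, e.g. (0, 'a', 'b', 'a', 1), which I definitely don't need. I could of course slice it off, but that requires an intermediate step, which hurts performance. Is it possible to suppress the row number in the pandas output?

Source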

2016-01-24 minerals

Try setting index=False: pdf.itertuples(index=False)

http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.itertuples.html
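To make the suggestion concrete, here is a minimal sketch of the asker's counting code with index=False applied. The sample rows and column names below are made up for illustration (the real data_tupl is not shown in the question), and name=None is an extra option, not part of the answer above, that makes itertuples yield plain tuples instead of namedtuples.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data standing in for the asker's data_tupl.
data_tupl = [
    ("a", "b", "a", 1),
    ("a", "b", "b", 2),
    ("a", "b", "c", 3),
    ("a", "b", "a", 1),  # duplicate of the first row
]
pdf = pd.DataFrame.from_records(data_tupl, columns=["id1", "id2", "id3", "count"])

# index=False drops the row number from each tuple;
# name=None returns plain tuples rather than namedtuples.
cnts = Counter(pdf.itertuples(index=False, name=None))

for row, n in cnts.items():
    print(row, n)
```

The keys of cnts are then ordinary tuples of the column values, so they can be looked up directly, e.g. cnts[("a", "b", "a", 1)].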

Source

2016-01-24 12:13:46 Weeble

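Exactly! Oh, how did I miss that in the docs (facepalm!) – minerals

For large DataFrames with many duplicate rows, counting with the pandas groupby/count methods may be faster than using collections.Counter: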

In [74]: import collections; import numpy as np; import pandas as pd

In [75]: df = pd.DataFrame(np.random.randint(2, size=(10000,4))) 

In [76]: df.reset_index().groupby(list(df.columns)).count().to_dict('dict')['index'] 
Out[76]: 
{(0, 0, 0, 0): 639, 
(0, 0, 0, 1): 621, 
(0, 0, 1, 0): 658, 
(0, 0, 1, 1): 595, 
(0, 1, 0, 0): 601, 
(0, 1, 0, 1): 640, 
(0, 1, 1, 0): 643, 
(0, 1, 1, 1): 632, 
(1, 0, 0, 0): 637, 
(1, 0, 0, 1): 644, 
(1, 0, 1, 0): 574, 
(1, 0, 1, 1): 642, 
(1, 1, 0, 0): 612, 
(1, 1, 0, 1): 667, 
(1, 1, 1, 0): 588, 
(1, 1, 1, 1): 607} 

In [77]: collections.Counter(df.itertuples(index=False)) 
Out[77]: Counter({Pandas(_0=1, _1=1, _2=0, _3=1): 667, Pandas(_0=0, _1=0, _2=1, _3=0): 658, Pandas(_0=1, _1=0, _2=0, _3=1): 644, Pandas(_0=0, _1=1, _2=1, _3=0): 643, Pandas(_0=1, _1=0, _2=1, _3=1): 642, Pandas(_0=0, _1=1, _2=0, _3=1): 640, Pandas(_0=0, _1=0, _2=0, _3=0): 639, Pandas(_0=1, _1=0, _2=0, _3=0): 637, Pandas(_0=0, _1=1, _2=1, _3=1): 632, Pandas(_0=0, _1=0, _2=0, _3=1): 621, Pandas(_0=1, _1=1, _2=0, _3=0): 612, Pandas(_0=1, _1=1, _2=1, _3=1): 607, Pandas(_0=0, _1=1, _2=0, _3=0): 601, Pandas(_0=0, _1=0, _2=1, _3=1): 595, Pandas(_0=1, _1=1, _2=1, _3=0): 588, Pandas(_0=1, _1=0, _2=1, _3=0): 574}) 

In [78]: %timeit collections.Counter(df.itertuples(index=False)) 
100 loops, best of 3: 12.8 ms per loop 

In [79]: %timeit df.reset_index().groupby(list(df.columns)).count().to_dict('dict')['index'] 
100 loops, best of 3: 3.74 ms per loop
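As a self-contained variation on the groupby idea, the sketch below uses groupby(...).size() instead of the reset_index().groupby(...).count() expression from the session above, and checks that it agrees with the Counter result. The column names, the seed, and the smaller frame size are assumptions chosen only to keep the example short and reproducible.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Small reproducible frame with many duplicate rows (values are 0 or 1).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(2, size=(1000, 4)), columns=["a", "b", "c", "d"])

# Group on every column; size() counts the rows in each group and
# to_dict() turns the MultiIndex-ed result into {row_tuple: count}.
group_counts = df.groupby(list(df.columns)).size().to_dict()

# The Counter approach from the question, using plain tuples as keys.
counter_counts = Counter(df.itertuples(index=False, name=None))

# Both approaches should describe the same row frequencies.
print(group_counts == dict(counter_counts))
```

One caveat worth keeping in mind: by default groupby leaves out rows whose key columns contain NaN, whereas Counter counts whatever tuples it is given, so the two can differ on data with missing values.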

For DataFrames with few duplicate rows, the two approaches run at comparable speed:

In [80]: df = pd.DataFrame(np.random.randint(1000, size=(10000,4))) 

In [81]: %timeit collections.Counter(df.itertuples(index=False)) 
100 loops, best of 3: 11.2 ms per loop 

In [82]: %timeit df.reset_index().groupby(list(df.columns)).count().to_dict('dict')['index'] 
100 loops, best of 3: 11.7 ms per loop
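The %timeit lines above only work inside IPython. For anyone who wants to repeat the comparison in a plain script, here is a rough sketch using the standard timeit module. The helper names count_with_counter and count_with_groupby are hypothetical, the groupby helper uses size() rather than the exact expression timed above, and the numbers will depend on the machine and the pandas version, so no timings are claimed here.

```python
import timeit
from collections import Counter

import numpy as np
import pandas as pd


def count_with_counter(df):
    """Count duplicate rows by hashing each row as a plain tuple."""
    return Counter(df.itertuples(index=False, name=None))


def count_with_groupby(df):
    """Count duplicate rows by grouping on every column."""
    return df.groupby(list(df.columns)).size().to_dict()


rng = np.random.RandomState(0)
frames = {
    # Values in {0, 1}: at most 16 distinct rows, so heavy duplication.
    "many duplicates": pd.DataFrame(rng.randint(2, size=(10000, 4)), columns=list("abcd")),
    # Values in [0, 1000): almost every row is unique.
    "few duplicates": pd.DataFrame(rng.randint(1000, size=(10000, 4)), columns=list("abcd")),
}

timings = {}
for label, frame in frames.items():
    for func in (count_with_counter, count_with_groupby):
        # Best of 3 repeats, 5 calls each, reported as seconds per call.
        best = min(timeit.repeat(lambda: func(frame), number=5, repeat=3)) / 5
        timings[(label, func.__name__)] = best
        print(f"{label:>16} | {func.__name__:<20} | {best * 1000:.2f} ms per call")
```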

Source

2016-01-24 12:38:05 unutbu
