Pandas: Expand a DataFrame by the number of observations in a column

Stata has a function, expand, which adds rows to a dataset according to the values in a particular column. For example:

I have:

df = pd.DataFrame({"A":[1, 2, 3], 
        "B":[3,4,5]}) 

    A B 
0 1 3 
1 2 4 
2 3 5

I need:

df2 = pd.DataFrame({"A":[1, 2, 3, 2, 3, 3], 
        "B":[3,4,5, 4, 5, 5]}) 

    A B 
0 1 3 
1 2 4 
2 3 5 
3 2 4 
4 3 5 
5 3 5

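The value in df.loc[0, 'A'] is 1, so no additional row is added to the end of the DataFrame, since B = 3 should occur only once.

The value in df.loc[1, 'A'] is 2, so one observation is added to the end of the DataFrame, bringing the total number of occurrences of B = 4 to 2.

The value in df.loc[2, 'A'] is 3, so two observations are added to the end of the DataFrame, bringing the total number of occurrences of B = 5 to 3.

I have searched previous questions for something to get me started, but had no luck. Any help is appreciated.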


2017-07-28 measure_theory

There are a number of possibilities, all built around np.repeat:

def using_reindex(df): 
    return df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True) 

def using_dictcomp(df): 
    return pd.DataFrame({col:np.repeat(df[col].values, df['A'], axis=0) 
          for col in df}) 

def using_df_values(df): 
    return pd.DataFrame(np.repeat(df.values, df['A'], axis=0), columns=df.columns) 

def using_loc(df): 
    return df.loc[np.repeat(df.index.values, df['A'])].reset_index(drop=True)

For example,

In [219]: df = pd.DataFrame({"A":[1, 2, 3], "B":[3,4,5]}) 
In [220]: df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True) 
Out[220]: 
    A B 
0 1 3 
1 2 4 
2 2 4 
3 3 5 
4 3 5 
5 3 5
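
Note that the output above keeps repeated rows next to each other, whereas the df2 in the question lists the original rows first and appends the extra copies at the end. If that exact order matters, one possible sketch (not part of the original answer, and assuming every value in A is at least 1) is to build the row positions in two pieces and select them with iloc. The helper name expand_append is hypothetical.

```python
import numpy as np
import pandas as pd

def expand_append(df, col="A"):
    """Hypothetical helper: keep the original rows, then append col - 1 extra copies of each."""
    counts = df[col].to_numpy()
    positions = np.arange(len(df))
    # Extra copies only: a row with count n contributes n - 1 additional rows.
    # Assumes all counts are >= 1; np.repeat rejects negative repeat counts.
    extras = np.repeat(positions, counts - 1)
    order = np.concatenate([positions, extras])
    return df.iloc[order].reset_index(drop=True)

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})
df2 = expand_append(df)
print(df2)
```

The row content is the same as in the np.repeat variants; only the ordering differs.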

Here is a benchmark on a 1000-row DataFrame; the result is a DataFrame with roughly 500K rows:

In [208]: df = make_dataframe(1000) 

In [210]: %timeit using_dictcomp(df) 
10 loops, best of 3: 23.6 ms per loop 

In [218]: %timeit using_reindex(df) 
10 loops, best of 3: 35.8 ms per loop 

In [211]: %timeit using_df_values(df) 
10 loops, best of 3: 31.3 ms per loop 

In [212]: %timeit using_loc(df) 
1 loop, best of 3: 275 ms per loop

Here is the code I used to generate df:

import numpy as np 
import pandas as pd 

def make_dataframe(nrows=100): 
    df = pd.DataFrame(
     {'A': np.arange(nrows), 
     'float': np.random.randn(nrows), 
     'str': np.random.choice('Lorem ipsum dolor sit'.split(), size=nrows), 
     'datetime64': pd.date_range('20000101', periods=nrows)}, 
     index=pd.date_range('20000101', periods=nrows)) 
    return df 

df = make_dataframe(1000)
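
As a quick sanity check on the "roughly 500K rows" figure: column A of the benchmark frame is np.arange(1000), so the expanded length is the sum 0 + 1 + ... + 999. The sketch below reproduces only that arithmetic, not the benchmark itself.

```python
import numpy as np

nrows = 1000
counts = np.arange(nrows)  # same values as column 'A' in make_dataframe(1000)

# Repeat each row position by its count; the row with count 0 disappears entirely.
expanded_positions = np.repeat(np.arange(nrows), counts)

print(counts.sum(), len(expanded_positions))
```

Both numbers come out to 499500, which is where the 500K estimate comes from.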

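If there are only a few columns, using_dictcomp is the fastest. Note, however, that using_dictcomp assumes df has unique column names: the dict comprehension in using_dictcomp cannot hold the same column name twice. The other alternatives, however, will work with duplicated column names.

using_reindex and using_loc assume df has a unique index.

using_reindex came from cᴏʟᴅsᴘᴇᴇᴅ's using_loc, in a (sadly) now-deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed that it isn't necessary to repeat all the values manually: you only need to repeat the index and then let df.loc (or df.reindex) repeat all the rows for you. It also avoids accessing df.values, which can generate an intermediate NumPy array of object dtype if df contains columns of more than one dtype.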
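
A small illustration of the column-name caveat, using a made-up frame with a duplicated column name "B" (the frame and variable names are assumptions for this sketch). A dict can store each key only once, while repeating the underlying 2-D array keeps every column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a duplicated column name "B".
dup = pd.DataFrame([[1, 3, 10], [2, 4, 20]], columns=["A", "B", "B"])
counts = dup["A"].to_numpy()

# Iterating a DataFrame yields its column labels; as dict keys,
# the two "B" labels collapse into a single entry.
keys = {col: None for col in dup}

# Repeating the 2-D array of values keeps all three columns.
expanded = pd.DataFrame(np.repeat(dup.to_numpy(), counts, axis=0),
                        columns=dup.columns)
print(len(keys), list(expanded.columns), expanded.shape)
```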
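
If the index is not unique, label-based selection becomes ambiguous. One possible workaround, which is not among the functions above, is to repeat row positions instead of labels and select with iloc. The name using_iloc is hypothetical, and this is a sketch rather than a benchmarked alternative.

```python
import numpy as np
import pandas as pd

def using_iloc(df, col="A"):
    """Hypothetical variant: repeat row positions, so index labels are never looked up."""
    positions = np.repeat(np.arange(len(df)), df[col].to_numpy())
    return df.iloc[positions].reset_index(drop=True)

# Example frame whose index contains the label 0 twice.
dup_index = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]}, index=[0, 0, 1])
result = using_iloc(dup_index)
print(result)
```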
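
The dtype point can be seen on a small mixed-type frame (the frame below is made up for illustration). Going through df.values produces an object array, and a DataFrame rebuilt from it carries object columns, while repeating the index leaves the column dtypes alone.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({"A": [1, 2], "s": ["x", "y"]})
counts = mixed["A"].to_numpy()

# Integers and strings have no common NumPy dtype, so .values falls back to object.
values_dtype = mixed.values.dtype

# Rebuilding from the object array: column A is no longer an integer column.
via_values = pd.DataFrame(np.repeat(mixed.values, counts, axis=0),
                          columns=mixed.columns)

# Repeating index labels and reindexing keeps A as an integer column.
via_reindex = mixed.reindex(np.repeat(mixed.index.to_numpy(), counts)).reset_index(drop=True)

print(values_dtype, via_values["A"].dtype, via_reindex["A"].dtype)
```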


2017-07-28 18:47:14 unutbu

I ran a benchmark out of curiosity, and your first solution is 10 times faster than mine. I know when I've been outdone ;)
