2017-06-12 96 views

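Remove subseries (rows in a data frame) that meet a condition

I have a data frame with a time series (column 1) and a column of values (column 2) that characterize each subseries of the time series. How can I remove the subseries that meet a condition?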

The picture illustrates what I want to do: I want to remove the orange rows.

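I tried a loop that creates an additional column flagging the rows to remove, but this solution is computationally very expensive (the actual column has 10 million records). Code (slow solution):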

import numpy as np 
import pandas as pd 

# sample data (smaller than actual df) 
# length of df = 100; should be 10000000 in the actual data frame 
time_ser = 100*[25] 
max_num = 20 
distance = np.random.uniform(0,max_num,100) 
to_remove= 100*[np.nan] 

data_dict = {'time_ser':time_ser, 
      'distance':distance, 
      'to_remove': to_remove 
      } 

df = pd.DataFrame(data_dict) 

subser_size = 3 
maxdist = 18 


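# Loop that fills an additional column indicating which rows should be removed.
# It takes the first value of a subseries and checks whether it meets the condition.
# If it does, all values in that subseries (i.e. rows) are marked 'wrong'.

# object dtype so the column can hold strings
df['to_remove'] = df['to_remove'].astype(object)
col = df.columns.get_loc('to_remove')

for i, d in enumerate(df['distance']):
    if d >= maxdist:
        df.iloc[i:i + subser_size, col] = 'wrong'
    elif pd.isna(df.iloc[i, col]):
        # keep an earlier 'wrong' mark instead of overwriting it with 'good'
        df.iloc[i, col] = 'good'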

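Answer

You can use a list comprehension to build the index ranges, join them with numpy.concatenate, and remove duplicates with numpy.unique. Then use drop to remove those rows, or loc if you need a new column instead.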

import numpy as np 
import pandas as pd 

np.random.seed(123) 
time_ser = 100*[25] 
max_num = 20 
distance = np.random.uniform(0,max_num,100) 
to_remove= 100*[np.nan] 

data_dict = {'time_ser':time_ser, 
      'distance':distance, 
      'to_remove': to_remove 
      } 

df = pd.DataFrame(data_dict) 
print (df) 
    distance time_ser to_remove 
0 13.929384  25  NaN 
1 5.722787  25  NaN 
2 4.537029  25  NaN 
3 11.026295  25  NaN 
4 14.389379  25  NaN 
5 8.462129  25  NaN 
6 19.615284  25  NaN 
7 13.696595  25  NaN 
8 9.618638  25  NaN 
9 7.842350  25  NaN 
10 6.863560  25  NaN 
11 14.580994  25  NaN 

subser_size = 3 
maxdist = 18 

print (df.index[df['distance'] >= maxdist]) 
Int64Index([6, 38, 47, 84, 91], dtype='int64') 

arr = [np.arange(i, min(i+subser_size,len(df))) for i in df.index[df['distance'] >= maxdist]] 
idx = np.unique(np.concatenate(arr)) 
print (idx) 
[ 6 7 8 38 39 40 47 48 49 84 85 86 91 92 93] 

df = df.drop(idx) 
print (df) 
    distance time_ser to_remove 
0 13.929384  25  NaN 
1 5.722787  25  NaN 
2 4.537029  25  NaN 
3 11.026295  25  NaN 
4 14.389379  25  NaN 
5 8.462129  25  NaN 
9 7.842350  25  NaN 
10 6.863560  25  NaN 
11 14.580994  25  NaN 
... 
... 
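
For a very large frame, the list comprehension can be replaced by NumPy broadcasting. The following is only a sketch of that variant, not part of the original answer: the helper name `subseries_positions` and the small six-row frame are made up for illustration, and it works on row positions rather than index labels.

```python
import numpy as np
import pandas as pd

def subseries_positions(hit_mask, size):
    """Row positions covered by a window of `size` rows starting at each True in hit_mask."""
    starts = np.flatnonzero(hit_mask)
    # Broadcast: one row of offsets (0 .. size-1) per start position.
    pos = (starts[:, None] + np.arange(size)).ravel()
    # Discard positions past the end, then de-duplicate overlapping windows.
    return np.unique(pos[pos < len(hit_mask)])

# Hypothetical toy data, much smaller than the frame in the question.
df = pd.DataFrame({'time_ser': 6 * [25],
                   'distance': [1.0, 19.0, 2.0, 3.0, 4.0, 18.5]})

idx = subseries_positions(df['distance'].to_numpy() >= 18, size=3)
print(idx)  # [1 2 3 5]

# Translate positions to labels before dropping, in case the index is not 0..n-1.
result = df.drop(df.index[idx])
print(result.index.tolist())  # [0, 4]
```

The window that starts at the last row is simply truncated at the end of the frame, which mirrors the `min(i+subser_size, len(df))` guard in the list comprehension.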

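If you need the values in the to_remove column instead of dropping the rows, skip the drop step and assign with loc on the original DataFrame: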

df['to_remove'] = 'good' 
df.loc[idx, 'to_remove'] = 'wrong' 
print (df) 
    distance time_ser to_remove 
0 13.929384  25  good 
1 5.722787  25  good 
2 4.537029  25  good 
3 11.026295  25  good 
4 14.389379  25  good 
5 8.462129  25  good 
6 19.615284  25  wrong 
7 13.696595  25  wrong 
8 9.618638  25  wrong 
9 7.842350  25  good 
10 6.863560  25  good 
11 14.580994  25  good 
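
If only the flag column or a boolean mask is needed, a trailing rolling maximum over the condition might avoid building the index array altogether. This is a sketch of an alternative technique, not the approach from the answer; the toy data and the names `hit`, `flagged`, and `kept` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, much smaller than the frame in the question.
df = pd.DataFrame({'time_ser': 6 * [25],
                   'distance': [1.0, 19.0, 2.0, 3.0, 4.0, 18.5]})
subser_size = 3
maxdist = 18

hit = df['distance'] >= maxdist
# A row belongs to a flagged subseries if it, or any of the previous
# subser_size - 1 rows, is a hit; a trailing rolling max captures that.
flagged = (hit.astype(int)
              .rolling(subser_size, min_periods=1)
              .max()
              .astype(bool))

df['to_remove'] = np.where(flagged, 'wrong', 'good')
kept = df[~flagged]  # same rows that dropping the flagged ones would leave
print(df['to_remove'].tolist())
print(kept.index.tolist())
```

Because the mask is aligned with the frame row by row, it does not depend on the index labels, so it should also work after the index has been filtered or reset.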

Thank you for accepting. You can also upvote: click the small triangle above the '0' above the accept mark. Thanks. – jezrael
