2013-02-15

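Associating entries from one pandas DataFrame with a second one based on time

I have two pandas DataFrames. One contains my usual measurements (time-indexed). The second, from a different source, contains system states. It is also time-indexed, but the times in the state DataFrame do not match the times in the measurement DataFrame. What I want to achieve is that each row of the measurement DataFrame also contains the last state that appeared in the state DataFrame before the time of that measurement.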

For example, I have a state frame like this:

          state 
time           
2013-02-14 12:29:37.101000   SystemReset 
2013-02-14 12:29:39.103000    WaitFace 
2013-02-14 12:29:39.103000  NormalExecution 
2013-02-14 12:29:39.166000  GreetVisitors 
2013-02-14 12:29:46.879000 AskForParticipation 
2013-02-14 12:29:56.807000 IntroduceVernissage 
2013-02-14 12:30:07.275000  PictureQuestion 

My measurements look like this:

      utime 
time 
2013-02-14 12:29:38.697038  0 
2013-02-14 12:29:38.710432  1 
2013-02-14 12:29:39.106475  2 
2013-02-14 12:29:39.200701  3 
2013-02-14 12:29:40.197014  0 
2013-02-14 12:29:42.217976  5 
2013-02-14 12:29:57.460601  7 

I would like to end up with a DataFrame like this:

      utime     state 
time 
2013-02-14 12:29:38.697038  0   SystemReset 
2013-02-14 12:29:38.710432  1   SystemReset 
2013-02-14 12:29:39.106475  2  NormalExecution 
2013-02-14 12:29:39.200701  3   GreetVisitors 
2013-02-14 12:29:40.197014  0   GreetVisitors 
2013-02-14 12:29:42.217976  5   GreetVisitors 
2013-02-14 12:29:57.460601  7 IntroduceVernissage 

I found a very inefficient solution like this:

result = measurements.copy()
stateList = []
for timestamp, _ in measurements.iterrows():
    # Last state recorded at or before this measurement's timestamp
    candidateStates = states.truncate(after=timestamp).tail(1)
    if len(candidateStates) > 0:
        stateList.append(candidateStates['state'].values[0])
    else:
        # No state has been recorded yet at this point
        stateList.append("unknown")

result['state'] = stateList

Do you see any way to optimize this?
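For reference, the loop above is a "last index at or before" lookup, so it can be sketched without iterating over rows. The following is only a minimal sketch, assuming the state index is sorted and non-empty; the helper name last_state_at_or_before and its default parameter are made up for illustration, and the frames are rebuilt from the sample data in the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
states = pd.DataFrame(
    {"state": ["SystemReset", "WaitFace", "NormalExecution", "GreetVisitors",
               "AskForParticipation", "IntroduceVernissage", "PictureQuestion"]},
    index=pd.DatetimeIndex(
        ["2013-02-14 12:29:37.101000", "2013-02-14 12:29:39.103000",
         "2013-02-14 12:29:39.103000", "2013-02-14 12:29:39.166000",
         "2013-02-14 12:29:46.879000", "2013-02-14 12:29:56.807000",
         "2013-02-14 12:30:07.275000"], name="time"),
)
measurements = pd.DataFrame(
    {"utime": [0, 1, 2, 3, 0, 5, 7]},
    index=pd.DatetimeIndex(
        ["2013-02-14 12:29:38.697038", "2013-02-14 12:29:38.710432",
         "2013-02-14 12:29:39.106475", "2013-02-14 12:29:39.200701",
         "2013-02-14 12:29:40.197014", "2013-02-14 12:29:42.217976",
         "2013-02-14 12:29:57.460601"], name="time"),
)


def last_state_at_or_before(measurements, states, default="unknown"):
    """Hypothetical helper: vectorized version of the truncate/tail loop."""
    # side="right" counts the states with time <= measurement time,
    # so subtracting 1 gives the position of the last such state (-1 if none).
    pos = np.searchsorted(states.index.values, measurements.index.values,
                          side="right") - 1
    state_values = states["state"].to_numpy(dtype=object)
    # Clip so that -1 does not wrap around; those rows get the default below.
    found = state_values[np.clip(pos, 0, None)]
    result = measurements.copy()
    result["state"] = np.where(pos >= 0, found, default)
    return result


result = last_state_at_or_before(measurements, states)
print(result)
```

Like the loop, this treats a state recorded at exactly the measurement time as a match; side="left" would require the state to be strictly earlier.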

Answer

Maybe something like

df = df1.join(df2, how='outer')
df['state'] = df['state'].ffill()
df = df.dropna()

would work? The join produces:

>>> df 
              state utime 
time             
2013-02-14 12:29:37.101000   SystemReset NaN 
2013-02-14 12:29:38.697038     NaN  0 
2013-02-14 12:29:38.710432     NaN  1 
2013-02-14 12:29:39.103000    WaitFace NaN 
2013-02-14 12:29:39.103000  NormalExecution NaN 
2013-02-14 12:29:39.106475     NaN  2 
2013-02-14 12:29:39.166000  GreetVisitors NaN 
2013-02-14 12:29:39.200701     NaN  3 
2013-02-14 12:29:40.197014     NaN  0 
2013-02-14 12:29:42.217976     NaN  5 
2013-02-14 12:29:46.879000 AskForParticipation NaN 
2013-02-14 12:29:56.807000 IntroduceVernissage NaN 
2013-02-14 12:29:57.460601     NaN  7 
2013-02-14 12:30:07.275000  PictureQuestion NaN 

and then we can forward-fill the state column:

>>> df['state'] = df['state'].ffill()
>>> df['state']
time 
2013-02-14 12:29:37.101000   SystemReset 
2013-02-14 12:29:38.697038   SystemReset 
2013-02-14 12:29:38.710432   SystemReset 
2013-02-14 12:29:39.103000    WaitFace 
2013-02-14 12:29:39.103000  NormalExecution 
2013-02-14 12:29:39.106475  NormalExecution 
2013-02-14 12:29:39.166000   GreetVisitors 
2013-02-14 12:29:39.200701   GreetVisitors 
2013-02-14 12:29:40.197014   GreetVisitors 
2013-02-14 12:29:42.217976   GreetVisitors 
2013-02-14 12:29:46.879000 AskForParticipation 
2013-02-14 12:29:56.807000 IntroduceVernissage 
2013-02-14 12:29:57.460601 IntroduceVernissage 
2013-02-14 12:30:07.275000  PictureQuestion 
Name: state 

and then drop the rows that have no utime:

>>> df.dropna() 
              state utime 
time             
2013-02-14 12:29:38.697038   SystemReset  0 
2013-02-14 12:29:38.710432   SystemReset  1 
2013-02-14 12:29:39.106475  NormalExecution  2 
2013-02-14 12:29:39.200701  GreetVisitors  3 
2013-02-14 12:29:40.197014  GreetVisitors  0 
2013-02-14 12:29:42.217976  GreetVisitors  5 
2013-02-14 12:29:57.460601 IntroduceVernissage  7 
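The three steps can also be run end to end on the sample data. This is only a sketch: it assumes df1 is the state frame and df2 the measurement frame (which matches the column order shown above), and it uses dropna(subset=['utime']) rather than a bare dropna() so that a measurement taken before the first state would be kept with a missing state instead of being dropped.

```python
import pandas as pd

# Sample data from the question; the variable names are illustrative.
states = pd.DataFrame(
    {"state": ["SystemReset", "WaitFace", "NormalExecution", "GreetVisitors",
               "AskForParticipation", "IntroduceVernissage", "PictureQuestion"]},
    index=pd.DatetimeIndex(
        ["2013-02-14 12:29:37.101000", "2013-02-14 12:29:39.103000",
         "2013-02-14 12:29:39.103000", "2013-02-14 12:29:39.166000",
         "2013-02-14 12:29:46.879000", "2013-02-14 12:29:56.807000",
         "2013-02-14 12:30:07.275000"], name="time"),
)
measurements = pd.DataFrame(
    {"utime": [0, 1, 2, 3, 0, 5, 7]},
    index=pd.DatetimeIndex(
        ["2013-02-14 12:29:38.697038", "2013-02-14 12:29:38.710432",
         "2013-02-14 12:29:39.106475", "2013-02-14 12:29:39.200701",
         "2013-02-14 12:29:40.197014", "2013-02-14 12:29:42.217976",
         "2013-02-14 12:29:57.460601"], name="time"),
)

# 1. Outer join on the time index: every timestamp from either frame appears,
#    with NaN where the other frame has no row at that time.
df = states.join(measurements, how="outer")

# 2. Carry the most recent state forward onto the measurement rows.
df["state"] = df["state"].ffill()

# 3. Keep only the measurement rows and restore the integer dtype of utime,
#    which became float when the join introduced NaN.
df = df.dropna(subset=["utime"]).astype({"utime": "int64"})

print(df)
```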

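You may have to adjust this to handle the case where you have (possibly multiple) states at the same timestamp. Perhaps drop_duplicates with take_last=True (keep='last' in current pandas) would do it. You will also have to think about the < versus <= question a bit harder than I can before my morning coffee.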
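Both caveats can be made explicit with pd.merge_asof, which was added to pandas after this answer was written. The following is a sketch rather than a drop-in solution: the helper name attach_last_state and its strict flag are hypothetical, both frames are assumed to be sorted by time, and ties between states at the same timestamp are resolved by keeping the last one.

```python
import pandas as pd

# Sample data from the question.
states = pd.DataFrame(
    {"state": ["SystemReset", "WaitFace", "NormalExecution", "GreetVisitors",
               "AskForParticipation", "IntroduceVernissage", "PictureQuestion"]},
    index=pd.DatetimeIndex(
        ["2013-02-14 12:29:37.101000", "2013-02-14 12:29:39.103000",
         "2013-02-14 12:29:39.103000", "2013-02-14 12:29:39.166000",
         "2013-02-14 12:29:46.879000", "2013-02-14 12:29:56.807000",
         "2013-02-14 12:30:07.275000"], name="time"),
)
measurements = pd.DataFrame(
    {"utime": [0, 1, 2, 3, 0, 5, 7]},
    index=pd.DatetimeIndex(
        ["2013-02-14 12:29:38.697038", "2013-02-14 12:29:38.710432",
         "2013-02-14 12:29:39.106475", "2013-02-14 12:29:39.200701",
         "2013-02-14 12:29:40.197014", "2013-02-14 12:29:42.217976",
         "2013-02-14 12:29:57.460601"], name="time"),
)


def attach_last_state(measurements, states, strict=False):
    """Hypothetical helper: add the most recent state to each measurement."""
    # Several states at one timestamp: keep only the last one recorded.
    unique_states = states[~states.index.duplicated(keep="last")]
    return pd.merge_asof(
        measurements,
        unique_states,
        left_index=True,
        right_index=True,
        direction="backward",            # look back in time for the state
        allow_exact_matches=not strict,  # strict=True means "<", otherwise "<="
    )


result = attach_last_state(measurements, states)
print(result)
```

With strict=True, a state recorded at exactly the measurement time is ignored and the previous state is used instead; measurements with no earlier state get a missing value in the state column.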

Thanks, this looks good. – languitar 2013-02-18 10:58:46