Counting missing values in a time series

I am using Python and Pandas to analyse a data series. My DF looks like this:

                         ActivePowerkW  WindSpeedms  WindSpeedmsstd
    time
    2015-05-26 11:40:00       836.6328     8.234862        1.414558
    2015-05-26 11:50:00       968.5992     8.761620        1.572579
    2015-05-26 12:30:00       614.0503     7.267871        1.575504
    2015-05-26 13:50:00       945.5604     8.709115        1.527079
    2015-05-26 14:00:00       926.6531     8.538967        1.589221
    2015-05-26 14:30:00       666.7984     7.590645        1.324495
    2015-05-26 14:40:00       911.0134     8.466603        1.708189
    2015-05-26 15:10:00      1256.1740     9.868224        1.636775
    2015-05-26 15:30:00      1706.7070    11.225540        1.576277

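Null values are omitted. I want to count all the null values per month, as a percentage.

I thought the simplest approach would be to create a new time series

timeseries_comp = pd.date_range(df.index[0], df.index[df_length], freq='10min')

and then align it with my DF

dif = df.align(timeseries_comp)

and then just count the NaNs. This does not work: align fails with an "unsupported type" error.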
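A note on the error, as a sketch rather than a definitive diagnosis: align() expects another DataFrame or Series, so passing a bare DatetimeIndex is the likely cause of "unsupported type". The snippet below uses only the first three rows of the sample data, and the names grid, aligned and reindexed are made up for illustration. It wraps the index in an empty DataFrame so that align() accepts it, and shows reindex() as the more direct route.

```python
import pandas as pd

# First three rows of the sample data, one column only (illustrative subset).
idx = pd.to_datetime(["2015-05-26 11:40", "2015-05-26 11:50", "2015-05-26 12:30"])
df = pd.DataFrame({"ActivePowerkW": [836.6328, 968.5992, 614.0503]}, index=idx)

# Complete 10-minute grid between the first and last reading.
grid = pd.date_range(df.index[0], df.index[-1], freq="10min")

# align() needs a DataFrame or Series, so wrap the index in an empty frame.
aligned, _ = df.align(pd.DataFrame(index=grid), join="outer", axis=0)

# reindex() accepts the DatetimeIndex directly.
reindexed = df.reindex(grid)

# Slots on the grid with no reading show up as NaN.
missing = aligned["ActivePowerkW"].isna().sum()
print(missing)
```

Either way the gaps become explicit NaN rows that can then be counted.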

What I am ultimately after is something like the following.

Any ideas?


2015-09-04 ardms

Have you tried reindexing? So 'df.reindex(timeseries_comp)' – EdChum

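Thanks. reindex is exactly what I was looking for. Now I need to count per month. I have tried 'Avail_Count = df.resample('M', how={df.count(): 'count'})' and it seems to work, but I am not sure about the result. – ardms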

You should be able to do 'df.reindex(timeseries_comp).groupby([df.index.year, df.index.month]).value_counts(dropna=False)', which should give you all the unique counts including 'NaN', or 'df.reindex(timeseries_comp).groupby([df.index.year, df.index.month]).apply(pd.Series.isnull).sum()' – EdChum

OK, I would reindex your df against your time series, then groupby, then apply isnull and call sum:

In [113]: 
import io
import pandas as pd

# load your data, you can ignore this step 
t="""time,ActivePowerkW,WindSpeedms,WindSpeedmsstd 
2015-05-26 11:40:00,836.6328,8.234862,1.414558 
2015-05-26 11:50:00,968.5992,8.761620,1.572579 
2015-05-26 12:30:00,614.0503,7.267871,1.575504 
2015-05-26 13:50:00,945.5604,8.709115,1.527079 
2015-05-26 14:00:00,926.6531,8.538967,1.589221 
2015-05-26 14:30:00,666.7984,7.590645,1.324495 
2015-05-26 14:40:00,911.0134,8.466603,1.708189 
2015-05-26 15:10:00,1256.1740,9.868224,1.636775 
2015-05-26 15:30:00,1706.7070,11.225540,1.576277""" 
df = pd.read_csv(io.StringIO(t), parse_dates=[0], index_col=[0]) 
df 
Out[113]: 
                     ActivePowerkW  WindSpeedms  WindSpeedmsstd
time
2015-05-26 11:40:00       836.6328     8.234862        1.414558
2015-05-26 11:50:00       968.5992     8.761620        1.572579
2015-05-26 12:30:00       614.0503     7.267871        1.575504
2015-05-26 13:50:00       945.5604     8.709115        1.527079
2015-05-26 14:00:00       926.6531     8.538967        1.589221
2015-05-26 14:30:00       666.7984     7.590645        1.324495
2015-05-26 14:40:00       911.0134     8.466603        1.708189
2015-05-26 15:10:00      1256.1740     9.868224        1.636775
2015-05-26 15:30:00      1706.7070    11.225540        1.576277

In [115]: 
# create your timeseries 
timeseries_comp = pd.date_range(df.index[0], df.index[-1], freq='10min') 
timeseries_comp 
Out[115]: 
DatetimeIndex(['2015-05-26 11:40:00', '2015-05-26 11:50:00', 
       '2015-05-26 12:00:00', '2015-05-26 12:10:00', 
       '2015-05-26 12:20:00', '2015-05-26 12:30:00', 
       '2015-05-26 12:40:00', '2015-05-26 12:50:00', 
       '2015-05-26 13:00:00', '2015-05-26 13:10:00', 
       '2015-05-26 13:20:00', '2015-05-26 13:30:00', 
       '2015-05-26 13:40:00', '2015-05-26 13:50:00', 
       '2015-05-26 14:00:00', '2015-05-26 14:10:00', 
       '2015-05-26 14:20:00', '2015-05-26 14:30:00', 
       '2015-05-26 14:40:00', '2015-05-26 14:50:00', 
       '2015-05-26 15:00:00', '2015-05-26 15:10:00', 
       '2015-05-26 15:20:00', '2015-05-26 15:30:00'], 
       dtype='datetime64[ns]', freq='10T', tz=None) 

In [120]: 
# reindex 
new_df = df.reindex(timeseries_comp) 
# group on hour and minute, you can group on some other resolution 
new_df.groupby([new_df.index.hour, new_df.index.minute]).apply(pd.isnull).sum() 
Out[120]: 
ActivePowerkW  15 
WindSpeedms  15 
WindSpeedmsstd 15 
dtype: int64
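The call above sums over every group, so it returns one grand total (15 per column). To get a count per group instead, one possible variation is to convert the frame to booleans with isnull() first and then sum within each group. This is a sketch only: the name missing_per_hour and the choice of date and hour as the grouping keys are assumptions for illustration.

```python
import io

import pandas as pd

# Same sample data as in the answer above.
t = """time,ActivePowerkW,WindSpeedms,WindSpeedmsstd
2015-05-26 11:40:00,836.6328,8.234862,1.414558
2015-05-26 11:50:00,968.5992,8.761620,1.572579
2015-05-26 12:30:00,614.0503,7.267871,1.575504
2015-05-26 13:50:00,945.5604,8.709115,1.527079
2015-05-26 14:00:00,926.6531,8.538967,1.589221
2015-05-26 14:30:00,666.7984,7.590645,1.324495
2015-05-26 14:40:00,911.0134,8.466603,1.708189
2015-05-26 15:10:00,1256.1740,9.868224,1.636775
2015-05-26 15:30:00,1706.7070,11.225540,1.576277"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0], index_col=[0])

# Complete 10-minute grid between the first and last reading.
timeseries_comp = pd.date_range(df.index[0], df.index[-1], freq="10min")
new_df = df.reindex(timeseries_comp)

# True marks a missing cell; summing the booleans per group counts them.
# Grouped on date and hour here; use index.year and index.month for monthly counts.
missing_per_hour = new_df.isnull().groupby(
    [new_df.index.date, new_df.index.hour]
).sum()
print(missing_per_hour)
```

With this sample the per-hour counts for hours 11 to 15 come out as 0, 5, 5, 3 and 2, which add up to the same 15.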


2015-09-04 09:57:30 EdChum

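Does this count all the NaN values per month, per hour or per minute? It seems to count all the null values in new_df – ardms

It depends on how you group it. In my answer I used hour and minute, but really you should also include year, month and day so that it can distinguish between different dates. Thinking about it, maybe all you need is 'df.reindex(timeseries_comp).isnull().sum()' – EdChum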

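I think I was not clear about what I am trying to do. I need the null/NaN values for each month (or for each hour, in your example), not the total number of nulls. I have edited my question and attached a chart. – ardms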
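For the per-month percentage, one possible approach is to take the mean of the isnull() booleans within each calendar month, since the mean of True/False values is the missing fraction. This is a sketch, not a tested answer for the real dataset: the readings below are made up so that they straddle a month boundary, and the names grid, month and pct_missing are illustrative.

```python
import pandas as pd

# Made-up readings straddling a month boundary (illustrative only).
idx = pd.to_datetime([
    "2015-05-31 23:30", "2015-05-31 23:50",
    "2015-06-01 00:10", "2015-06-01 00:30",
])
df = pd.DataFrame({"ActivePowerkW": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Complete 10-minute grid between the first and last reading.
grid = pd.date_range(df.index[0], df.index[-1], freq="10min")
new_df = df.reindex(grid)

# Label each grid slot with its calendar month, then average the
# missing flags per month and scale to a percentage.
month = new_df.index.to_period("M")
pct_missing = new_df.isnull().groupby(month).mean() * 100
print(pct_missing.round(1))
```

Here May has 1 of 3 slots missing and June 2 of 4. Note that the grid only runs from the first to the last reading, so slots before the first or after the last observation are not counted; if whole months should be covered, pass explicit start and end timestamps to date_range instead.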
