2017-05-25 86 views
3

Parsing datetimes in a nested pandas DataFrame

I am struggling to parse datetimes in pandas. Here is a simple example:

df.iloc[:10,10:] 
Out[45]: 
           response_date   revision scheduleClosedAt scheduleEventIndex scheduleId scheduleOpenedAt 
0 {u'$date': u'2012-01-10T11:00:00.000+0000'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
1 {u'$date': u'2012-01-19T13:00:00.000+0000'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
2 {u'$date': u'2011-06-15T09:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
3 {u'$date': u'2011-06-22T00:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
4 {u'$date': u'2011-06-30T09:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
5 {u'$date': u'2011-07-05T00:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
6 {u'$date': u'2011-07-14T10:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
7 {u'$date': u'2011-07-20T09:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
8 {u'$date': u'2011-07-26T00:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 
9 {u'$date': u'2011-08-02T00:00:00.000+0100'} {u'Measure': 1}    NaN     NaN  NaN    NaN 

I need to get rid of the nesting in the "response_date" column and convert it to a normal datetime, while keeping the column name "response_date".

I tried:

>> df_respons = df.response_date.apply(pd.Series) 
>> df_new_response = pd.to_datetime(df_respons) 

but got the error:

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing 
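
The error appears to come from the fact that apply(pd.Series) returns a DataFrame (one column per dict key, here '$date'), and pd.to_datetime treats a DataFrame as a set of date-component columns such as year, month and day. A minimal sketch with made-up data, not the real frame from the question, illustrates this:

```python
import pandas as pd

# Made-up frame mimicking the nested column from the question.
df = pd.DataFrame({"response_date": [
    {"$date": "2012-01-10T11:00:00.000+0000"},
    {"$date": "2012-01-19T13:00:00.000+0000"},
]})

# apply(pd.Series) expands each dict into columns, so the result is a
# DataFrame with a single column named '$date', not a Series.
expanded = df["response_date"].apply(pd.Series)
print(list(expanded.columns))

# Given a DataFrame, to_datetime tries to assemble dates from columns
# named year/month/day, so this call raises ValueError.
try:
    pd.to_datetime(expanded)
    raised = False
except ValueError:
    raised = True
print(raised)

# Passing the single column (a Series of strings) parses normally.
parsed = pd.to_datetime(expanded["$date"])
print(parsed)
```

Selecting the '$date' column before calling pd.to_datetime avoids the error, although it does not by itself handle rows where the value is NaN rather than a dict.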

Is there a neat way to turn nested datetimes into a clean column?

Edit

How can I ignore missing values?

43025 {u'$date': u'2015-11-18T10:35:00.000+0000'} 
43026 {u'$date': u'2015-11-18T14:23:00.000+0000'} 
43027 {u'$date': u'2015-11-18T14:23:00.000+0000'} 
43028 {u'$date': u'2015-11-18T15:20:00.000+0000'} 
43029 {u'$date': u'2015-11-18T15:20:00.000+0000'} 
43030           NaN 
43031           NaN 
43032 {u'$date': u'2015-11-19T08:00:00.000+0000'} 
43033 {u'$date': u'2015-11-19T08:00:00.000+0000'} 
43034 {u'$date': u'2015-11-24T08:00:00.000+0000'} 

This gives a new '0' column:

 0     response_date 
43027 NaN 2015-11-18T14:23:00.000+0000 
43028 NaN 2015-11-18T15:20:00.000+0000 
43029 NaN 2015-11-18T15:20:00.000+0000 
43030 NaN       NaN 
43031 NaN       NaN 
43032 NaN 2015-11-19T08:00:00.000+0000 
43033 NaN 2015-11-19T08:00:00.000+0000 
43034 NaN 2015-11-24T08:00:00.000+0000 

Answers

1

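You can use combine_first or fillna to replace each NaN with an empty dict, then pass the column to the DataFrame constructor, using values to convert it to a numpy array and tolist to turn that into a list of dicts: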

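d = {'$date': 'response_date'}
# one empty dict per row, used to replace NaN so every element is a dict
s = pd.Series([{} for _ in range(len(df))], index=df.index)
# the nested column is named '0' in this sample
df = pd.DataFrame(df['0'].combine_first(s).values.tolist()).rename(columns=d)
# alternatively
# df = pd.DataFrame(df['0'].fillna(s).values.tolist()).rename(columns=d)
df['response_date'] = pd.to_datetime(df['response_date'])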
print (df) 
     response_date 
0 2015-11-18 10:35:00 
1 2015-11-18 14:23:00 
2 2015-11-18 14:23:00 
3 2015-11-18 15:20:00 
4 2015-11-18 15:20:00 
5     NaT 
6     NaT 
7 2015-11-19 08:00:00 
8 2015-11-19 08:00:00 
9 2015-11-24 08:00:00 

Another solution, using map:

df['response_date'] = pd.to_datetime(
    df['response_date'].map(lambda x: x['$date'] if isinstance(x, dict) else x))
print (df) 
      response_date 
43025 2015-11-18 10:35:00 
43026 2015-11-18 14:23:00 
43027 2015-11-18 14:23:00 
43028 2015-11-18 15:20:00 
43029 2015-11-18 15:20:00 
43030     NaT 
43031     NaT 
43032 2015-11-19 08:00:00 
43033 2015-11-19 08:00:00 
43034 2015-11-24 08:00:00 
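
The data in the question mixes +0000 and +0100 offsets, and how pd.to_datetime treats mixed offsets by default has changed between pandas versions. As a hedged sketch on made-up sample rows, passing utc=True (an existing to_datetime parameter) converts everything to UTC and yields a single timezone-aware column:

```python
import numpy as np
import pandas as pd

# Made-up sample: two different UTC offsets plus a missing value.
df = pd.DataFrame({"response_date": [
    {"$date": "2012-01-10T11:00:00.000+0000"},
    {"$date": "2011-06-15T09:00:00.000+0100"},
    np.nan,
]})

# Pull the string out of each dict; leave anything else (e.g. NaN) alone.
raw = df["response_date"].map(
    lambda x: x["$date"] if isinstance(x, dict) else x
)

# utc=True converts every timestamp to UTC, so the column gets one
# tz-aware dtype whatever offsets the input used. NaN becomes NaT.
df["response_date"] = pd.to_datetime(raw, utc=True)
print(df)
```

If naive timestamps are preferred, as in the output above, df['response_date'].dt.tz_localize(None) drops the timezone while keeping the UTC wall-clock values.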
1

It sounds like you want something like df.apply(lambda row: pd.to_datetime(row['response_date']['$date']), axis=1):

In [41]: df 
Out[41]: 
           response_date 
0 {'$date': '2011-06-15T09:00:00.000+0100'} 

In [42]: df['response_date'] = df.apply(lambda row: pd.to_datetime(row['response_date']['$date']), axis=1) 

In [43]: df 
Out[43]: 
     response_date 
0 2011-06-15 08:00:00 
+0

Great, thanks! Please see the edited question. –

+0

It depends on what you mean by "ignore". To drop all rows containing NaN, use 'df.dropna()'; in general, http://pandas.pydata.org/pandas-docs/stable/missing_data.html gives an overview of the things you can do. Or is what you want 'df.apply(lambda row: pd.to_datetime(row['response_date']['$date']) if not pd.isnull(row['response_date']) else np.nan, axis=1)'? – fuglede

+0

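Thanks. I can't really drop the missing values from the original dataframe. In the worst case, I can mask the missing values, apply your suggestion, and then insert the parsed values back at the right positions while keeping the original missing values. –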

1

Try this:

In [70]: pd.to_datetime(
      df.response_date.map(lambda x: 
            x['$date'] if isinstance(x, dict) and '$date' in x 
              else x), 
      errors='coerce') 
Out[70]: 
0 2012-01-10 11:00:00 
1 2012-01-19 13:00:00 
2 2011-06-15 08:00:00 
3 2011-06-21 23:00:00 
4 2011-06-30 08:00:00 
5     NaT 
6     NaT 
7 2011-07-20 08:00:00 
8 2011-07-25 23:00:00 
9 2011-08-01 23:00:00 
Name: response_date, dtype: datetime64[ns]
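
Since map keeps the original index, the parsed values can be assigned straight back to the column without dropping any rows. The following is a sketch under assumptions: the sample rows are made up, and extract_date is a hypothetical helper that, unlike the lambda above, returns NaN for a dict lacking '$date'.

```python
import numpy as np
import pandas as pd


def extract_date(value):
    """Hypothetical helper: return the '$date' string from a dict.

    Dicts without '$date' become NaN; non-dict values pass through.
    """
    if isinstance(value, dict):
        return value.get("$date", np.nan)
    return value


# Made-up rows: one valid entry, a missing value, a dict without
# '$date', and a string that cannot be parsed as a date.
df = pd.DataFrame(
    {
        "response_date": [
            {"$date": "2015-11-18T10:35:00.000+0000"},
            np.nan,
            {"other": 1},
            {"$date": "not a date"},
        ],
        "revision": [1, 2, 3, 4],
    },
    index=[43025, 43030, 43031, 43032],
)

# errors='coerce' turns unparseable entries into NaT instead of raising.
df["response_date"] = pd.to_datetime(
    df["response_date"].map(extract_date), errors="coerce", utc=True
)
print(df)
```

The row labels and the other columns are untouched, and every entry that cannot be parsed ends up as NaT rather than being removed.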