Apply a function to each row of a pandas DataFrame to create two new columns

I have a pandas DataFrame, st, containing multiple columns:

<class 'pandas.core.frame.DataFrame'> 
DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23 
Data columns: 
Date(dd-mm-yy)_Time(hh-mm-ss)  53732 non-null values 
Julian_Day       53732 non-null values 
AOT_1020       53716 non-null values 
AOT_870        53732 non-null values 
AOT_675        53188 non-null values 
AOT_500        51687 non-null values 
AOT_440        53727 non-null values 
AOT_380        51864 non-null values 
AOT_340        52852 non-null values 
Water(cm)       51687 non-null values 
%TripletVar_1020     53710 non-null values 
%TripletVar_870      53726 non-null values 
%TripletVar_675      53182 non-null values 
%TripletVar_500      51683 non-null values 
%TripletVar_440      53721 non-null values 
%TripletVar_380      51860 non-null values 
%TripletVar_340      52846 non-null values 
440-870Angstrom      53732 non-null values 
380-500Angstrom      52253 non-null values 
440-675Angstrom      53732 non-null values 
500-870Angstrom      53732 non-null values 
340-440Angstrom      53277 non-null values 
Last_Processing_Date(dd/mm/yyyy) 53732 non-null values 
Solar_Zenith_Angle     53732 non-null values 
dtypes: datetime64[ns](1), float64(22), object(1)

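I want to create two new columns for this DataFrame, based on applying a function to each row of the DataFrame. I don't want to call the function more than once (for example, by making two separate apply calls), because it is rather computationally intensive. I have tried doing this in two ways, and neither of them works:

Using apply:

I have written a function which takes a Series and returns a tuple of the values I want: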

def calculate(s): 
    a = s['path'] + 2*s['row'] # Simple calc for example 
    b = s['path'] * 0.153 
    return (a, b)

Trying to apply this to the DataFrame gives an error:

st.apply(calculate, axis=1) 
--------------------------------------------------------------------------- 
AssertionError       Traceback (most recent call last) 
<ipython-input-248-acb7a44054a7> in <module>() 
----> 1 st.apply(calculate, axis=1) 

C:\Python27\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, args, **kwds) 
    4191      return self._apply_raw(f, axis) 
    4192     else: 
-> 4193      return self._apply_standard(f, axis) 
    4194    else: 
    4195     return self._apply_broadcast(f, axis) 

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures) 
    4274     index = None 
    4275 
-> 4276    result = self._constructor(data=results, index=index) 
    4277    result.rename(columns=dict(zip(range(len(res_index)), res_index)), 
    4278       inplace=True) 

C:\Python27\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy) 
    390    mgr = self._init_mgr(data, index, columns, dtype=dtype, copy=copy) 
    391   elif isinstance(data, dict): 
--> 392    mgr = self._init_dict(data, index, columns, dtype=dtype) 
    393   elif isinstance(data, ma.MaskedArray): 
    394    mask = ma.getmaskarray(data) 

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype) 
    521 
    522   return _arrays_to_mgr(arrays, data_names, index, columns, 
--> 523        dtype=dtype) 
    524 
    525  def _init_ndarray(self, values, index, columns, dtype=None, 

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype) 
    5411 
    5412  # consolidate for now 
-> 5413  mgr = BlockManager(blocks, axes) 
    5414  return mgr.consolidate() 
    5415 

C:\Python27\lib\site-packages\pandas\core\internals.pyc in __init__(self, blocks, axes, do_integrity_check) 
    802 
    803   if do_integrity_check: 
--> 804    self._verify_integrity() 
    805 
    806   self._consolidate_check() 

C:\Python27\lib\site-packages\pandas\core\internals.pyc in _verify_integrity(self) 
    892          "items") 
    893    if block.values.shape[1:] != mgr_shape[1:]: 
--> 894     raise AssertionError('Block shape incompatible with manager') 
    895   tot_items = sum(len(x.items) for x in self.blocks) 
    896   if len(self.items) != tot_items: 

AssertionError: Block shape incompatible with manager

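I was then going to assign the values returned from apply to two new columns, using the method shown in this question. However, I can't even get to that point! It all works fine if I return just one value.

Using a loop:

I first created two new columns in the DataFrame and set them to None: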

st['a'] = None 
st['b'] = None

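I then looped over all of the index values and tried to modify the None values I had put there, but the modifications didn't seem to take effect. That is, no error was raised, but the DataFrame didn't appear to be modified.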

for i in st.index: 
    # do calc here 
    st.ix[i]['a'] = a 
    st.ix[i]['b'] = b

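I thought that both of these methods would work, but neither of them did. So, what am I doing wrong here? And what is the best, most "pythonic" and "pandaonic" way to do this?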

Source

2013-02-27 robintw

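To make the first approach work, try returning a Series instead of a tuple (apply raises an exception because it doesn't know how to glue the rows back together when the number of columns doesn't match the original frame).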

def calculate(s): 
    a = s['path'] + 2*s['row'] # Simple calc for example 
    b = s['path'] * 0.153 
    return pd.Series(dict(col1=a, col2=b))

The second approach should work if you replace:

st.ix[i]['a'] = a

with:

st.ix[i, 'a'] = a
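For readers on current pandas: `.ix` was deprecated in pandas 0.20 and removed in 1.0, so the same fix is written with `.loc` today. The chained form `st.ix[i]['a'] = a` first builds a row object and then assigns into that temporary object, which is why the original frame never changed. The following is a minimal sketch, assuming a small stand-in frame with the `path` and `row` columns used in the example calculation (the data values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame `st`.
st = pd.DataFrame({'path': [1, 2, 3], 'row': [10, 20, 30]})

# Pre-create the target columns as float NaN rather than None,
# so they hold a numeric dtype from the start.
st['a'] = float('nan')
st['b'] = float('nan')

for i in st.index:
    path = st.loc[i, 'path']
    row = st.loc[i, 'row']
    # One .loc call with (row label, column label) writes into st itself,
    # instead of into a temporary row object.
    st.loc[i, 'a'] = path + 2 * row
    st.loc[i, 'b'] = path * 0.153
```

A row-by-row loop like this is usually the slowest option, so it is mainly useful when the calculation cannot be expressed any other way.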

Source

2013-02-28 01:21:16 Garrett

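The solution for the second approach works, thanks :-). However, I can't get the first one to work. After returning a Series I get a 'mini-df', but I can't seem to get the values returned by the `apply` function back into the original DataFrame. Using `st['a'], st['b'] = st.apply(calculate, axis=1)` doesn't work, and neither does wrapping the right-hand side in `zip(*)`. Any ideas about what I'm doing wrong here? – robintw 2013-02-28 09:26:14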

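You can use `pd.concat([df, new_df], axis=1)` to join the columns of the 'mini' df onto the original DataFrame. You may also want to consider column-based operations instead of row-based ones, e.g. compute and add column 'a' with `df['a'] = df['path'] + 2 * df['row']` – Garrett 2013-03-01 04:56:57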
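The suggestion in the comment above can be sketched as follows. This is only an illustration under stated assumptions: the frame `st` is a made-up stand-in with `path` and `row` columns, and the result columns are named `a` and `b` rather than `col1` and `col2`. The row-wise function is called once per row, and the column-based version is shown next to it for comparison.

```python
import pandas as pd

def calculate(s):
    a = s['path'] + 2 * s['row']  # simple calc for example
    b = s['path'] * 0.153
    # Returning a Series makes apply build one column per key.
    return pd.Series({'a': a, 'b': b})

# Hypothetical stand-in for the question's DataFrame.
st = pd.DataFrame({'path': [1, 2, 3], 'row': [10, 20, 30]})

# One apply call yields a "mini" DataFrame with columns a and b.
new_df = st.apply(calculate, axis=1)

# Attach the new columns to the original frame, side by side.
st = pd.concat([st, new_df], axis=1)

# Column-based equivalent: no per-row function call at all.
vectorized = pd.DataFrame({
    'a': st['path'] + 2 * st['row'],
    'b': st['path'] * 0.153,
})
```

When the calculation can be written with whole-column arithmetic, the column-based form is normally much faster than any `apply(..., axis=1)` variant.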

This is solved here: Apply pandas function to column to create multiple new columns?

Applied to your question, this should work:

def calculate(s): 
    a = s['path'] + 2*s['row'] # Simple calc for example 
    b = s['path'] * 0.153 
    return pd.Series({'col1': a, 'col2': b}) 

df = df.merge(df.apply(calculate, axis=1), left_index=True, right_index=True)
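On newer pandas releases there is another way to get the same result while keeping the tuple-returning function from the question. The sketch below assumes pandas 0.23 or later, where `DataFrame.apply` accepts `result_type='expand'`; the frame contents and the column names `a` and `b` are illustrative choices, not part of the original answer.

```python
import pandas as pd

def calculate(s):
    a = s['path'] + 2 * s['row']  # simple calc for example
    b = s['path'] * 0.153
    return (a, b)  # plain tuple, as in the question

# Hypothetical stand-in for the question's DataFrame.
st = pd.DataFrame({'path': [1, 2, 3], 'row': [10, 20, 30]})

# result_type='expand' turns each returned tuple into columns 0 and 1.
expanded = st.apply(calculate, axis=1, result_type='expand')
expanded.columns = ['a', 'b']

# join aligns on the index, so each row gets its own results back.
st = st.join(expanded)
```

As with the merge above, the function runs once per row and both results are kept from that single call.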

Source

2013-07-23 13:48:07 user27564

This only works for me if I use: `return pd.Series({'col1': a, 'col2': b})`. Is that syntax specific to certain versions of Python? – danio 2016-05-24 14:21:03

I always use lambda expressions and the built-in map() function to create new columns by combining other columns:

st['a'] = map(lambda path, row: path + 2 * row, st['path'], st['row'])

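This may be slightly more complicated than necessary for linear combinations of numerical columns. On the other hand, I think it is a good convention to adopt, because it also works for more complicated combinations (e.g. working with strings) or for filling in missing data in one column using functions of other columns.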

For example, suppose you have a table with gender and title columns, and some of the titles are missing. You can fill them in with a function as follows:

title_dict = {'male': 'mr.', 'female': 'ms.'} 
table['title'] = map(lambda title, 
    gender: title if title != None else title_dict[gender], 
    table['title'], table['gender'])
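The same fill can also be written without a lambda. This is a sketch of an alternative, not the answer's own method: it uses `Series.map` to look up a default title for every row and `Series.fillna` to use that default only where the title is missing. The sample `table` below is invented for illustration.

```python
import pandas as pd

title_dict = {'male': 'mr.', 'female': 'ms.'}

# Hypothetical table with one missing title.
table = pd.DataFrame({
    'gender': ['male', 'female', 'female'],
    'title': ['dr.', None, 'prof.'],
})

# Look up the default title for each row's gender, then use it
# only where the existing title is missing.
default_titles = table['gender'].map(title_dict)
table['title'] = table['title'].fillna(default_titles)
```

Because `fillna` treats both `None` and `NaN` as missing, this form does not depend on how the gaps are represented in the column.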

Source

2014-06-14 18:14:40

In Python 3 you will need to use `tuple(map(...))` instead of `map(...)`. – Rufflewind 2016-12-25 08:16:43
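The reason for that comment is that `map` returns a lazy iterator in Python 3 rather than a list, so it has to be materialized before it is assigned to a column. A minimal sketch, using `list(...)` (`tuple(...)` works the same way here) and a made-up frame with `path` and `row` columns:

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
st = pd.DataFrame({'path': [1, 2, 3], 'row': [10, 20, 30]})

# map() walks both columns in parallel; list() materializes the
# lazy iterator so pandas receives a concrete sequence of values.
st['a'] = list(map(lambda path, row: path + 2 * row, st['path'], st['row']))
```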

Another solution, based on Assigning New Columns in Method Chains:

st.assign(a = st['path'] + 2*st['row'], b = st['path'] * 0.153)

Note that assign always returns a copy of the data, leaving the original DataFrame unchanged.
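In practice this means the result has to be bound to a name, otherwise the new columns are lost. A short sketch with an invented frame (the name `with_new` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
st = pd.DataFrame({'path': [1, 2, 3], 'row': [10, 20, 30]})

# assign builds a new frame containing the extra columns.
with_new = st.assign(a=st['path'] + 2 * st['row'], b=st['path'] * 0.153)

# st itself still has only its original columns at this point;
# write `st = st.assign(...)` to keep the result under the same name.
```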

Source

2016-05-10 05:11:29
