How to compare and then concatenate information from two rows of a pandas DataFrame using Python

I wrote this code to concatenate information from two different rows:

import pandas as pd 
import numpy as np 

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']), 
    'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']), 
    'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']), 
    'Z' : pd.Series(['First',' ','Last','First',' ','Last'], ['1','2','3','4','5','6'])} 

output_table = pd.DataFrame(input_table) 

output_table['Previous_Y'] = output_table['Y'] 

output_table.Previous_Y = output_table.Previous_Y.shift(1) 

def Calc_flowpath(x):
    # 'First' rows start a new path; every other row appends Y to the previous Y
    if x['Z'] == 'First':
        return x['Y']
    else:
        return x['Previous_Y'] + x['Y']

output_table['Flowpath'] = output_table.apply(Calc_flowpath, axis=1)

print(output_table)

And my output is (as expected):

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B       BC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       EE

However, what I want the Flowpath function to do is:

If Column Z is "First", Flowpath = Column Y

If Column Z is anything else, Flowpath = Previous Flowpath value + Column Y

Unless Column Y repeats the same value, in which case skip that row.
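
To make the three rules concrete, here is a plain-Python sketch of the intended behaviour, independent of pandas. The function name `flowpaths` and its two list arguments are made up for illustration; it also assumes that a row with no earlier row simply starts from an empty path.

```python
def flowpaths(steps, markers):
    """Return the running path for each row, restarting at every 'First'."""
    result = []
    prev_step = None  # Y value of the previous row
    prev_path = ""    # Flowpath value of the previous row
    for step, marker in zip(steps, markers):
        if marker == "First":
            path = step              # rule 1: a new order starts here
        elif step == prev_step:
            path = prev_path         # rule 3: repeated step, carry the path forward
        else:
            path = prev_path + step  # rule 2: extend the previous path
        result.append(path)
        prev_step, prev_path = step, path
    return result


steps = ["A", "B", "C", "D", "E", "E"]
markers = ["First", " ", "Last", "First", " ", "Last"]
print(flowpaths(steps, markers))
# ['A', 'AB', 'ABC', 'D', 'DE', 'DE']
```

The two lists correspond to columns Y and Z of the table above.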

My target output is:

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

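To give some context, these rows are steps in a manufacturing process, and I am trying to describe the path that material takes through a job shop. My data is a large number of customer orders and every step they took in the manufacturing process. Y is the manufacturing step, and column Z marks the first and last step of each order. I am using Knime for the analysis, but I can't find a node that does this, so I am trying to write a Python script myself, even though I am new to programming (as you can probably tell). In my previous job I would have done this in Alteryx with the Multi-Row node, but I no longer have access to that software. I have spent a lot of time reading the pandas documentation, and I feel the solution is some combination of DataFrame.loc, DataFrame.shift, or DataFrame.cumsum, but I can't figure it out.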

Any help would be greatly appreciated.

Source

2016-08-13 user1673510

I encourage you to accept @Psidom's answer: it does exactly what you want, and in a very elegant way, certainly the most "pandorable" one.

Iterate over the rows of the DataFrame and set the value of the Flowpath column following the logic you outlined in the OP.

import pandas as pd 

output_table = pd.DataFrame({'W' :[1.1, 2.1, 3.1, 4.1, 5.1, 6.1], 
          'X': [7., 8., 9., 10., 11., 12.], 
          'Y': ['A', 'B', 'C', 'D', 'E', 'E'], 
          'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']}, 
          index=range(1, 7)) 

output_table['Flowpath'] = '' 

for idx in output_table.index: 
    this_Z = output_table.loc[idx, 'Z'] 
    this_Y = output_table.loc[idx, 'Y'] 
    last_Y = output_table.loc[idx-1, 'Y'] if idx > 1 else '' 
    last_Flowpath = output_table.loc[idx-1, 'Flowpath'] if idx > 1 else '' 

    if this_Z == 'First':
        output_table.loc[idx, 'Flowpath'] = this_Y
    elif this_Y != last_Y:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath + this_Y
    else:
        output_table.loc[idx, 'Flowpath'] = last_Flowpath

Source

2016-08-13 16:56:53

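Bad things will happen if Z['1'] != 'First', but for your case this works. I understand you wanted something more pandas-ish, so I'm sorry that this answer is pretty much plain Python...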

import pandas as pd 
import numpy as np 

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']), 
    'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']), 
    'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']), 
    'Z' : pd.Series(['First',' ','Last','First',' ','Last'], index =['1','2','3','4','5','6'])} 

ret = pd.Series([None,None,None,None,None,None], index = ['1','2','3','4','5','6']) 
for k in [str(n) for n in range(1,7)]:
    if(input_table['Z'][k]=='First'):
        # a 'First' row starts a new path
        op = input_table['Y'][k]
    else:
        if(input_table['Y'][k]==input_table['Y'][str(int(k)-1)]):
            # repeated step: reuse the previous path unchanged
            op = ret[str(int(k)-1)]
        else:
            op = ret[str(int(k)-1)]+input_table['Y'][k]
    ret[k]=op

input_table['Flowpath'] = ret
output_table = pd.DataFrame(input_table)
print(output_table)

Yields:

  Flowpath    W   X  Y      Z
1        A  1.1   7  A  First
2       AB  2.1   8  B
3      ABC  3.1   9  C   Last
4        D  4.1  10  D  First
5       DE  5.1  11  E
6       DE  6.1  12  E   Last

Source

2016-08-13 17:04:54 kpie

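You can calculate a group variable by taking the cumsum of the condition vector where Z is "First"; this satisfies the first and second conditions. For the third, replace each value of Y that repeats the previous one with an empty string, so that a cumsum over the Y column within each group gives the expected output: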

import pandas as pd 
# calculate the group variable
grp = (output_table.Z == "First").cumsum() 

# calculate a condition vector where the current Y column is the same as the previous one 
dup = output_table.Y.groupby(grp).apply(lambda g: g.shift() != g) 

# replace the duplicated process in Y as empty string, group the column by the group variable 
# calculated above and then do a cumulative sum 
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).cumsum() 

output_table 

#      W   X  Y      Z flowPath
# 1  1.1   7  A  First        A
# 2  2.1   8  B              AB
# 3  3.1   9  C   Last      ABC
# 4  4.1  10  D  First        D
# 5  5.1  11  E              DE
# 6  6.1  12  E   Last       DE

Update: the code above works under 0.15.2 but not 0.18.1; a small adjustment to the last line, as below, makes it work:

output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).apply(pd.Series.cumsum)
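
Since the behaviour of a grouped cumsum on strings has evidently varied between pandas versions, here is a hedged sketch of the same idea that avoids string cumsum altogether and builds the running concatenation with `itertools.accumulate` from the standard library. The helper name `running_concat` is made up for illustration, and the sample table is rebuilt from the question so the snippet stands alone; it has not been checked against every pandas release.

```python
from itertools import accumulate

import pandas as pd

output_table = pd.DataFrame(
    {"W": [1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
     "X": [7., 8., 9., 10., 11., 12.],
     "Y": ["A", "B", "C", "D", "E", "E"],
     "Z": ["First", " ", "Last", "First", " ", "Last"]},
    index=["1", "2", "3", "4", "5", "6"],
)

# Group id: increases by one at every row where Z == "First".
grp = (output_table["Z"] == "First").cumsum()

# True where Y differs from the previous Y within the same group
# (the first row of each group compares against a missing value, so it is kept).
keep = output_table["Y"] != output_table["Y"].groupby(grp).shift()

# Blank out repeated steps so they add nothing to the running path.
steps = output_table["Y"].where(keep, "")


def running_concat(s):
    # Hypothetical helper: cumulative string concatenation within one group.
    return pd.Series(list(accumulate(s)), index=s.index)


output_table["flowPath"] = steps.groupby(grp).transform(running_concat)

print(output_table["flowPath"].tolist())
# ['A', 'AB', 'ABC', 'D', 'DE', 'DE']
```

The grouping and masking steps are the same as above; only the final accumulation differs.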

Source

2016-08-13 17:07:33 Psidom

Beautiful logic! Unfortunately, the last '.cumsum()' raises a 'DataError: No numeric types to aggregate' in 0.18.1.

@AlbertoGarcia-Raboso It looks like the API has changed a bit; I just updated the answer with a method that works in '0.18.1'. Thanks. – Psidom

My pleasure. I think this might be a bug. I'll report it on GitHub.

for index, row in output_table.iterrows():
    prev_index = str(int(index) - 1)
    # DataFrame.set_value was removed in pandas 1.0; .at is the scalar setter
    if row['Z'] == 'First':
        output_table.at[index, 'Flowpath'] = row['Y']
    elif output_table['Y'][prev_index] == row['Y']:
        output_table.at[index, 'Flowpath'] = output_table['Flowpath'][prev_index]
    else:
        output_table.at[index, 'Flowpath'] = output_table['Flowpath'][prev_index] + row['Y']

print(output_table)

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

Source

2016-08-13 18:25:47
