2016-11-11

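Conditional deletion of rows not working as expected in pandas

I have a DataFrame with a Sample column that contains duplicate samples (ending in _2) and a column, same, that records which sample is the original. The New Category column holds a mutation type, where Pathogenic/Likely Pathogenic is the most damaging and Likely Benign is the least damaging. A simplified version of my DataFrame is shown below.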

import pandas as pd

df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS']
                  ])

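I want to conditionally delete either the sample or its duplicate, depending on which one has the less damaging mutation in New Category. For example, if a sample is Likely Benign and its duplicate has a Pathogenic/Likely Pathogenic variant, I want to drop the sample's row.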

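I tried to do this by passing the DataFrame to a function that returns a list of the indices of the rows to delete, and then dropping those rows.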

def get_unwanted_duplicates_ix(df):

    # filter df for samples that have a duplicate
    same_only = df.groupby("same").filter(lambda x: len(x) > 1)

    list_index_to_delete = []

    # compare each row with the row directly after it
    for num in range(0, same_only.shape[0] - 1):

        # .iloc replaces the original .irow(), which has been removed from pandas
        row1 = same_only.iloc[num]
        row2 = same_only.iloc[num + 1]
        index = list(same_only.index.values)[num]

        if row1['Sample'] + "_2" == row2['Sample'] or \
                row1['Sample'] == row2['Sample'] + "_2":

            if row1['New Category'] == row2['New Category']:
                list_index_to_delete.append(index + 1)

            elif row1['New Category'] == "Pathogenic/Likely Pathogenic" \
                    and row2['New Category'] != "Pathogenic/Likely Pathogenic":
                list_index_to_delete.append(index + 1)

            elif row2['New Category'] == "Pathogenic/Likely Pathogenic" \
                    and row1['New Category'] != "Pathogenic/Likely Pathogenic":
                list_index_to_delete.append(index)

            elif row1['New Category'] == "VUS" \
                    and row2['New Category'] != "VUS":
                list_index_to_delete.append(index + 1)

            elif row2['New Category'] == "VUS" \
                    and row1['New Category'] != "VUS":
                list_index_to_delete.append(index)

            elif row1['New Category'] == 'Likely Benign' \
                    and row2['New Category'] == 'Likely Benign':
                list_index_to_delete.append(index + 1)

            else:
                list_index_to_delete.append(index + 1)

    return list_index_to_delete

unwanted = get_unwanted_duplicates_ix(df)
df = df.drop(df.index[unwanted])

The function above is a mess and, unsurprisingly, does not work as I hoped. A pointer in the right direction would be much appreciated.

Answer


First, replace the mutation severity with integers (a higher value means a more damaging mutation).

# .map() yields an integer column here; a category missing from the dict would become NaN
df['New Category code'] = df['New Category'].map(
    {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3})
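
If the real data might contain labels other than the three shown, it may help to fail loudly rather than silently produce missing codes. The following is only a minimal sketch: the helper name `add_severity_code` and the `SEVERITY` constant are hypothetical, introduced here for illustration, and the ranking is the one given above.

```python
import pandas as pd

# Severity ranking from the answer; a higher value means more damaging.
SEVERITY = {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3}

def add_severity_code(df, column='New Category', mapping=SEVERITY):
    """Return a copy of df with an integer 'New Category code' column.

    Hypothetical helper: raises ValueError if any label is not in mapping.
    """
    codes = df[column].map(mapping)
    # Labels that map() could not translate show up as NaN in codes.
    unknown = df.loc[codes.isna(), column].unique().tolist()
    if unknown:
        raise ValueError(f"unmapped categories: {unknown}")
    out = df.copy()
    out['New Category code'] = codes.astype(int)
    return out

demo = pd.DataFrame({'New Category': ['VUS', 'Likely Benign',
                                      'Pathogenic/Likely Pathogenic']})
coded = add_severity_code(demo)
```

Returning a copy leaves the caller's DataFrame untouched, which makes the step easier to rerun while experimenting.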

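The next command depends on whether you want to keep multiple rows that share the same severity. If you do, group by the same column and select the rows whose code equals the group maximum:

df[df.groupby('same')['New Category code'].transform('max') == df['New Category code']]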

      Sample      same                  New Category  New Category code
0   HG_12_34  HG_12_34  Pathogenic/Likely Pathogenic                  3
2    KD_89_9   KD_89_9                 Likely Benign                  1
3  KD_98_9_2   KD_89_9                 Likely Benign                  1
5  LG_3_45_2   LG_3_45                           VUS                  2
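
To see which rows the group-maximum filter keeps and which it discards, here is a self-contained sketch on the question's sample data. The variable names `group_max`, `kept` and `dropped` are mine, chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS'],
                  ])

severity = {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3}
df['New Category code'] = df['New Category'].map(severity)

# transform('max') broadcasts each group's maximum back onto every row,
# so the result lines up with df and can be compared element-wise.
group_max = df.groupby('same')['New Category code'].transform('max')

kept = df[df['New Category code'] == group_max]      # rows at the group maximum
dropped = df[df['New Category code'] < group_max]    # strictly less damaging rows

print(kept.index.tolist())     # [0, 2, 3, 5]
print(dropped.index.tolist())  # [1, 4]
```

Both rows of the KD_89_9 group survive because they tie at the maximum, which is exactly the case the next option addresses.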

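If not (that is, you always want exactly one row per group), sort by the severity code in ascending order instead and take the last row of each group (thanks to @JonClements for the idea):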

df.sort_values('New Category code').groupby('same').last() 

             Sample                  New Category  New Category code
same
HG_12_34   HG_12_34  Pathogenic/Likely Pathogenic                  3
KD_89_9   KD_98_9_2                 Likely Benign                  1
LG_3_45   LG_3_45_2                           VUS                  2
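
A possible variation, if you would rather keep the original integer index and the same column in the result: look up the index label of each group's maximum with `idxmax` and select those rows with `.loc`. This is a sketch of an alternative, not what the answer above does. Note that `idxmax` returns the first occurrence of the maximum, so in a tie it keeps the row that appears first in the frame.

```python
import pandas as pd

df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS'],
                  ])

severity = {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3}
df['New Category code'] = df['New Category'].map(severity)

# One index label per group: the row holding that group's highest code.
best_rows = df.groupby('same')['New Category code'].idxmax()

# Select those rows from the original frame, keeping its index and columns.
one_per_group = df.loc[best_rows]

print(one_per_group['Sample'].tolist())
# ['HG_12_34', 'KD_89_9', 'LG_3_45_2']
```

For the tied KD_89_9 group this keeps KD_89_9 rather than KD_98_9_2, so which variant fits depends on which duplicate you prefer when the severities are equal.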

Is this what you wanted, or did you not want to group by the 'same' column? If not, please add the desired output to the question.


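Rather than transforming and comparing against the max (which returns multiple samples for groups with several rows at the maximum), I'd suggest sorting by the new category code in descending order and then applying 'groupby('same').first()' instead... (or sorting in ascending order and applying '.last()', whichever you prefer)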


@JonClements Thanks, I've updated the answer.