How to remove columns with duplicate names while keeping the data

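I am using a pandas DataFrame as a dataset whose attributes are English words. After stemming, I have multiple columns with the same name. In the sample data below, accept, acceptable and accepted all become accept after stemming. I want to apply bitwise_or across all columns that share a name and then delete the duplicate columns. I tried this code: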

import numpy 
from nltk.stem import * 
import pandas as pd 
ps = PorterStemmer() 
dataset = pd.read_csv('sampleData.csv') 
stemmed_words = [] 

for w in list(dataset): 
    stemmed_words.append(ps.stem(w)) 

dataset.columns = stemmed_words 
new_word = stemmed_words[0] 

for w in stemmed_words: 
    if new_word == w: 
     numpy.bitwise_or(dataset[new_word], dataset[w]) 
     del dataset[w] 
    else: 
     new_word = w 

print(dataset)

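The problem is that when the for loop executes

del dataset['accept']

it deletes every column with that name. I don't know in advance how many columns will share a name, and this code raises the exception KeyError: 'accept'.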
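
The behaviour described here can be reproduced on a tiny frame. The following is a minimal sketch, assuming a hypothetical three-column frame (`demo`) in place of the real `sampleData.csv`: `del` on a duplicated label removes every column carrying that label, so a later `del` on the same label finds nothing and raises `KeyError`.

```python
import pandas as pd

# Hypothetical miniature of the stemmed dataset: two columns share the label 'accept'.
demo = pd.DataFrame([[0, 1, 1], [1, 0, 0]], columns=["accept", "accept", "access"])

# Selecting a duplicated label returns a DataFrame holding every matching column.
selected = demo["accept"]
print(type(selected).__name__, selected.shape)

# del removes all columns with that label in one go.
del demo["accept"]
remaining = list(demo.columns)
print(remaining)

# A second del on the same label has nothing left to remove.
second_delete_failed = False
try:
    del demo["accept"]
except KeyError:
    second_delete_failed = True
print(second_delete_failed)
```

This is why the loop cannot delete the duplicates one at a time by label: the first deletion already takes all of them.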

I want to apply bitwise_or across all three accept columns, save the result to a new column named 'accept', and delete the old columns.
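
One way to express that goal without deleting by label is to build a new frame column by column. This is only a sketch under assumptions: `merge_duplicate_columns` is a hypothetical helper, the `sample` frame is a hand-made stand-in for the real data, and the duplicated columns are assumed to hold integers.

```python
import numpy as np
import pandas as pd

def merge_duplicate_columns(frame):
    """Hypothetical helper: OR together the columns that share a label."""
    merged = {}
    # unique() keeps labels in order of first appearance.
    for name in frame.columns.unique():
        block = frame.loc[:, frame.columns == name]
        if block.shape[1] == 1:
            # Unique label (e.g. 'Class'): copy the column through unchanged.
            merged[name] = block.iloc[:, 0].to_numpy()
        else:
            # Duplicated label: bitwise OR across the duplicates, row by row.
            merged[name] = np.bitwise_or.reduce(block.to_numpy(), axis=1)
    return pd.DataFrame(merged, index=frame.index)

# Hand-made stand-in for the stemmed dataset (assumption, not the real file).
sample = pd.DataFrame(
    [[0, 0, 1, 1, "C"], [1, 0, 0, 0, "A"], [0, 1, 0, 0, "H"]],
    columns=["accept", "accept", "accept", "access", "Class"],
)
merged = merge_duplicate_columns(sample)
print(merged)
```

Selecting with a boolean mask on `frame.columns` sidesteps the label ambiguity, and `np.bitwise_or.reduce` handles any number of duplicates.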

I hope I won't get downvoted this time.

Here is the sample data:

able  abundance  academy  accept  accept  accept  access  accommodation  accompany  Class
   0          0        0       0       0       1       1              0          0      C
   0          0        0       1       0       0       0              0          0      A
   0          0        0       0       1       0       0              0          0      H
   0          0        0       0       0       1       0              1          0      G
   0          0        0       1       0       0       0              0          0      G

The output should be:

Class  able  abundance  academy  accept  access  accommodation  accompany
    C     0          0        0       1       1              0          0
    A     0          0        0       1       0              0          0
    H     0          0        0       1       0              0          0
    G     0          0        0       1       0              1          0
    G     0          0        0       1       0              0          0

Source

2017-05-07 Abrar

IIUC, you can group by column names (axis=1).

DataFrame:

In [101]: df 
Out[101]: 
   able  abundance  academy  accept  accept  accept  access  accommodation  accompany  Class
0     0          0        0       0       0       1       1              0          0      C
1     0          0        0       1       0       0       0              0          0      A
2     0          0        0       0       1       0       0              0          0      H
3     0          0        0       0       0       1       0              1          0      G
4     0          0        0       1       0       0       0              0          0      G

Solution:

In [103]: df.pop('Class').to_frame() \ 
    ...: .join(df.groupby(df.columns, axis=1).any(1).mul(1)) 
Out[103]: 
   Class  able  abundance  academy  accept  access  accommodation  accompany
0      C     0          0        0       1       1              0          0
1      A     0          0        0       1       0              0          0
2      H     0          0        0       1       0              0          0
3      G     0          0        0       1       0              1          0
4      G     0          0        0       1       0              0          0

An even better solution (@ayhan, thank you for the hint!):

In [114]: df = df.pop('Class').to_frame().join(df.groupby(df.columns, axis=1).max()) 

In [115]: df 
Out[115]: 
   Class  able  abundance  academy  accept  access  accommodation  accompany
0      C     0          0        0       1       1              0          0
1      A     0          0        0       1       0              0          0
2      H     0          0        0       1       0              0          0
3      G     0          0        0       1       0              1          0
4      G     0          0        0       1       0              0          0
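
Newer pandas releases deprecate `axis=1` in `groupby`, so the one-liners above may warn or fail there. The following is a minimal sketch of the same grouping idea done through a transpose instead, assuming binary 0/1 columns and the sample frame rebuilt by hand; the names `labels`, `merged` and `result` are illustrative only.

```python
import pandas as pd

columns = ["able", "abundance", "academy", "accept", "accept", "accept",
           "access", "accommodation", "accompany", "Class"]
rows = [
    [0, 0, 0, 0, 0, 1, 1, 0, 0, "C"],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, "A"],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, "H"],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, "G"],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, "G"],
]
df = pd.DataFrame(rows, columns=columns)

# Set the label column aside so only the numeric word columns are grouped.
labels = df.pop("Class")

# Transpose so column names become the index, group equal names,
# take the max (equivalent to OR for 0/1 data), then transpose back.
merged = df.T.groupby(level=0).max().T

result = labels.to_frame().join(merged)
print(result)
```

Grouping sorts the labels, so the word columns come out in alphabetical order, which matches the output shown above.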

Source

2017-05-07 11:13:31 MaxU

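Can you explain this approach a bit more? It doesn't give the desired output; it doesn't group the columns with the same name. I replaced the for loop in the OP with your 'df.groupby(df.columns, axis=1).any(1).mul(1)' – Abrar

@Abrar, please provide a small __reproducible__ sample (3-5 rows) data set (in text/CSV format, so we can copy and paste it) and the desired data set [in your question](http://stackoverflow.com/posts/43830707/edit) – MaxU

@MaxU I think that instead of mul the OP is looking for groupby.sum (since the values are binary, their sum will behave like 'any': 1 if any of them is 1). – ayhan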
