2016-06-08 81 views

Python: How do I keep all of my data when using .value_counts()? .value_counts() dropped the rest of my data. How can I analyze my data without losing the other information? Or is there another way to count values that won't drop the rest of my columns?

Here is my code:

from pandas import DataFrame, read_csv 
import pandas as pd 
f1 = pd.read_csv('lastlogonuser.txt', sep='\t', encoding='latin1') 
f2 = pd.read_csv('UserAccounts.csv', sep=',', encoding ='latin1') 
f2 = f2.rename(columns={'Shortname':'User Name'}) 
f = pd.concat([f1, f2]) 
counts = f['User Name'].value_counts() 
f = counts[counts == 1] 
f 

I get something like this when I run my code:

sample534   1 
sample987   1 
sample342   1 
sample321   1 
sample123   1 

What I want is something like this:

User Name Description     CN Account 
1 sample534 Journal Mailbox managed by   
1 sample987 Journal Mailbox managed by  
1 sample342 Journal Mailbox managed by 
1 sample321 Journal Mailbox managed by 
1 sample123 Journal Mailbox managed by 

A sample of the data I am using:

Account User Name User CN      Description 
ENABLED MBJ29  CN=MBJ29,CN=Users    Journal Mailbox managed by 
ENABLED MBJ14  CN=MBJ14,CN=Users    Journal Mailbox managed by 
ENABLED MBJ08  CN=MBJ30,CN=Users    Journal Mailbox managed by 
ENABLED MBJ07  CN=MBJ07,CN=Users    Journal Mailbox managed by 

I think your goal is to get a DataFrame containing only the rows with unique users, correct? – shane

Answer


You can use DataFrame.duplicated to determine which rows are duplicates, then filter with loc:

f = f.loc[~f.duplicated(subset=['User Name'], keep=False), :] 

The subset argument restricts the duplicate check to the 'User Name' column. The keep=False argument marks all members of each duplicate group. Since duplicated returns True for duplicates, I negate it with ~.
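The behavior of duplicated with keep=False can be seen on a tiny made-up frame (the sample names below are invented for illustration):

```python
import pandas as pd

# Toy frame: 'sample534' appears twice, the other two names once each.
f = pd.DataFrame({
    'User Name': ['sample534', 'sample534', 'sample987', 'sample342'],
    'Description': ['Journal Mailbox managed by'] * 4,
})

# keep=False marks every row of a duplicated group as True, so the
# negated mask keeps only names that occur exactly once -- and because
# we select whole rows with loc, all other columns survive.
unique_rows = f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
print(unique_rows['User Name'].tolist())  # ['sample987', 'sample342']
```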

When tested on a fairly large DataFrame with a fair number of duplicates, this appears to be far more efficient than groupby:

%timeit f.loc[~f.duplicated(subset=['User Name'], keep=False), :] 
100 loops, best of 3: 17.4 ms per loop 

%timeit f.groupby('User Name').filter(lambda x: len(x) == 1) 
1 loop, best of 3: 6.78 s per loop 
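If you would rather keep value_counts() as in the question, you can turn the counts into a row filter with isin instead of assigning the counts back to f. A minimal sketch with made-up data:

```python
import pandas as pd

f = pd.DataFrame({
    'User Name': ['sample534', 'sample534', 'sample987', 'sample342'],
    'Account': ['ENABLED'] * 4,
})

counts = f['User Name'].value_counts()
once = counts[counts == 1].index        # names that occur exactly once
result = f[f['User Name'].isin(once)]   # boolean mask keeps every column
print(result['User Name'].tolist())     # ['sample987', 'sample342']
```

The original code lost the other columns because it replaced f with the value_counts Series; filtering f itself with a mask preserves them.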

Thanks! Much better than value_counts. – JetCorey
