2017-06-13

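Replace outliers with NaN in a pandas DataFrame after applying .groupby()

I want to remove outliers from a pandas DataFrame, using the standard deviation of each column, after applying the groupby function.

Here is my DataFrame: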

  ARI  Flesch Kincaid    Speaker  Score 
0  -2.090000 121.220000 -3.400000     NaN  NaN 
1  8.276460 64.478573 9.034156  William Dudley 1.670275 
2  19.570911 27.362067 17.253580  Janet Yellen -0.604757 
3  -2.090000 121.220000 -3.400000     NaN  NaN 
4  -2.090000 121.220000 -3.400000     NaN  NaN 
5  20.643483 17.069411 18.394178  Lael Brainard 0.215396 
6  -2.090000 121.220000 -3.400000     NaN  NaN 
7  -2.090000 121.220000 -3.400000     NaN  NaN 
8  12.624198 52.220468 11.403157 Jerome H. Powell -1.350798 
9  18.466305 35.186261 16.205693  Stanley Fischer 0.522121 
10 -2.090000 121.220000 -3.400000     NaN  NaN 
11 16.953460 36.246573 15.323457  Lael Brainard -0.217779 
12 -2.090000 121.220000 -3.400000     NaN  NaN 
13 -2.090000 121.220000 -3.400000     NaN  NaN 
14 17.066088 32.592551 16.108486  Stanley Fischer 0.642245 
15 -2.090000 121.220000 -3.400000     NaN  NaN 

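I want to first group the DataFrame by "Speaker", then remove the outliers among the "ARI", "Flesch", and "Kincaid" values, where an outlier is a value more than 3 standard deviations away from the mean of that feature.

Please let me know if this is possible. Thanks!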

+0

Could you paste a snippet of your data instead of attaching an image? That makes it easier for people to reproduce. – titipata

+0

Is that better? Thanks! –

+0

Perfect, thanks Graham. Someone will solve it soon :) – titipata

Answers

1

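The only dependency this approach needs is pandas.

Assume we have replaced the NaN values in the "Speaker" column with something representative, such as "CommitteeOrganization":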

speaker = dataset['Speaker'].fillna(value='CommitteeOrganization')
dataset['Speaker'] = speaker
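To try this step in isolation, here is a minimal sketch. The `dataset` frame below is a hand-typed subset of the rows from the question, built only for illustration; the real data would be loaded from wherever it lives.

```python
import numpy as np
import pandas as pd

# Hand-typed subset of the question's rows (illustrative only).
dataset = pd.DataFrame({
    "ARI": [-2.09, 8.276460, 19.570911, -2.09],
    "Flesch": [121.22, 64.478573, 27.362067, 121.22],
    "Kincaid": [-3.4, 9.034156, 17.253580, -3.4],
    "Speaker": [np.nan, "William Dudley", "Janet Yellen", np.nan],
    "Score": [np.nan, 1.670275, -0.604757, np.nan],
})

# Give rows without a speaker a placeholder label.
dataset["Speaker"] = dataset["Speaker"].fillna("CommitteeOrganization")
print(dataset["Speaker"].tolist())
```

The placeholder matters because groupby drops rows whose key is NaN by default, so without it the unlabeled rows would not appear in the grouped result.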

So our data now looks like this:

Index ARI Flesch Kincaid Speaker Score 
0 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN 
1 8.276460 64.478573 9.034156 WilliamDudley 1.670275 
2 19.570911 27.362067 17.253580 JanetYellen -0.604757 
3 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN 
4 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN 

Group by Speaker with the pandas groupby function:

datasetGrouped = dataset.groupby(by='Speaker').mean()

So the grouped data looks like this:

Speaker    ARI Flesch Kincaid Score 
CommitteeOrganization -2.090000 121.220000 -3.400000 NaN 
JanetYellen 19.570911 27.362067 17.253580 -0.604757 
JeromeH.Powell 12.624198 52.220468 11.403157 -1.350798 
LaelBrainard 18.798471 26.657992 16.858818 -0.001191 
StanleyFischer 17.766196 33.889406 16.157089 0.582183 
WilliamDudley 8.276460 64.478573 9.034156 1.670275 

Compute the standard deviation of each column:

aristd = datasetGrouped['ARI'].std() 
fleschstd = datasetGrouped['Flesch'].std() 
kincaidstd = datasetGrouped['Kincaid'].std() 

Replace the values with NaN in the rows that meet the condition:

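# Assign a real missing value rather than the string 'NaN', so the columns stay numeric.
# Note: the condition compares the absolute value itself with 3 standard deviations, not its distance from the mean.
datasetGrouped.loc[abs(datasetGrouped.ARI) > aristd*3, 'ARI'] = float('nan')
datasetGrouped.loc[abs(datasetGrouped.Flesch) > fleschstd*3, 'Flesch'] = float('nan')
datasetGrouped.loc[abs(datasetGrouped.Kincaid) > kincaidstd*3, 'Kincaid'] = float('nan')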

The final dataset:

Speaker    ARI Flesch Kincaid Score 
CommitteeOrganization -2.090000 NaN -3.400000 NaN 
JanetYellen 19.570911 27.3621 17.253580 -0.604757 
JeromeH.Powell 12.624198 52.2205 11.403157 -1.350798 
LaelBrainard 18.798471 26.658 16.858818 -0.001191 
StanleyFischer 17.766196 33.8894 16.157089 0.582183 
WilliamDudley 8.276460 64.4786 9.034156 1.670275 
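The question defines an outlier by its distance from the mean, so a variant that centers each column first may be closer to what was asked. The sketch below is one way to write that, not the code from this answer: it retypes the per-speaker means from the grouped table, and the names `grouped`, `cols`, `z`, and `cleaned` are my own.

```python
import numpy as np
import pandas as pd

# Per-speaker means copied from the grouped table in this answer.
grouped = pd.DataFrame(
    {
        "ARI": [-2.090000, 19.570911, 12.624198, 18.798471, 17.766196, 8.276460],
        "Flesch": [121.220000, 27.362067, 52.220468, 26.657992, 33.889406, 64.478573],
        "Kincaid": [-3.400000, 17.253580, 11.403157, 16.858818, 16.157089, 9.034156],
    },
    index=["CommitteeOrganization", "JanetYellen", "JeromeH.Powell",
           "LaelBrainard", "StanleyFischer", "WilliamDudley"],
)

cols = ["ARI", "Flesch", "Kincaid"]
# Distance of each value from its column mean, in units of that column's std.
z = (grouped[cols] - grouped[cols].mean()) / grouped[cols].std()
# mask() swaps in NaN wherever the condition is True and keeps the rest.
cleaned = grouped[cols].mask(z.abs() > 3)
print(cleaned)
```

With only six rows, no value can sit more than about 2.04 sample standard deviations from the mean, so this centered variant flags nothing here, whereas the absolute-value comparison flags the 121.22 Flesch value.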

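The full code is available on GitHub.

Note: this could be done with less code than shown here, but the answer goes step by step to make it easier to follow.

Note 2: since the question is somewhat vague, if I have misunderstood something and not given the right answer, please don't hesitate to tell me and I will update the answer if possible.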
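If the goal is instead to judge every row against the mean and standard deviation of its own speaker, `groupby(...).transform` can broadcast the per-speaker statistics back onto the original rows. The following is only a sketch under that reading of the question: the speakers "A" and "B" and their values are made up, and the names `df`, `g`, and `z` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up readings for two hypothetical speakers (not the question's data).
a_values = [100.0] + [9.0, 10.0, 11.0] * 6 + [10.0]  # 100.0 is a planted outlier
b_values = [17.0, 18.0, 19.0] * 6 + [18.0, 18.0]
df = pd.DataFrame({
    "Speaker": ["A"] * 20 + ["B"] * 20,
    "ARI": a_values + b_values,
})

cols = ["ARI"]  # extend with "Flesch" and "Kincaid" on the real data
g = df.groupby("Speaker")[cols]
# transform returns one value per original row: that row's group statistic.
z = (df[cols] - g.transform("mean")) / g.transform("std")
# Replace only the values more than 3 group standard deviations from the group mean.
# Rows whose z is NaN (for example a speaker with a single entry) are left untouched.
df[cols] = df[cols].mask(z.abs() > 3)
print(df[df["ARI"].isna()])
```

One caveat: with n entries in a group, the largest possible distance from the group mean is (n - 1) / sqrt(n) sample standard deviations, so a speaker needs at least 11 entries before any value can exceed the 3 standard deviation threshold. Speakers with only one or two entries, as in the sample shown in the question, would never be flagged.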

+0

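Thanks! One question I have is whether the standard deviation is calculated across all "Speaker" types. Since an individual speaker has multiple entries in the DataFrame, I would like to compute the mean and standard deviation of ARI, Flesch, and Kincaid for each speaker, and then replace the outliers based on that particular speaker's standard deviation. Does that make sense? Thanks again! –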

+0

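Individual speakers do have multiple entries, and the method used is mean: `datasetGrouped = dataset.groupby(by='Speaker').mean()`. That is why the ARI, Flesch, and Kincaid values in the dataset grouped by Speaker are the means of each speaker's individual entries. – Alber8295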

+0

Great, thanks - I understand what is going on now! –