2017-04-11 55 views
0

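Data culling problem in Python

Question 1: I have several IDs, and for each ID I want to cut the Item vs. Value curve off at the minimum of Value. In other words, I want to filter the rows and keep only those up to and including the row with the minimum value.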

Question 2: Can I then extrapolate by fitting a curve to the truncated data in Python?

Please help me with a fast solution, since I have a large dataset; a numpy solution would be nice.

ID  Item Value 
30702556 40 1 
30702556 41 1 
30702556 42 1 
30702556 43 1 
30702556 44 1.000408 
30702556 45 1.006702067 
30702556 46 1 
30702556 47 1 
30702556 48 1 
30702556 49 1.000157628 
30702556 50 1.001172713 
30702556 51 1.009517935 
30702556 52 1 
30702556 53 1.000502562 
30702556 54 1.001030023 
30702556 55 1 
30702556 56 1.000444755 
30702556 57 1.000199956 
30702556 58 1 
30702556 59 1 
30702556 60 1.00032533 
30702556 61 0.996561721 
30702556 62 0.994058276 
30702556 63 0.994029863 
30702556 64 0.995741839 
30702556 65 0.996079035 
30702556 66 0.992283214 
30702556 67 0.992360022 
30702556 68 0.991403573 
30702556 69 0.989097475 
30702556 70 0.989217641 
30702556 71 0.988622481 
30702556 72 0.987000163 
30702556 73 0.984607074 
30702556 74 0.983260544 
30702556 75 0.983233331 
30702556 76 0.976835524 
30702556 77 0.976070994 
30702556 78 0.975937075 
30702556 79 0.968117537 
30702556 80 0.967753864 
30702556 81 0.963275228 
30702556 82 0.960392687 
30702556 83 0.953357783 
30702556 84 0.941583499 
30702556 85 0.937935151 
30702556 86 0.92811891 
30702556 87 0.924914786 
30702556 88 0.912813207 
30702556 89 0.892052451 
30702556 90 0.875778411 
30702556 91 0.876931504 
30702556 92 0.847877617 
30702556 93 0.834768706 
30702556 94 0.841510584 
30702556 95 0.798555032 
30702556 96 0.781663978 
30702556 97 0.731056793 
30702556 98 0.71332851 
30702556 99 0.808900212 
30702556 100 0.822300396 
30702556 101 0.920676291 
30702556 102 0.911704187 
30702556 103 1 
30702556 104 1 
30702556 105 1 
30702556 106 1 
30702556 107 1 
30702556 108 1 
30702556 109 1 
30702556 110 1 
30702556 111 1 
30702556 112 1 
30702556 113 1 
30702556 114 1 
30702556 115 1 
30702556 116 1 
30702556 117 1 
30702556 118 1 
30702556 119 1 
30703716 40 1 
30703716 41 1 
30703716 42 1 
30703716 43 1 
30703716 44 1.000408 
30703716 45 1.006702067 
30703716 46 1 
30703716 47 1 
30703716 48 1 
30703716 49 1.000157628 
30703716 50 1.001172713 
30703716 51 1.009517935 
30703716 52 1 
30703716 53 1.000502562 
30703716 54 1.001030023 
30703716 55 1 
30703716 56 1.000444755 
30703716 57 1.000199956 
30703716 58 1 
30703716 59 1 
30703716 60 1.00032533 
30703716 61 0.996561721 
30703716 62 0.994058276 
30703716 63 0.994029863 
30703716 64 0.995741839 
30703716 65 0.996079035 
30703716 66 0.992283214 
30703716 67 0.992360022 
30703716 68 0.991403573 
30703716 69 0.989097475 
30703716 70 0.989217641 
30703716 71 0.988622481 
30703716 72 0.987000163 
30703716 73 0.984607074 
30703716 74 0.983260544 
30703716 75 0.983233331 
30703716 76 0.976835524 
30703716 77 0.976070994 
30703716 78 0.975937075 
30703716 79 0.968117537 
30703716 80 0.967753864 
30703716 81 0.963275228 
30703716 82 0.960392687 
30703716 83 0.953357783 
30703716 84 0.941583499 
30703716 85 0.937935151 
30703716 86 0.92811891 
30703716 87 0.924914786 
30703716 88 0.912813207 
30703716 89 0.892052451 
30703716 90 0.875778411 
30703716 91 0.876931504 
30703716 92 0.847877617 
30703716 93 0.834768706 
30703716 94 0.841510584 
30703716 95 0.798555032 
30703716 96 0.781663978 
30703716 97 0.731056793 
30703716 98 0.71332851 
30703716 99 0.808900212 
30703716 100 0.822300396 
30703716 101 0.920676291 
30703716 102 0.911704187 
30703716 103 1 
30703716 104 1 
30703716 105 1 
30703716 106 1 
30703716 107 1 
30703716 108 1 
30703716 109 1 
30703716 110 1 
30703716 111 1 
30703716 112 1 
30703716 113 1 
30703716 114 1 
30703716 115 1 
30703716 116 1 
30703716 117 1 
30703716 118 1 
30703716 119 1 
+0

So, what is the expected output for the given sample? – Divakar

+0

The expected output should cut the data off after the row 30702556 98 0.71332851, and this has to be done for every ID. – BigDataScientist

Answers

2

Use .loc[:df.Value.idxmin()] within each group:

df.groupby('ID', group_keys=False).apply(lambda df: df.loc[:df.Value.idxmin()]) 

  ID Item  Value 
0 30702556 40 1.000000 
1 30702556 41 1.000000 
2 30702556 42 1.000000 
3 30702556 43 1.000000 
4 30702556 44 1.000408 
5 30702556 45 1.006702 
6 30702556 46 1.000000 
7 30702556 47 1.000000 
8 30702556 48 1.000000 
9 30702556 49 1.000158 
10 30702556 50 1.001173 
11 30702556 51 1.009518 
12 30702556 52 1.000000 
13 30702556 53 1.000503 
14 30702556 54 1.001030 
15 30702556 55 1.000000 
16 30702556 56 1.000445 
17 30702556 57 1.000200 
18 30702556 58 1.000000 
19 30702556 59 1.000000 
20 30702556 60 1.000325 
21 30702556 61 0.996562 
22 30702556 62 0.994058 
23 30702556 63 0.994030 
24 30702556 64 0.995742 
25 30702556 65 0.996079 
26 30702556 66 0.992283 
27 30702556 67 0.992360 
28 30702556 68 0.991404 
29 30702556 69 0.989097 
..  ... ...  ... 
109 30703716 69 0.989097 
110 30703716 70 0.989218 
111 30703716 71 0.988622 
112 30703716 72 0.987000 
113 30703716 73 0.984607 
114 30703716 74 0.983261 
115 30703716 75 0.983233 
116 30703716 76 0.976836 
117 30703716 77 0.976071 
118 30703716 78 0.975937 
119 30703716 79 0.968118 
120 30703716 80 0.967754 
121 30703716 81 0.963275 
122 30703716 82 0.960393 
123 30703716 83 0.953358 
124 30703716 84 0.941583 
125 30703716 85 0.937935 
126 30703716 86 0.928119 
127 30703716 87 0.924915 
128 30703716 88 0.912813 
129 30703716 89 0.892052 
130 30703716 90 0.875778 
131 30703716 91 0.876932 
132 30703716 92 0.847878 
133 30703716 93 0.834769 
134 30703716 94 0.841511 
135 30703716 95 0.798555 
136 30703716 96 0.781664 
137 30703716 97 0.731057 
138 30703716 98 0.713329 
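Since the question asks for a numpy solution for large data, here is a minimal vectorised sketch of the same truncation without a per-group Python call. It assumes the rows of each ID are contiguous and already ordered by Item, as in the sample, and that Value has no NaN. The function name `truncate_at_group_min` and the small demo frame are made up for illustration and are not from the original thread.

```python
import numpy as np
import pandas as pd

def truncate_at_group_min(df, key="ID", value="Value"):
    """Keep, per key, the rows up to and including the first minimum of `value`.

    Assumption: rows of each key are contiguous and in the desired order.
    """
    if len(df) == 0:
        return df
    ids = df[key].to_numpy()
    vals = df[value].to_numpy()
    pos = np.arange(len(ids))
    # Start offset of every contiguous block of equal IDs.
    starts = np.flatnonzero(np.r_[True, ids[1:] != ids[:-1]])
    # Block number of every row.
    sizes = np.diff(np.r_[starts, len(ids)])
    block = np.repeat(np.arange(len(starts)), sizes)
    # Minimum value of each block, then the first row position attaining it.
    block_min = np.minimum.reduceat(vals, starts)
    candidates = np.where(vals == block_min[block], pos, len(ids))
    first_min = np.minimum.reduceat(candidates, starts)
    # Keep rows at or before the first minimum of their block.
    return df[pos <= first_min[block]]

# Hypothetical toy data, not the sample from the question.
demo = pd.DataFrame({
    "ID":    [1, 1, 1, 1, 2, 2, 2],
    "Item":  [40, 41, 42, 43, 40, 41, 42],
    "Value": [3.0, 1.0, 2.0, 1.0, 5.0, 4.0, 6.0],
})
result = truncate_at_group_min(demo)
```

If the frame is not sorted, sorting by ID and Item first (for example with `df.sort_values(['ID', 'Item'])`) should make the contiguity assumption hold.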
1

IIUC:

df.loc[df.groupby('ID')['Value'].idxmin()] 
+0

That only gives the record with the minimum value – BigDataScientist

+1

Ah.. I see. Go with the top answer, then. –
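Question 2 (extrapolating from a fit) is not covered by the answers in this thread. One possible sketch, under assumptions, is to fit a low-degree polynomial to the rows kept for one ID with `np.polyfit` and evaluate it at later Item values with `np.polyval`. The helper name `extrapolate_poly` and the default degree of 2 are arbitrary choices for illustration; whether a polynomial is an appropriate model for this curve is not something the thread establishes.

```python
import numpy as np

def extrapolate_poly(items, values, new_items, degree=2):
    """Fit a polynomial to (items, values) and evaluate it at new_items.

    The degree is an assumption; pick it by inspecting the data.
    """
    items = np.asarray(items, dtype=float)
    values = np.asarray(values, dtype=float)
    new_items = np.asarray(new_items, dtype=float)
    # Centre x around its mean to keep the least-squares fit well conditioned.
    x0 = items.mean()
    coeffs = np.polyfit(items - x0, values, degree)
    return np.polyval(coeffs, new_items - x0)

# Hypothetical, exactly quadratic data standing in for one truncated ID.
x = np.arange(40, 99)
y = 1.0 - 1e-4 * (x - 40) ** 2
predicted = extrapolate_poly(x, y, [99, 100])
```

Extrapolation far beyond the fitted Item range can diverge quickly with polynomials, so it is worth checking predictions against any held-out rows before relying on them.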