2017-03-01
1

Pandas: unique condition with column string append

Consider a DataFrame like this:

coordinates      metric year 
[55.2274742137, 25.1560686018] met_1 2014 
[55.1554330879, 25.0986809174] met_2 2015 
[55.1554330879, 25.0986809174] met_2 2016 
[55.14353879, 25.44] met_221212 2020 
[55.11239959, 25.3232] met_2132 2022 

Desired result:

coordinates      metric year 
[55.2274742137, 25.1560686018] met_1 2014 
[55.1554330879, 25.0986809174] met_2 [2015,2016] 
[55.14353879, 25.44] met_221212 2020 
[55.11239959, 25.3232] met_2132 2022 

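I want to find the records that are duplicated on the coordinates and metric columns. Where duplicates exist, I want to collect their year values into a list and use that list as the new year value. Then I want to drop the duplicate rows.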

Answers

1

You need groupby with apply.

However, grouping by a column that contains lists raises:

TypeError: unhashable type: 'list'

The solution is to convert the lists to tuples, which are hashable.
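The hashability issue can be reproduced without pandas. The following is a minimal sketch in plain Python; the `groups` dict is only a stand-in for the key lookup that grouping relies on, not how pandas is implemented internally.

```python
coords = [55.1554330879, 25.0986809174]

# A list is mutable, so Python refuses to hash it.
try:
    hash(coords)
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# The equivalent tuple is immutable and hashable, so it can act as a key.
key = tuple(coords)

# Illustrative stand-in for grouping: collect years per (coordinates, metric).
groups = {}
groups.setdefault((key, "met_2"), []).append(2015)
groups.setdefault((key, "met_2"), []).append(2016)

print(list_is_hashable)
print(groups)
```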

Another issue is that you want a list only when a group has more than one value, so the lambda needs a conditional expression:

df.coordinates = df.coordinates.apply(tuple) 
df = (df.groupby(['coordinates','metric'], sort=False)['year']
        .apply(lambda x: list(x) if len(x) > 1 else x.iloc[0]))
df = df.reset_index() 
df.coordinates = df.coordinates.apply(list) 
print (df) 
         coordinates  metric   year 
0 [55.2274742137, 25.1560686018]  met_1   2014 
1 [55.1554330879, 25.0986809174]  met_2 [2015, 2016] 
2   [55.14353879, 25.44] met_221212   2020 
3   [55.11239959, 25.3232] met_2132   2022 
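For a run that can be pasted as-is, here is a sketch that rebuilds the question's data and follows the same steps. It is a variation, not the exact code above: it aggregates every group to a list first and unwraps single-element lists afterwards, and the name `out` is chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "coordinates": [[55.2274742137, 25.1560686018],
                    [55.1554330879, 25.0986809174],
                    [55.1554330879, 25.0986809174],
                    [55.14353879, 25.44],
                    [55.11239959, 25.3232]],
    "metric": ["met_1", "met_2", "met_2", "met_221212", "met_2132"],
    "year": [2014, 2015, 2016, 2020, 2022],
})

# Lists are unhashable, so convert them to tuples before grouping.
df["coordinates"] = df["coordinates"].apply(tuple)

# Collect the years of each (coordinates, metric) group into a list.
out = (df.groupby(["coordinates", "metric"], sort=False)["year"]
         .agg(list)
         .reset_index())

# Keep a list only where a group had more than one year.
out["year"] = out["year"].apply(lambda ys: ys if len(ys) > 1 else ys[0])

# Restore the original list representation of the coordinates.
out["coordinates"] = out["coordinates"].apply(list)
print(out)
```

Aggregating to lists first keeps the groupby step uniform; the mixed scalar/list column is produced only in the final, explicit step.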

If it is acceptable for every value in the output column to be a list:

df.coordinates = df.coordinates.apply(tuple) 
df = df.groupby(['coordinates','metric'], sort=False)['year'].apply(list) 
df = df.reset_index() 
df.coordinates = df.coordinates.apply(list) 
print (df) 
         coordinates  metric   year 
0 [55.2274742137, 25.1560686018]  met_1  [2014] 
1 [55.1554330879, 25.0986809174]  met_2 [2015, 2016] 
2   [55.14353879, 25.44] met_221212  [2020] 
3   [55.11239959, 25.3232] met_2132  [2022] 

If you need strings as output:

df.coordinates = df.coordinates.apply(tuple) 
df = (df.groupby(['coordinates','metric'], sort=False)['year']
        .apply(lambda x: ','.join(x.astype(str))))
df = df.reset_index() 
df.coordinates = df.coordinates.apply(list) 
print (df) 
         coordinates  metric  year 
0 [55.2274742137, 25.1560686018]  met_1  2014 
1 [55.1554330879, 25.0986809174]  met_2 2015,2016 
2   [55.14353879, 25.44] met_221212  2020 
3   [55.11239959, 25.3232] met_2132  2022 
0

You can use groupby to help here:

import pandas as pd

# dummy data
df = pd.DataFrame([[[55.2274742137, 25.1560686018], "met_1", 2014], 
        [[55.1554330879, 25.0986809174], "met_2", 2015], 
        [[55.1554330879, 25.0986809174], "met_2", 2015]], 
        columns=["coordinates", "metric", "year"]) 

print(df) 
    coordinates      metric year 
0 [55.2274742137, 25.1560686018] met_1 2014 
1 [55.1554330879, 25.0986809174] met_2 2015 
2 [55.1554330879, 25.0986809174] met_2 2015 

# define apply function 
def aggregate(sub_df):
    years = sub_df["year"].values
    if len(years) > 1:
        return years
    else:
        return years[0]

# groupby needs hashable items, that's why we convert to tuple before 
df["coordinates"] = df["coordinates"].apply(tuple) 

# groupby and apply aggregator 
print(df.groupby(["coordinates", "metric"]).apply(aggregate)) 

coordinates      metric 
(55.1554330879, 25.0986809174) met_2  [2015, 2015] 
(55.2274742137, 25.1560686018) met_1   2014