2015-07-12 80 views
-1

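I want to find duplicate values in the "reference" column and, for each of them, keep only the row with the largest value in the "amount" column. How can I merge rows that share a duplicate value and keep the one holding the maximum of a different column in Python?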

Current:

+-----------+--------+---------+---------+
| reference | amount | column3 | column4 |
+-----------+--------+---------+---------+
| test1     |      9 |      45 | ye      |
| test1     |    200 |      45 | agag    |
| test1     |      1 |      45 | aaa     |
| test2     |     99 |      45 | bbab    |
| test1     |     11 |      45 | value   |
+-----------+--------+---------+---------+

Expected:

+-----------+--------+---------+---------+
| reference | amount | column3 | column4 |
+-----------+--------+---------+---------+
| test1     |    200 |      45 | agag    |
| test2     |     99 |      45 | bbab    |
+-----------+--------+---------+---------+

Please share any pointers for this case.

+2

What is your data format, and what have you tried so far? – Kasramvd

+0

Please tell us which data type you are using. You can basically use group by and find the maximum value in each group. – vdkotian

+0

It is a CSV file. I have been trying to find the duplicate rows. I'll keep digging. – serte
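
The group-by idea suggested in the comments can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: it assumes the CSV has the header row shown in the question, that "amount" always parses as an integer, and the names `SAMPLE` and `max_row_per_reference` are made up for illustration.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Sample data from the question, inlined so the sketch is self-contained.
SAMPLE = """reference,amount,column3,column4
test1,9,45,ye
test1,200,45,agag
test1,1,45,aaa
test2,99,45,bbab
test1,11,45,value
"""


def max_row_per_reference(rows):
    """Return one row per reference: the row with the largest amount."""
    by_reference = itemgetter("reference")
    # groupby only groups adjacent items, so sort by the key first.
    ordered = sorted(rows, key=by_reference)
    return [
        max(group, key=lambda row: int(row["amount"]))
        for _, group in groupby(ordered, key=by_reference)
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = max_row_per_reference(rows)
for row in result:
    print(row["reference"], row["amount"], row["column3"], row["column4"])
```

Sorting makes this O(n log n), and the result comes back ordered by reference rather than in file order. If several rows tie for the maximum, `max` keeps the first one it sees.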

Answers

0

Something like the following would be a good start:

import collections
import csv

with open("mydata.csv", newline="") as f_input:
    csv_input = csv.reader(f_input)
    # Assuming the first row contains the heading names, otherwise remove.
    headings = next(csv_input)
    # Maps each reference to the row with the largest amount seen so far,
    # in the order the references first appear in the file.
    d_max_rows = collections.OrderedDict()

    for cols in csv_input:
        reference = cols[0]
        if reference in d_max_rows:
            cur_max = d_max_rows[reference]
            # On a tie, the later row replaces the earlier one.
            if int(cols[1]) >= int(cur_max[1]):
                d_max_rows[reference] = cols
        else:
            d_max_rows[reference] = cols

lrows = [headings] + list(d_max_rows.values())

for reference, amount, col3, col4 in lrows:
    print("%-15s %-10s %-10s %-10s" % (reference, amount, col3, col4))

This gives you the following output:

reference       amount     column3    column4
test1           200        45         agag
test2           99         45         bbab
+0

@Martin Evans It works. Thank you. – serte

+0

That's good news. Don't forget to upvote any useful replies and to accept your preferred answer. –

0

Here is some code that does what you want:

from collections import namedtuple
import csv

Record = namedtuple('Record', 'reference amount column3 column4')

no_dups = {}
with open('references.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    # Assumes the first row is a header; skip it so int() never sees 'amount'.
    header = next(reader)
    for rec in map(Record._make, reader):
        # Keep the record with the largest amount for each reference.
        if (rec.reference not in no_dups or
                int(no_dups[rec.reference].amount) < int(rec.amount)):
            no_dups[rec.reference] = rec

with open('references_out.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)
    writer.writerows(no_dups.values())
0

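pandas is a great Python module for working with tabular data. It is a lot like the R language and provides a kind of in-memory database. For your example it is as simple as this: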

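import pandas as pd

df = pd.read_csv('test.csv')
# Largest amount per reference, with 'reference' kept as a regular column.
a = df.groupby('reference', as_index=False)['amount'].max()
# Merge on both keys so a row only matches the maximum of its own reference.
answer = df.merge(a, on=['reference', 'amount'])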

And to save the result back to CSV:

answer.to_csv('out.csv', index=False) 

This assumes test.csv is your data file and looks like this:

reference,amount,column3,column4 
test1,9,45,ye 
test1,200,45,agag 
test1,1,45,aaa 
test2,99,45,bbab 
test1,11,45,value 
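
As a possible alternative to the merge step, the same result can be sketched with `idxmax`, which returns the index label of the row holding the maximum in each group. This is only a sketch under stated assumptions: the sample data is inlined through `io.StringIO` instead of being read from test.csv, and `CSV_TEXT` is a name made up for illustration.

```python
import io

import pandas as pd

# Same rows as test.csv, inlined so the sketch runs without a file on disk.
CSV_TEXT = """reference,amount,column3,column4
test1,9,45,ye
test1,200,45,agag
test1,1,45,aaa
test2,99,45,bbab
test1,11,45,value
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# For each reference, idxmax gives the index label of the first row
# that holds the largest amount; .loc then selects those whole rows.
answer = df.loc[df.groupby("reference")["amount"].idxmax()]

print(answer.to_csv(index=False))
```

Unlike a merge, this keeps exactly one row per reference even when several rows tie for the maximum amount.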