2015-05-19 61 views
-4

Sorting duplicates in a 33,000-line file

I need to find duplicates in my txt file. The file looks like this:

3,3090,21,f,2,3 
4,231,22,m,2,3 
5,9427,13,f,2,2 
6,9942,7,m,2,3 
7,6802,33,f,3,2 
8,8579,11,f,2,4 
9,8598,11,f,2,4 
10,16729,23,m,1,1 
11,8472,11,f,3,4 
12,10976,21,f,3,3 
13,2870,21,f,2,3 
14,12032,10,f,3,4 
15,16999,13,m,2,2 
16,570,7,f,2,3 
17,8485,11,f,2,4 
18,8728,11,f,3,4 
19,20861,9,f,2,2 
20,19771,34,f,2,2 
21,17964,10,f,2,2 

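There are ~30,000 lines like this. Now I need to find the duplicates in the second column and save the rows to new files without any duplicates. My code is: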

def dedupe(data):
    d = []
    for l in lines:
        if l[0] in d:
            d[l[0]] += l[:1]
        else:
            d[l[0]] = l[1]
    return d

#m - male 
#f - female 

data = open('plec.txt', 'r') 
save_m = open('plec_m.txt', 'w') 
save_f = open('plec_f.txt', 'w') 

lines = data.readlines()[1:] 

for line in lines:
    gender = line.strip().split(',')[3]
    if gender is 'f':
        dedupe(line)
        save_f.write(line)
    elif gender is 'm':
        dedupe(line)
        save_m.write(line)

But I get this error:

Traceback (most recent call last):
  File "plec.py", line 88, in <module>
    dedupe(line)
  File "plec.py", line 75, in dedupe
    d[l[0]] = l[1]
TypeError: list indices must be integers, not str
+5

Does this work? If not, what problem did you run into? – SuperBiasedMan

+0

Finding the duplicates doesn't work. Saving the female file and the male file is fine, but the duplicate-finding algorithm doesn't work. – Fempter

+1

What result do you get? "Not working" could mean the file is blank, or it contains every line, or the script crashes. – SuperBiasedMan

Answers

1
seen = set()  # second-column values already written
for row in my_filehandle:
    my_2nd_col = row.split(",")[1]
    if my_2nd_col in seen:
        continue  # duplicate: skip this row
    output_filehandle.write(row)
    seen.add(my_2nd_col)

Like so.
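
A slightly more complete sketch of the same idea, for illustration only. It assumes the second column is the deduplication key and that the first occurrence of each key should be kept. The helper name `unique_by_column` and the sample rows are hypothetical, not taken from the question's real file.

```python
def unique_by_column(rows, col=1, sep=","):
    """Yield each row whose value in column `col` has not been seen before."""
    seen = set()
    for row in rows:
        fields = row.strip().split(sep)
        if len(fields) <= col:
            continue  # skip blank or malformed lines
        key = fields[col]
        if key in seen:
            continue  # duplicate key: drop this row
        seen.add(key)
        yield row


sample = [
    "3,3090,21,f,2,3\n",
    "4,231,22,m,2,3\n",
    "5,3090,13,f,2,2\n",  # made-up duplicate of 3090 for the demo
]
unique_rows = list(unique_by_column(sample))
```

With real files, `rows` can be the open input handle, and each yielded row can be written straight to the output handle.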

0

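That is a very verbose approach, OP. I'm not sure what is wrong with your code, but this solution should meet your requirements, assuming they are:

  • Filter the file based on the second column

  • Store male and female entries in separate files

Here is the code: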

    with open('plec.txt') as file:
        # split the file into lines (skipping blank ones) and the lines by comma
        lines = [line.split(',') for line in file.read().split('\n') if line.strip()]

    filtered_lines_male = []
    filtered_lines_female = []
    second_column_set = set()
    for line in lines:
        if line[1] not in second_column_set:
            second_column_set.add(line[1])  # remember this second-column value
            if line[3] == 'm':
                filtered_lines_male.append(line)  # add to male list
            else:
                filtered_lines_female.append(line)  # add to female list

    filtered_lines_male = '\n'.join(','.join(line) for line in filtered_lines_male)  # apply source formatting
    filtered_lines_female = '\n'.join(','.join(line) for line in filtered_lines_female)  # apply source formatting

    with open('plec_m.txt', 'w') as male_write_file:
        male_write_file.write(filtered_lines_male)  # write male entries

    with open('plec_f.txt', 'w') as female_write_file:
        female_write_file.write(filtered_lines_female)  # write female entries
    

    Please use better variable names in the code you write, and please be more specific about your problem next time.
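
As for the traceback in the question: `d = []` creates a list, and `d[l[0]]` then indexes that list with a string, which is what raises `TypeError: list indices must be integers` (Python 3 words the message slightly differently). A dict, or a set if only membership matters, accepts string keys. The sketch below contrasts the two; the names `rows` and `first_row_by_key` and the sample data are hypothetical.

```python
# Sample rows; the last one is a made-up duplicate of key 3090.
rows = ["3,3090,21,f,2,3", "4,231,22,m,2,3", "5,3090,13,f,2,2"]

d = []
try:
    d["3090"] = rows[0]  # a list cannot be indexed by a string
except TypeError as exc:
    error_message = str(exc)

first_row_by_key = {}  # a dict maps string keys to values
for row in rows:
    key = row.split(",")[1]
    # setdefault stores the row only if the key is not present yet
    first_row_by_key.setdefault(key, row)
```

After the loop, `first_row_by_key` holds one row per distinct second-column value, namely the first one encountered.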

  • 2

    You can use Pandas to read your input file and drop duplicates based on any column you want.

    from io import StringIO  # Python 3; in Python 2 this lived in the StringIO module

    import pandas as pd

    # In-memory stand-in for the input file, with a header row added
    data = StringIO("""col1,col2,col3,col4,col5,col6
    3,3090,21,f,2,3
    4,231,22,m,2,3
    5,9427,13,f,2,2
    6,9942,7,m,2,3
    7,6802,33,f,3,2
    8,8579,11,f,2,4
    9,8598,11,f,2,4
    10,16729,23,m,1,1
    11,8472,11,f,3,4
    12,10976,21,f,3,3
    13,2870,21,f,2,3
    14,12032,10,f,3,4
    15,16999,13,m,2,2
    16,570,7,f,2,3
    17,8485,11,f,2,4
    18,8728,11,f,3,4
    19,20861,9,f,2,2
    20,19771,34,f,2,2
    21,17964,10,f,2,2""")

    df = pd.read_csv(data)  # DataFrame.from_csv was removed in pandas 1.0
    df = df.drop_duplicates(subset='col2')  # returns a new frame, so reassign it
    df.to_csv("no_dups.txt", index=False)