2009-06-23

Python list question: I have a problem that I could use some help with. I have a Python list that looks like this:

fail = [ 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'], 
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'], 
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'], 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'], 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'], 
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'], 
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt'], 
]

sha1 value, directory, filename 

What I want is to separate this content into two different lists, based on the sha1 value and the directory. For example:

['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt'] 
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'] 

These I would like to add to a list, duplicate = [], because they have the same sha1 value within the same directory (and only that directory). The other entries I would like to add to another list, say diff = [], because the sha1 values are the same but the directories differ.

I'm a bit lost on the logic here, so any help I can get is appreciated!

Edit: fixed a typo; in some cases the last value (the filename) was a one-element list, which was 100% incorrect. Thanks to SilentGhost for noticing the problem.


Try to explain a little more clearly what you are trying to do. – 2009-06-23 18:04:15


It is completely unclear what you are trying to do. What is the full expected output? – 2009-06-23 18:05:25


Love the filenames, haha. – Skurmedel 2009-06-23 18:16:27

Answers

3
duplicate = [] 
# Sort the list so we can compare adjacent values 
fail.sort() 
#if you didn't want to modify the list in place you can use: 
#sortedFail = sorted(fail) 
#  and then use sortedFail in the rest of the code instead of fail 
for i, x in enumerate(fail): 
    if i+1 == len(fail): 
        # end of the list 
        break 
    if x[:2] == fail[i+1][:2]: 
        if x not in duplicate: 
            duplicate.append(x) 
        if fail[i+1] not in duplicate: 
            duplicate.append(fail[i+1]) 
# diff is just anything not in duplicate as far as I can tell from the explanation 
diff = [d for d in fail if d not in duplicate] 

With your example input:

duplicate: [ 
       ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', ['apa.txt']], 
       ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'] 
      ] 

diff: [ 
      ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', ['apa2.txt']], 
      ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'], 
      ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', ['svin.txt']], 
      ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'], 
      ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'] 
     ] 

So maybe I missed something, but I think this is what you were asking for.
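
Since the approach is sort-then-compare-adjacent, itertools.groupby can express the same idea; a minimal sketch, assuming fail holds plain string filenames as in the corrected question data:

from itertools import groupby 

duplicate, diff = [], [] 
# group the sorted entries on their (sha1, directory) prefix; any group 
# with more than one entry is a set of duplicates, the rest go to diff 
for key, group in groupby(sorted(fail), key=lambda entry: entry[:2]): 
    entries = list(group) 
    if len(entries) > 1: 
        duplicate.extend(entries) 
    else: 
        diff.extend(entries) 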

1

You could simply loop over all of the values, then use an inner loop to compare the directories; if the directories are the same, compare the sha1 values and then assign to the lists. That gives you a decent n^2 algorithm to sort it out.

Maybe something like this untested code:

>>> for i in range(len(fail)-1): 
...     dir = fail[i][1] 
...     sha1 = fail[i][0] 
...     for j in range(i+1, len(fail)): 
...         if dir == fail[j][1]:  # is this how you compare strings? 
...             if sha1 == fail[j][0]: 
...                 pass  # remove from fail and add to duplicate and add other to diff 

Again, the code is untested.
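
Filling in the bookkeeping that the final comment alludes to, a hedged completion might look like this (the duplicate_idx set of matching row indices is my own addition, not something from the original answer):

duplicate_idx = set()  # indices of rows that match another row (my addition) 
for i in range(len(fail) - 1): 
    sha1, directory = fail[i][0], fail[i][1] 
    for j in range(i + 1, len(fail)): 
        # same directory and same sha1 value -> both rows are duplicates 
        if directory == fail[j][1] and sha1 == fail[j][0]: 
            duplicate_idx.add(i) 
            duplicate_idx.add(j) 

duplicate = [fail[i] for i in sorted(duplicate_idx)] 
diff = [fail[i] for i in range(len(fail)) if i not in duplicate_idx] 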

0

Here is another approach that uses a dictionary to group by sha and directory. It also gets rid of the stray lists wrapped around some of the filenames.

new_fail = {}  # {sha: {dir: [filenames]}} 
for item in fail: 
    # split the data into its parts 
    sha, directory, filename = item 

    # make sure the correct elements exist in the data structure 
    if sha not in new_fail: 
        new_fail[sha] = {} 
    if directory not in new_fail[sha]: 
        new_fail[sha][directory] = [] 

    # this is where the stray lists are removed from the file names 
    if isinstance(filename, list): 
        filename = filename[0] 

    new_fail[sha][directory].append(filename) 

diff = [] 
dup = [] 

# loop through the data, analyzing it 
for sha, val in new_fail.iteritems(): 
    for directory, filenames in val.iteritems(): 

        # check to see if the sha/dir combo has more than one file name 
        if len(filenames) > 1: 
            for filename in filenames: 
                dup.append([sha, directory, filename]) 
        else: 
            diff.append([sha, directory, filenames[0]]) 

To print it:

print 'diff:' 
for i in diff: 
    print i 
print '\ndup:' 
for i in dup: 
    print i 

With the sample data, the output looks like this:

 
diff: 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'] 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'] 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'] 
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'] 
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'] 

dup: 
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'] 
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt']
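
As a side note, the nested-dictionary build above could also be done with collections.defaultdict instead of the explicit membership checks; a minimal sketch of just the build step, behaviour assumed to be the same:

from collections import defaultdict 

new_fail = defaultdict(lambda: defaultdict(list))  # {sha: {dir: [filenames]}} 
for sha, directory, filename in fail: 
    # unwrap the stray one-element filename lists, as the original loop does 
    if isinstance(filename, list): 
        filename = filename[0] 
    new_fail[sha][directory].append(filename) 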
1

In the code sample below I use a key based on the SHA-1 value and the directory name to detect unique and duplicate entries, with spare dictionaries for housekeeping.

# Test dataset 
fail = [ 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'], 
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'], 
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'], 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'], 
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', ['svin.txt']], 
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', ['apa2.txt']], 
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', ['apa.txt']], 
] 


def sort_duplicates(filelist): 
    """Returns a tuple whose first element is a list of unique files, 
    and second element is a list of duplicate files. 
    """ 
    diff = [] 
    diff_d = {} 

    duplicate = [] 
    duplicate_d = {} 

    for entry in filelist: 

        # Make an immutable key based on the SHA-1 and directory strings 
        key = (entry[0], entry[1]) 

        # If this entry is a known duplicate, add it to the duplicate list 
        if key in duplicate_d: 
            duplicate.append(entry) 

        # If this entry is a new duplicate, add it to the duplicate list 
        elif key in diff_d: 
            duplicate.append(entry) 
            duplicate_d[key] = entry 

            # And relocate the matching entry to the duplicate list 
            matching_entry = diff_d[key] 
            duplicate.append(matching_entry) 
            duplicate_d[key] = matching_entry 
            del diff_d[key] 
            diff.remove(matching_entry) 

        # Otherwise add this entry to the different list 
        else: 
            diff.append(entry) 
            diff_d[key] = entry 

    return (diff, duplicate) 

def test(): 
    global fail 
    diff, dups = sort_duplicates(fail) 
    print "Diff:", diff 
    print "Dups:", dups 

test() 
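
Since the (sha1, directory) pair is the whole classification key, the relocation bookkeeping could also be avoided with a two-pass variant that counts keys first and then partitions the entries. A sketch under that assumption; sort_duplicates_twopass is a hypothetical name, not from the original answer:

def sort_duplicates_twopass(filelist): 
    """Count each (sha1, directory) key, then split the entries on that count.""" 
    counts = {} 
    for entry in filelist: 
        key = (entry[0], entry[1]) 
        counts[key] = counts.get(key, 0) + 1 
    diff, duplicate = [], [] 
    for entry in filelist: 
        # more than one entry sharing this key means the entry is a duplicate 
        if counts[(entry[0], entry[1])] > 1: 
            duplicate.append(entry) 
        else: 
            diff.append(entry) 
    return (diff, duplicate) 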
0

I believe the accepted answer will be slightly more efficient (Python's internal sort should be faster than my walk through the dictionaries), but since I already came up with this I might as well post it. :-)

This technique uses a multi-level dictionary to avoid both sorting and explicit comparison.

hashes = {} 
diff = [] 
dupe = [] 

# build the dictionary 
for sha, path, files in fail: 
    try: 
        hashes[sha][path].append(files) 
    except KeyError: 
        try: 
            hashes[sha][path] = [files] 
        except KeyError: 
            hashes[sha] = {path: [files]} 

for sha, paths in hashes.iteritems(): 
    if len(paths) > 1: 
        for path, files in paths.iteritems(): 
            for file in files: 
                diff.append([sha, path, file]) 
    for path, files in paths.iteritems(): 
        if len(files) > 1: 
            for file in files: 
                dupe.append([sha, path, file]) 

The result will be:

diff = [ 
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'], 
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'], 
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', ['svin.txt']], 
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', ['apa2.txt']], 
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'] 
] 
dupe = [ 
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'], 
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', ['apa.txt']] 
]
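
As an aside, the nested try/except that builds the dictionary could also be written with dict.setdefault; a minimal sketch of just the build step, same structure assumed:

hashes = {} 
for sha, path, files in fail: 
    # setdefault creates the inner dict / list the first time a key is seen 
    hashes.setdefault(sha, {}).setdefault(path, []).append(files) 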