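Incrementing a counter as a dictionary value in a loop

I have a list named aa_seq containing a few hundred amino acid sequences. It looks like this: ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK', ...]. Each sequence is 27 letters long. I have to determine the most common amino acid at each position (1-27) and its frequency.

So far I have: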

count_dict = {}
counter = count_dict.values()
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',  # one-letter code for amino acids
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
for p in range(0,26):      # first round: looks at the first position in each sequence
    for s in range(0,len(aa_seq)):   # goes through all sequences of the list
        for item in aa_list:    # and checks for the occurrence of each amino acid letter (=item)
            if item in aa_seq[s][p]:
                count_dict[item]   # if that letter occurs at the respective position, make it a key in the dictionary
                counter += 1    # and increase its counter (the value, as defined above) by one
print count_dict

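It raises KeyError: 'A', pointing to the line count_dict[item]. So the items of aa_list apparently can't be added as keys this way..? How do I do that? It also gives an error, "'int' object is not iterable", regarding the counter. How can I increase the counter?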

2017-04-09 ccaarroo

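What are you trying to do with 'count_dict[item]'? Even if 'item' exists in the dictionary, that just looks up the value and immediately discards it; you aren't assigning anything there. –

Also, 'counter' is defined as the list of count_dict's values at the start; it is an empty list, because count_dict is empty. So 'counter += 1' makes no sense, because you can't add an integer to a list. –

Unlike languages like C++, where simply referencing a dictionary (map) entry initializes it, in Python you need to initialize dictionary entries explicitly. – Unlocked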
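
The two errors discussed in these comments can be reproduced in isolation. This is a minimal sketch in Python 3 syntax (the question's code is Python 2); the empty list `counter` stands in for what `count_dict.values()` returns in Python 2, and the names `key_error` and `type_error` exist only for illustration:

```python
count_dict = {}

# Merely referencing a missing key does not create it; it raises KeyError.
try:
    count_dict['A']
    key_error = None
except KeyError as exc:
    key_error = exc

# Stand-in for Python 2's count_dict.values(), which is a (here empty) list.
counter = []
try:
    counter += 1  # list += expects an iterable, not an int
    type_error = None
except TypeError as exc:
    type_error = exc

print(type(key_error).__name__)  # KeyError
print(type_error)                # 'int' object is not iterable
```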

To add an item to the dictionary, you have to initialize it with a value:

if item not in count_dict: 
    count_dict[item]=0

You can use the setdefault function to do this as a one-liner:

count_dict.setdefault(item,0)
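
Combined with the increment, this could look like the sketch below. The list `column` is a made-up set of letters at a single position, used only for illustration:

```python
count_dict = {}
column = ['A', 'K', 'R', 'T', 'A']  # hypothetical letters at one position

for item in column:
    count_dict.setdefault(item, 0)  # create the key with 0 only if it is missing
    count_dict[item] += 1           # the increment can no longer raise KeyError

print(count_dict)  # {'A': 2, 'K': 1, 'R': 1, 'T': 1} on Python 3.7+
```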

2017-04-09 21:13:38 WNG

Here is how to quickly count items in a dictionary; just add it to whatever code you have written:

count_dict = {} 

aa_list = ['A', 'C', 'D', 'E' ,'F' ,'G' ,'H' ,'I' ,'K' ,'L' , 
     'M' ,'N' ,'P' ,'Q' ,'R' ,'S' ,'T' ,'V' ,'W' ,'Y'] 

for element in aa_list: 
    count_dict[element]=(count_dict).get(element,0)+1 

print (count_dict)
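
As written, this snippet only counts the letters of aa_list itself, so every value ends up as 1. Applied to the question, the same get pattern might look like the following sketch; the position `p = 7` and the loop over the four example sequences are assumptions made for illustration:

```python
aa_seq = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
          'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']

p = 7  # zero-based position, chosen for illustration
count_dict = {}
for seq in aa_seq:
    letter = seq[p]
    # get() returns 0 for a letter not seen yet, so no KeyError is possible
    count_dict[letter] = count_dict.get(letter, 0) + 1

print(count_dict)  # {'M': 4}
```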

2017-04-09 21:18:00 citizen2077

You can use the Counter class:

>>> from collections import Counter 

>>> l = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK'] 
>>> s = [Counter([l[j][i] for j in range(len(l))]).most_common()[0] for i in range(27)] 
>>> s 
[('A', 1), 
('A', 1), 
('Y', 1), 
('I', 1), 
('N', 1), 
('Y', 1), 
('P', 2), 
('M', 4), 
('S', 2), 
('Q', 1), 
('E', 2), 
('Q', 1), 
('I', 1), 
('I', 1), 
('A', 1), 
('Q', 1), 
('A', 1), 
('I', 1), 
('I', 1), 
('Q', 1), 
('E', 2), 
('C', 1), 
('Q', 1), 
('A', 1), 
('Q', 1), 
('I', 1), 
('I', 1)]

But this might be quite inefficient if you have a large data set.

2017-04-09 21:22:09 greole

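Ah, that's cool, I can try that. But what does 'most_common()[0]' do, since the output just gives the counts of all the letters..? – ccaarroo

@ccaarroo: The list is the desired information. The first tuple is the most common character at index 0 of the sequences, which occurs 1 time. For example, you can see that 'M' at index 7 occurs 4 times. –

'most_common([n])' returns a list of the n most common elements, or all of them if n is omitted. So 'most_common()[0]' gives the single most common element at position i. – greole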
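
A small sketch of that behaviour, using the letters found at index 6 of the four example sequences (P, G, L, P). The name `column` is only for illustration, and the order of the tied entries assumes Python 3.7 or later, where ties keep first-seen order:

```python
from collections import Counter

column = Counter('PGLP')  # letters at index 6 of the four example sequences

print(column.most_common())     # [('P', 2), ('G', 1), ('L', 1)]
print(column.most_common(1))    # [('P', 2)]
print(column.most_common()[0])  # ('P', 2)
```

Both `most_common(1)[0]` and `most_common()[0]` give the same tuple; the first avoids sorting every entry.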

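Modified code

Here is a modified, working version of your code. It isn't efficient, but it should output the correct result.

A few notes:

You need one counter per index, so you should initialize your dictionary inside the first loop.
range(0,26) has only 26 elements: from 0 to 25 (inclusive).
defaultdict lets you define 0 as the starting value for every key.
You need to increment the counter: count_dict[item] += 1
At the end of each loop, you need to find the key (amino acid) with the highest value (number of occurrences).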

from collections import defaultdict

aa_seq = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
          'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',  # one-letter code for amino acids
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

for p in range(27):  # first round: looks at the first position in each sequence
    count_dict = defaultdict(int)  # initialize counter with 0 as default value
    for s in range(0, len(aa_seq)):  # goes through all sequences of the list
        # and checks for the occurrence of each amino acid letter (=item)
        for item in aa_list:
            if item in aa_seq[s][p]:
                # if that letter occurs at the respective position, make it a
                # key in the dictionary
                count_dict[item] += 1
    print(max(count_dict.items(), key=lambda x: x[1]))

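It outputs (when several letters tie for the highest count, which one is reported can depend on the Python version):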

('R', 1) 
('S', 1) 
('Y', 1) 
('S', 1) 
('E', 1) 
('P', 1) 
('P', 2) 
('M', 4) 
...

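With Counter

Alternatively, you don't need that many loops; you only need to iterate once over each character of each sequence.

Also, there is no need to reinvent the wheel: Counter and most_common are better alternatives to defaultdict and max.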

from collections import Counter 

aa_seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK'] 

counters = [Counter() for i in range(27)] 

for aa_seq in aa_seqs: 
    for (i, aa) in enumerate(aa_seq): 
        counters[i][aa] += 1

most_commons = [counter.most_common()[0] for counter in counters] 
print(most_commons)

It outputs:

[('K', 1), ('A', 1), ('Y', 1), ('N', 1), ('N', 1), ('Y', 1), ('P', 2), ('M', 4), ('S', 2), ('Q', 1), ('E', 2), ('G', 1), ('H', 1), ('N', 1), ('L', 1), ('N', 1), ('N', 1), ('I', 1), ('G', 1), ('H', 1), ('E', 2), ('G', 1), ('N', 1), ('K', 1), ('Y', 1), ('K', 1), ('G', 1)]
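
A more compact variant, not part of the answer above, transposes the sequences with zip so that each column can be fed straight into a Counter. This is a sketch that assumes all sequences have the same length; zip silently stops at the shortest one otherwise:

```python
from collections import Counter

aa_seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
           'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']

# zip(*aa_seqs) yields one tuple per position, holding that position's
# letter from every sequence.
most_commons = [Counter(column).most_common(1)[0] for column in zip(*aa_seqs)]

print(most_commons[6:9])  # [('P', 2), ('M', 4), ('S', 2)]
```

Only positions without ties are printed here, since the letter reported for a tie can vary with the Python version.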

2017-04-09 21:31:12
