2011-12-12 220 views
1

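Parsing results with Python

I am a Python beginner (I am a biologist). I have a file containing the results of a particular piece of software, and I would like to parse those results with Python. From the output below, I only want to extract the scores, and I would like to split each sequence into its individual amino acids.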

no. score sequence

1 0.273778 FFHH-YYFLHRRRKKCCNNN-CCCK---HQQ---HHKKHV-FGGGE-EDDEDEEEEEEEE-EE-- 
2 0.394647 IIVVIVVVVIVVVVVVVVVV-CCCVA-IVVI--LIIIIIIIIYYYA-AVVVVVVVAAAAV-AST- 
3 0.456667  FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA- 
4 0.407581 MMLMILLLLMVVAIILLIII-LLLIVLLAVVVVVAAAVAAVAIIII-ILIIIIIILVIMKKMLA- 
5 0.331761 AANSRQSNAAQRRQCSNNNR-RALERGGMFFRRKQNNQKQKKHHHY-FYFYYSNNWWFFFFFFR- 
6 0.452381 EEEEDEEEEEEEEEEEEEEE-EEEEESSTSTTTAEEEEEEEEEEEE-EEEEEEEEEEEEEEEEE- 
7 0.460385 LLLLLLLLMMIIILLLIIII-IIILLVILMMEEFLLLLILIVLLLM-LLLLLLLLLLVILLLVL- 
8 0.438680 ILILLVVVVILVVVLQLLMM-QKQLIVVLLVIIMLLLLMLLSIIIS-SMMMILFFLLILIIVVL- 
9 0.393291 QQQDEEEQAAEEEDEKGSSD-QQEQDDQDEEAAAHQLESSATVVQR-QQQQQVVYTHSTVTTTE- 

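From the table above, I would like to keep the same number and score but list the sequence column by column (vertically), so it should look like this: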

no.  score   amino acid(1st column) 

1  0.273778   F 

2  0.394647   I

3  0.456667   F 

Another table for the second column of amino acids:

no  score  amino acid (2nd column) 

1  0.273778   F 

2  0.394647   I

3  0.456667   I 

A third table for the third column of amino acids, a fourth table for the fourth column, and so on.

Thanks in advance for your help.

+3

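What do 'F', 'I' and 'f' stand for? Are these the first characters of the strings above? Why is there an 'f' in the third line rather than an 'F'? We are not Python beginners, but we are not biologists either. We can help you with Python, but you have to explain what the individual amino acids are here. – eumiro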

+0

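It should be F ... I have edited the question (F, I, F). These are amino acid codes, and this is the result of an alignment. I would like to split the whole sequence column by column, keeping the score and the sequence number with each letter. – hari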

+0

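Your description of how to get to the letters is still not entirely clear. It might be best to add some example sequences and show how the desired result is obtained from them. – hochl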

Answers

0

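From your example, I guess that:

  • You want to save each table to a separate results file.
  • Each sequence is 65 characters long.
  • Some sequences contain meaningless whitespace that has to be removed (line 3 in your example).

Here is my code example; it reads the data from input.dat and writes the results to result-column-<number>.dat:
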
import re
import sys

# Each table is written to its own results file.
# Dictionary mapping column numbers to open file objects:
resultfiles = {}


def get_result_file(column):
    # Helper that opens the results file for a column on first use.
    if column not in resultfiles:
        resultfiles[column] = open('result-column-%d.dat' % column, 'w')
    return resultfiles[column]


# Iterate over the data:
with open('input.dat') as infile:
    for line in infile:
        try:
            # str.split(separator, maxsplit)
            # With maxsplit=2 the sequence stays in one piece even if it contains spaces:
            no, score, seq = line.split(None, 2)

            # From your example I guess that whitespace in a sequence is meaningless;
            # one of your sequences contains whitespace, so I remove it:
            seq = re.sub(r'\s+', '', seq)

            # Data validation helps to spot problems early:
            int(no)       # raises ValueError if no is not an integer
            float(score)  # raises ValueError if score is not a number
            assert len(seq) == 65, seq

        except (ValueError, AssertionError) as e:
            # Print the error and continue processing the data:
            print('Error %s in line: %s.' % (e, line), file=sys.stderr)
            continue  # jump to the next iteration of the for loop

        # Unpacking too few fields also raises ValueError;
        # assert <condition> raises AssertionError if the condition is False.

        # Iterate over each character in the amino acid sequence:
        for column, char in enumerate(seq, 1):
            f = get_result_file(column)
            f.write('%s %s %s\n' % (no, score, char))


# Close all opened result files:
for f in resultfiles.values():
    f.close()

Notable functions used in this example: str.split, re.sub, and enumerate.
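
To make the role of these functions concrete, here is a minimal sketch that applies them to row 3 of the question's sample, the row with stray spaces. The variable name `first_three` is only for illustration.

```python
import re

# Row 3 of the question's sample; note the stray spaces.
line = "3 0.456667  FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA- \n"

# maxsplit=2 stops after two splits, so the sequence stays in one piece
# even though it contains a space.
no, score, seq = line.split(None, 2)

# Remove every run of whitespace, including the trailing newline.
seq = re.sub(r"\s+", "", seq)

# enumerate(seq, 1) numbers the characters starting from 1,
# which gives the column number of each amino acid directly.
first_three = list(enumerate(seq, 1))[:3]
print(no, score, first_three)
```

A plain `line.split()` would return four fields for this row and break the three-way unpacking, which is why `maxsplit` is used.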

+0

Thanks for your help. I got an error at line 26, `assert int(no), no`: ValueError: invalid literal for int() with base 10: '#column' – hari

+0

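Can you find the line in your data file that contains the text '#column'? Could you show me that line by editing your question? Judging from the data sample you provided, this error cannot occur. – Ski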

+0

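The data I provided is just an example. I also want to keep the "-" characters in my data, because they mean something. I don't know whether I can upload my whole result file; maybe that would help. Sorry for not being able to explain this properly. – hari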
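
The ValueError reported above suggests that the real file contains a header or comment line starting with '#column'. The following is a minimal sketch, assuming such lines (and blank lines) can simply be skipped and that "-" characters must be kept. `iter_records` is a hypothetical helper name, and the sample input, including the '#column' header and the shortened sequences, is made up for illustration.

```python
def iter_records(lines):
    """Yield (no, score, seq) tuples, skipping blank lines and '#' lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # header, comment, or empty line
        no, score, seq = stripped.split(None, 2)
        # Drop whitespace inside the sequence but keep '-' gap characters.
        yield int(no), float(score), "".join(seq.split())


# Hypothetical input: the '#column' header line is an assumption.
sample = [
    "#column no. score sequence\n",
    "\n",
    "1 0.273778 FFHH-YYFLH\n",
    "2 0.394647 IIVV IVVVVI\n",
]
records = list(iter_records(sample))
print(records)
```

The same generator can be fed an open file object instead of a list, since both yield lines when iterated.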

5

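Assuming you have already opened the file containing the data as f, your example can be reproduced with:

for ln in f:  # loop over all lines
    # Join the trailing fields so that spaces inside a sequence do not break the unpacking.
    seqno, score, *parts = ln.split()
    seq = "".join(parts)
    print("%s %s %s" % (seqno, score, seq[0]))

To split up the sequence, you additionally need to loop over all the letters in seq:

for ln in f:
    seqno, score, *parts = ln.split()
    for x in "".join(parts):
        print("%s %s %s" % (seqno, score, x))

This prints the sequence number and the score many times. I'm not sure that this is what you want.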
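
If the goal is one table per alignment column rather than one printed line per letter, the rows can be transposed with zip. This is a sketch, assuming all sequences have the same length once whitespace is removed; the three rows below are only the first five letters of the question's first three sequences.

```python
# Shortened rows taken from the start of the question's data.
rows = [
    ("1", "0.273778", "FFHH-"),
    ("2", "0.394647", "IIVVI"),
    ("3", "0.456667", "FIVVI"),
]

# zip(*sequences) yields one tuple per alignment column:
# ('F', 'I', 'F'), ('F', 'I', 'I'), ...
columns = list(zip(*(seq for _, _, seq in rows)))

for number, column in enumerate(columns, 1):
    print("column %d" % number)
    for (seqno, score, _), letter in zip(rows, column):
        print("%s %s %s" % (seqno, score, letter))
```

Note that zip stops at the shortest sequence, so rows of unequal length would silently lose their trailing columns.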

+1

If you plan to do anything further with the sequences, I suggest converting them to Biopython (www.biopython.org) Sequence objects. – 2011-12-12 12:03:13

+0

Thanks for the suggestion. I just want to split the sequence; I have edited the question accordingly. – hari

0

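I don't think it is useful to create the tables.
Just put the data into a suitable structure and use a function to display what you need at the moment you need it:

with open('bio.txt') as f:
    data = [line.rstrip().split(None, 2) for line in f if line.strip()]


def display(data, nth, pat='%-6s %-15s %s', uz=('th', 'st', 'nd', 'rd')):
    # Header: the suffix is 'st', 'nd', 'rd' for columns 1-3 and 'th' for every column from 4 on.
    print(pat % ('no.', 'score',
                 'amino acid(%d%s column)' % (nth, uz[0 if nth // 4 else nth])))
    # One row per sequence, showing only the nth character (counting from 1).
    print('\n'.join(pat % (a, b, c[nth - 1]) for a, b, c in data))


display(data, 1)
print()
display(data, 3)
print()
display(data, 7)

Result: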

no.  score   amino acid(1st column) 
1  0.273778   F 
2  0.394647   I 
3  0.456667   F 
4  0.407581   M 
5  0.331761   A 
6  0.452381   E 
7  0.460385   L 
8  0.438680   I 
9  0.393291   Q 

no.  score   amino acid(3rd column) 
1  0.273778   H 
2  0.394647   V 
3  0.456667   V 
4  0.407581   L 
5  0.331761   N 
6  0.452381   E 
7  0.460385   L 
8  0.438680   I 
9  0.393291   Q 

no.  score   amino acid(7th column) 
1  0.273778   Y 
2  0.394647   V 
3  0.456667   V 
4  0.407581   L 
5  0.331761   S 
6  0.452381   E 
7  0.460385   L 
8  0.438680   V 
9  0.393291   E

0

Here is a simple, working solution:

# Open the file "db.txt"; use the full path if it is not in the same directory as the Python file.
# Any file extension works; 'r' opens the file in reading mode.
with open("db.txt", 'r') as filehandler:
    # Read all lines at once into a list, one list member per line.
    # (Alternatively, the file can be read line by line.)
    LinesList = filehandler.readlines()
# Create empty lists to store the results.
no = []
Score = []
AminoAcids = []  # list of lists: index 0 holds the characters of the first line, and so on
# Process each line, assuming constant spacing in the input file:
# no is the first character, the score is the slice [4:12], and the amino acids run from index 16 to the end.
# Adjust these offsets if your file is laid out differently.
for Line in LinesList:
    # add the no
    no.append(Line[0])
    # add the score
    Score.append(Line[4:12])
    # break up the amino acids so that each character is a list element
    Aminolist = list(Line[16:])
    # add Aminolist to the AminoAcids matrix (list of lists)
    AminoAcids.append(Aminolist)

# You can now play with the data!
# Print the tables; you could also write them to a file instead.
for k in range(0, 65):
    print("Table %d" % (k + 1))  # add 1 so the numbering is not zero-based
    print("no. Score  amino acid(column %d)" % (k + 1))
    for i in range(len(no)):
        print("%s %s %s" % (no[i], Score[i], AminoAcids[i][k]))

Here is part of the result as it appears on the console:

Table 1 
no. Score  amino acid(column 1) 
1 0.273778 F 
2 0.394647 I 
3 0.456667 F 
4 0.407581 M 
5 0.331761 A 
6 0.452381 E 
7 0.460385 L 
8 0.438680 I 
9 0.393291 Q 
Table 2 
no. Score  amino acid(column 2) 
1 0.273778 F 
2 0.394647 I 
3 0.456667 I 
4 0.407581 M 
5 0.331761 A 
6 0.452381 E 
7 0.460385 L 
8 0.438680 L 
9 0.393291 Q 
Table 3 
no. Score  amino acid(column 3) 
1 0.273778 H 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 N 
6 0.452381 E 
7 0.460385 L 
8 0.438680 I 
9 0.393291 Q 
Table 4 
no. Score  amino acid(column 4) 
1 0.273778 H 
2 0.394647 V 
3 0.456667 V 
4 0.407581 M 
5 0.331761 S 
6 0.452381 E 
7 0.460385 L 
8 0.438680 L 
9 0.393291 D 
Table 5 
no. Score  amino acid(column 5) 
1 0.273778 - 
2 0.394647 I 
3 0.456667 I 
4 0.407581 I 
5 0.331761 R 
6 0.452381 D 
7 0.460385 L 
8 0.438680 L 
9 0.393291 E 
Table 6 
no. Score  amino acid(column 6) 
1 0.273778 Y 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 Q 
6 0.452381 E 
7 0.460385 L 
8 0.438680 V 
9 0.393291 E 
Table 7 
no. Score  amino acid(column 7) 
1 0.273778 Y 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 S 
6 0.452381 E 
7 0.460385 L 
8 0.438680 V 
9 0.393291 E
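
The code in this answer notes that the tables could be written to a file instead of printed. A minimal sketch of that variant follows, assuming the three lists have already been filled as in the answer. `write_tables` is a hypothetical helper name, the rows are shortened to three letters, and `io.StringIO` stands in for a real file such as one opened with open("tables.txt", "w") (the file name is also an assumption).

```python
import io


def write_tables(numbers, scores, amino_acids, out):
    """Write one table per sequence column to the file-like object `out`."""
    width = min(len(row) for row in amino_acids)  # shortest row limits the tables
    for k in range(width):
        print("Table %d" % (k + 1), file=out)
        print("no. Score  amino acid(column %d)" % (k + 1), file=out)
        for i in range(len(numbers)):
            print("%s %s %s" % (numbers[i], scores[i], amino_acids[i][k]), file=out)


# Shortened rows from the question's data, already split into lists.
numbers = ["1", "2"]
scores = ["0.273778", "0.394647"]
amino_acids = [list("FFH"), list("IIV")]

buffer = io.StringIO()  # stands in for a file opened in write mode
write_tables(numbers, scores, amino_acids, buffer)
print(buffer.getvalue())
```

Passing the output target as a parameter keeps the table logic identical whether the destination is the console (sys.stdout), a file, or an in-memory buffer.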