Parsing results using Python

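I am a Python beginner (I am a biologist). I have a file containing the results from a particular piece of software, and I would like to parse those results using Python. From the output below, I want to get just the score, and I want to split the sequence into individual amino acids.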

no. score sequence

1 0.273778 FFHH-YYFLHRRRKKCCNNN-CCCK---HQQ---HHKKHV-FGGGE-EDDEDEEEEEEEE-EE-- 
2 0.394647 IIVVIVVVVIVVVVVVVVVV-CCCVA-IVVI--LIIIIIIIIYYYA-AVVVVVVVAAAAV-AST- 
3 0.456667  FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA- 
4 0.407581 MMLMILLLLMVVAIILLIII-LLLIVLLAVVVVVAAAVAAVAIIII-ILIIIIIILVIMKKMLA- 
5 0.331761 AANSRQSNAAQRRQCSNNNR-RALERGGMFFRRKQNNQKQKKHHHY-FYFYYSNNWWFFFFFFR- 
6 0.452381 EEEEDEEEEEEEEEEEEEEE-EEEEESSTSTTTAEEEEEEEEEEEE-EEEEEEEEEEEEEEEEE- 
7 0.460385 LLLLLLLLMMIIILLLIIII-IIILLVILMMEEFLLLLILIVLLLM-LLLLLLLLLLVILLLVL- 
8 0.438680 ILILLVVVVILVVVLQLLMM-QKQLIVVLLVIIMLLLLMLLSIIIS-SMMMILFFLLILIIVVL- 
9 0.393291 QQQDEEEQAAEEEDEKGSSD-QQEQDDQDEEAAAHQLESSATVVQR-QQQQQVVYTHSTVTTTE-

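From the table above, I would like to get the same number and score, but with the sequence split into individual columns, so it should look like this: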

no.  score   amino acid(1st column) 

1  0.273778   F 

2  0.394647   I 

3  0.456667   F

Another table represents the second column of amino acids:

no  score  amino acid (2nd column) 

1  0.273778   F 

2  0.394647   I 

3  0.456667   I

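The third table represents the third column of amino acids, the fourth table the fourth column, and so on.

Thanks in advance for your help.

Source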

2011-12-12 hari

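What do 'F', 'I' and 'F' stand for? Are these the first characters of the strings above? Why is there an 'f' in the third row instead of an 'F'? We are not Python beginners, but we are not biologists either. We can help you with Python, but you have to explain what the individual amino acids are here. – eumiro

It should be F ... I have edited the question (F, I, F). These are amino acid codes, and this is the result of an alignment. I would like to split the whole sequence column by column, keeping the score and the sequence number. – hari

Your description of how to get to the letters is still not entirely clear. Perhaps it would be best to add some example sequences and show how to get the desired result from them. – hochl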

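From your example, I guess that:

You want to save each table to a different result file.
Each sequence is 65 characters long.
Some sequences contain meaningless whitespace that has to be removed (line 3 in your example).

Here is my code example; it reads data from input.dat and writes the results to result-column-<number>.dat: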

import re
import sys

# I will write each table to a different results file.
# Dictionary mapping column numbers to opened file objects:
resultfiles = {}


def get_result_file(column):
    # Helper to easily access the results file for a column.
    if column not in resultfiles:
        resultfiles[column] = open('result-column-%d.dat' % column, 'w')
    return resultfiles[column]


# Iterate over the data:
for line in open('input.dat'):
    try:
        # str.split(separator, maxsplit)
        # With maxsplit=2 it is more fail-proof:
        no, score, seq = line.split(None, 2)

        # From your example I guess that whitespace in a sequence is meaningless;
        # however, one of your sequences contains whitespace, so I remove it:
        seq = re.sub(r'\s+', '', seq)

        # Data validation will help to spot problems early:
        assert int(no), no
        assert float(score), score
        assert len(seq) == 65, seq

    except Exception as e:
        # Print the error and continue to process the data:
        print('Error %s in line: %s.' % (e, line), file=sys.stderr)
        continue  # jump to the next iteration of the for loop

    # int() and float() raise ValueError if no or score aren't numbers;
    # assert <condition> raises AssertionError if the condition is False.

    # Iterate over each character in the amino acid sequence:
    for column, char in enumerate(seq, 1):
        f = get_result_file(column)
        f.write('%s %s %s\n' % (no, score, char))


# Close all opened result files:
for f in resultfiles.values():
    f.close()

Notable functions used in this example: open, str.split, re.sub and enumerate.
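
As a small illustration of str.split, re.sub and enumerate as they are used in the code above, the sketch below applies them to a single line taken from the question's data (row 3, the one with stray whitespace). The variable names are chosen only for this example.

```python
import re

# Row 3 from the question: note the extra space before the sequence
# and the space inside it.
line = "3 0.456667  FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA- \n"

# maxsplit=2 splits off the first two fields and leaves the rest
# of the line (spaces and all) in the third field.
no, score, seq = line.split(None, 2)

# Remove every run of whitespace, including the trailing newline.
seq = re.sub(r"\s+", "", seq)

# enumerate(seq, 1) numbers the residues starting from column 1.
first_three = list(enumerate(seq, 1))[:3]

print(no, score)
print(first_three)
```

Because the whitespace is stripped before enumerating, the column numbers count residues and gap characters only, not the stray space.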

Source

2011-12-12 13:55:03 Ski

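Thanks for your help. I got an error at line 26, `assert int(no), no`: ValueError: invalid literal for int() with base 10: '#column' – hari

Can you find the line in your data file that contains the text '#column'? Could you show me that line by editing your question? Judging from the data sample you provided, this error cannot occur. – Ski

The data I provided is only an example. I also want to keep the '-' characters in my data, because they mean something. I don't know whether I can upload my whole result file; maybe that would help. Sorry for not being able to explain this properly. – hari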
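
The '#column' in that error message suggests that the real file contains a header or comment line which the sample did not show. Here is a minimal sketch of one way to skip such lines before converting the fields, assuming that header lines start with '#'. The header text in the sample and the helper name parse_rows are made up for illustration.

```python
def parse_rows(lines):
    """Yield (no, score, seq) for data rows, skipping blanks and '#' lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or header/comment line
        fields = stripped.split(None, 2)
        if len(fields) != 3:
            continue  # not a "no score sequence" row
        try:
            no = int(fields[0])
            score = float(fields[1])
        except ValueError:
            continue  # the first two fields are not numbers
        # The sequence is kept as it is, '-' characters included.
        yield no, score, fields[2]


sample = [
    "#column no. score sequence\n",  # hypothetical header line
    "\n",
    "1 0.273778 FFHH-YYFLH\n",
    "2 0.394647 IIVVIVVVVI\n",
]
rows = list(parse_rows(sample))
print(rows)
```

Rows that are skipped silently here could instead be reported on sys.stderr, as the answer above does, if you want to see what was rejected.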

Assuming you have already opened the file containing the data as f, your example can be reproduced with:

for ln in f: # loop over all lines 
    seqno, score, seq = ln.split() 
    print("%s %s %s" % (seqno, score, seq[0]))

To split up the sequence, you additionally need to loop over all the letters in seq:

for ln in f: 
    seqno, score, seq = ln.split() 
    for x in seq: 
        # print the current letter x, not always the first one
        print("%s %s %s" % (seqno, score, x))

This will print the sequence number and score many times. I'm not sure this is what you want.
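
If one table per column is what is wanted, one possible approach is to collect the letters per column first and print the tables afterwards. This is only a sketch: the three input lines are shortened versions of the question's rows, and the name tables is an assumption of this example.

```python
from collections import defaultdict

# Shortened sequences, for illustration only.
lines = [
    "1 0.273778 FFHH\n",
    "2 0.394647 IIVV\n",
    "3 0.456667 FIVV\n",
]

# column number -> list of (seqno, score, letter)
tables = defaultdict(list)
for ln in lines:
    seqno, score, seq = ln.split()
    for column, letter in enumerate(seq, 1):
        tables[column].append((seqno, score, letter))

# Print one table per column, in column order.
for column in sorted(tables):
    print("no. score amino acid (column %d)" % column)
    for seqno, score, letter in tables[column]:
        print("%s %s %s" % (seqno, score, letter))
    print()
```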

Source

2011-12-12 11:51:46

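If you plan to do anything further with the sequences, I suggest converting them to Biopython (www.biopython.org) sequence objects. – 2011-12-12 12:03:13

Thanks for the suggestion. I just want to split the sequence; I have edited the question accordingly. – hari

I don't think it is useful to create tables.
Just put the data in a suitable structure and use a function to display what you need at the moment you need it: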

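with open('bio.txt') as f:
    data = [line.rstrip().split(None, 2) for line in f if line.strip()]


def display(data, nth, pat='%-6s %-15s %s', uz=('th', 'st', 'nd', 'rd')):
    # Ordinal suffix: 1st, 2nd, 3rd, 4th, ..., 11th, 12th, 13th, ..., 21st
    suffix = uz[0] if 10 < nth % 100 < 14 or nth % 10 > 3 else uz[nth % 10]
    print(pat % ('no.', 'score', 'amino acid(%d%s column)' % (nth, suffix)))
    # One row per sequence, showing only its nth character
    print('\n'.join(pat % (a, b, c[nth - 1]) for a, b, c in data))


display(data, 1)
print()
display(data, 3)
print()
display(data, 7)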

Result

no.  score   amino acid(1st column) 
1  0.273778   F 
2  0.394647   I 
3  0.456667   F 
4  0.407581   M 
5  0.331761   A 
6  0.452381   E 
7  0.460385   L 
8  0.438680   I 
9  0.393291   Q 

no.  score   amino acid(3rd column) 
1  0.273778   H 
2  0.394647   V 
3  0.456667   V 
4  0.407581   L 
5  0.331761   N 
6  0.452381   E 
7  0.460385   L 
8  0.438680   I 
9  0.393291   Q 

no.  score   amino acid(7th column) 
1  0.273778   Y 
2  0.394647   V 
3  0.456667   V 
4  0.407581   L 
5  0.331761   S 
6  0.452381   E 
7  0.460385   L 
8  0.438680   V 
9  0.393291   E
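
If all columns are needed at once rather than one at a time, the sequences can also be transposed with zip. A minimal sketch, using only the first seven residues of the first three rows from the question; the names rows and columns are assumptions of this example.

```python
# First seven residues of the first three rows in the question.
rows = [
    ("1", "0.273778", "FFHH-YY"),
    ("2", "0.394647", "IIVVIVV"),
    ("3", "0.456667", "FIVVIVV"),
]

# zip(*sequences) groups the 1st letters together, then the 2nd letters,
# and so on; it stops at the end of the shortest sequence.
columns = ["".join(letters) for letters in zip(*(seq for _, _, seq in rows))]

print(columns[0])  # letters of column 1, top to bottom
print(columns[6])  # letters of column 7, top to bottom
```

Note that zip silently truncates to the shortest sequence, so rows of unequal length should be checked beforehand.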

Source

2011-12-12 15:07:09 eyquem

Here is a simple, workable solution:

# Open the file "db.txt"; give the full path, or just the name if it is
# in the same directory as the Python file.
# You can use any extension for the file; 'r' is for reading mode.
filehandler = open("db.txt", 'r')
# Read all the lines at once into a list; every line is a list member.
# Another way: you can read it line by line.
LinesList = filehandler.readlines()
filehandler.close()
# Create empty lists to store your results
no = []
Score = []
AminoAcids = []  # a list of lists: index 0 holds the characters of the first line's sequence, and so on
# Process each line, assuming constant spacing in the input file:
# no is the first char, the score runs from char 4 to 12 and the amino acids from 16 to the end
for Line in LinesList:
    # add the no
    no.append(Line[0])
    # add the score
    Score.append(Line[4:12])
    Aminolist = list(Line[16:])  # break the amino acid sequence up so each character is a list element
    # add Aminolist to the AminoAcids matrix (list of lists)
    AminoAcids.append(Aminolist)

# You can now play with the data!
# Print the tables; you can also write them to a file instead.
for k in range(0, 65):
    print("Table %d" % (k + 1))  # add 1 so the numbering is not zero-based
    print("no. Score  amino acid(column %d)" % (k + 1))
    for i in range(len(no)):
        print("%s %s %s" % (no[i], Score[i], AminoAcids[i][k]))

Here is part of the result as it appears on the console:

Table 1 
no. Score  amino acid(column 1) 
1 0.273778 F 
2 0.394647 I 
3 0.456667 F 
4 0.407581 M 
5 0.331761 A 
6 0.452381 E 
7 0.460385 L 
8 0.438680 I 
9 0.393291 Q 
Table 2 
no. Score  amino acid(column 2) 
1 0.273778 F 
2 0.394647 I 
3 0.456667 I 
4 0.407581 M 
5 0.331761 A 
6 0.452381 E 
7 0.460385 L 
8 0.438680 L 
9 0.393291 Q 
Table 3 
no. Score  amino acid(column 3) 
1 0.273778 H 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 N 
6 0.452381 E 
7 0.460385 L 
8 0.438680 I 
9 0.393291 Q 
Table 4 
no. Score  amino acid(column 4) 
1 0.273778 H 
2 0.394647 V 
3 0.456667 V 
4 0.407581 M 
5 0.331761 S 
6 0.452381 E 
7 0.460385 L 
8 0.438680 L 
9 0.393291 D 
Table 5 
no. Score  amino acid(column 5) 
1 0.273778 - 
2 0.394647 I 
3 0.456667 I 
4 0.407581 I 
5 0.331761 R 
6 0.452381 D 
7 0.460385 L 
8 0.438680 L 
9 0.393291 E 
Table 6 
no. Score  amino acid(column 6) 
1 0.273778 Y 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 Q 
6 0.452381 E 
7 0.460385 L 
8 0.438680 V 
9 0.393291 E 
Table 7 
no. Score  amino acid(column 7) 
1 0.273778 Y 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 S 
6 0.452381 E 
7 0.460385 L 
8 0.438680 V 
9 0.393291 E
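
The loop above hard-codes 65 columns and will raise an IndexError if a row is shorter than that. A small sketch of one way to derive the column count from the data and to flag rows of unexpected length; the sequences here are made up for illustration and the last one is deliberately too short.

```python
# Made-up rows; the last one is too short.
seqs = ["FFHH-YY", "IIVVIVV", "FIVVI"]

lengths = [len(s) for s in seqs]
n_cols = min(lengths)  # number of columns that every row can supply

# 1-based numbers of the rows whose length differs from the longest row.
odd_rows = [i for i, n in enumerate(lengths, 1) if n != max(lengths)]

for k in range(n_cols):
    print("Table %d: %s" % (k + 1, " ".join(s[k] for s in seqs)))
print("rows with unexpected length:", odd_rows)
```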

Source

2011-12-13 13:53:38 Abdurahman
