2011-11-03

Python: how to decrement and update the value of a field in delimited text

I have a text file (a Bowtie alignment file) that looks like this:

 
read_1 + 345995|PACid:16033981 599 AGTAGTAATCAGTCACCCGCAAGGTAGACAAGG qqqqqqqqqqqqqqqqqqqqq!!qqqqqqqqqq 0 
read_2 + 949205|PACid:16054220 338 TACCAGCACTAATGCACCGGATCCCATCAGATC qqqqqqqqqqqqqqqqqqqqqqqqqqqqqq!!q 0 31:A>T 
read_3 + 932004|PACid:16034380 1226 GGCACCTTATGAGAAATCAAAGTTTTTGGGTTC qqqqqqqqqqqqqqq!!qqqqqqqqqqqqq!!q 3 

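I want to subtract one from column #4 (the position) and print each line with the updated value.

I can read the file, split the fields on tabs, and identify column 4 as data[3], but I am stuck on subtracting one from every value in column 4 and then printing all the fields of each line with the updated column 4 value.

How can I do this in Python?

I was thinking of something like this: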

in_file = open(sys.argv[1],'r') 
out_file = open(sys.argv[2], 'w') 
for line in in_file: 
    data = line.rstrip().split('\t') 
    position = int(float(data[3]) -1) 

but I don't know how to go on and print the line with the updated position.

+1

Which part of the problem are you stuck on? (Reading the file? Identifying the fourth column? Subtracting? Printing?) – Johnsyweb

+1

Hey! I've just realised this is part of my DNA sequence. Where did you get it from? The Internet and its lack of privacy! :-) – paxdiablo

+1

As a side note, is it necessary to use Python? This is easy to do with awk, for example: awk 'BEGIN {OFS="\t"} NF>0 {$4 -= 1; print}' out.txt –

Answers

1

Use the csv module, and tell it that your field delimiter is a tab:

from io import StringIO 

indata = StringIO(u"""read_1\t+\t345995|PACid:16033981\t599\tAGTAGTAATCAGTCACCCGCAAGGTAGACAAGG\tqqqqqqqqqqqqqqqqqqqqq!!qqqqqqqqqq\t0
read_2\t+\t949205|PACid:16054220\t338\tTACCAGCACTAATGCACCGGATCCCATCAGATC\tqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq!!q\t0\t31:A>T
read_3\t+\t932004|PACid:16034380\t1226\tGGCACCTTATGAGAAATCAAAGTTTTTGGGTTC\tqqqqqqqqqqqqqqq!!qqqqqqqqqqqqq!!q\t3
""")

# that StringIO stuff is just for testing, you should do 
# with open('your_file_name', 'r') as indata: 
# before the 'for' loop, and then indent the rest one level. 

from csv import reader 

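# Each 'line' is a list of the tab-separated fields.
for line in reader(indata, delimiter='\t'):
    # Only touch lines that actually have a 4th column (skips blank/short lines).
    if len(line) > 3:
        line[3] = str(int(line[3]) - 1)
    print('\t'.join(line))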

Then just convert the position to a number, subtract one, convert it back to a string, and print the line.
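For reading from one file and writing to another, as the question does with sys.argv, the same idea might be sketched roughly as follows. The function name decrement_column and its column parameter are hypothetical, not part of the answer. The sketch also passes quoting=csv.QUOTE_NONE, on the assumption that quality strings may contain double-quote characters that should be treated as ordinary text, and it writes each row back with a plain tab join.

```python
import csv


def decrement_column(in_path, out_path, column=3):
    """Copy a tab-delimited file, subtracting 1 from one column (0-based index)."""
    with open(in_path, newline='') as src, open(out_path, 'w') as dst:
        # QUOTE_NONE: treat '"' as ordinary data rather than as a CSV quote.
        for row in csv.reader(src, delimiter='\t', quoting=csv.QUOTE_NONE):
            # Rows without that column (e.g. blank lines) are copied unchanged.
            if len(row) > column:
                row[column] = str(int(row[column]) - 1)
            dst.write('\t'.join(row) + '\n')
```

From a script it could be called as decrement_column(sys.argv[1], sys.argv[2]) after importing sys.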

+0

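Thanks, but I get an error with the above code: line[3] = str(int(line[3]) - 1) IndexError: list index out of range. I need to print all the fields from the original file, with column 4 updated. I was trying to do something like this: in_file = open(sys.argv[1],'r') out_file = open(sys.argv[2], 'w') for line in in_file: data = line.rstrip().split('\t') position = int(float(data[3]) -1), but I'm not sure how to go on and print the line with the updated position – psaima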

+0

@psaima Then you can add an 'if len(line) > 3:' test inside the 'for' loop to filter out the bad lines; I've edited that in. – agf

+0

@psaima Basically, your existing code is almost right. Just change 'position = int(float(data[3]) -1)' to 'data[3] = str(int(data[3]) - 1)', and then you can print '\t'.join(data) the way I do. – agf
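The change suggested in this comment, applied to the question's split-based approach, could look something like the sketch below. The helper name decrement_position and the shortened sample line are made up for illustration. It strips only the newline (rather than calling rstrip() with no argument) on the assumption that an empty trailing field should be preserved.

```python
def decrement_position(line, column=3):
    """Return the line with 1 subtracted from one tab-separated field."""
    # Strip only the newline so that empty trailing fields survive the split.
    data = line.rstrip('\n').split('\t')
    # Leave blank or short lines untouched instead of raising IndexError.
    if len(data) > column:
        data[column] = str(int(data[column]) - 1)
    return '\t'.join(data)


# Shortened, made-up sample in the same layout as the question's file.
sample = 'read_1\t+\t345995|PACid:16033981\t599\tAGTAG\tqqqqq\t0\n'
print(decrement_position(sample))
```

Inside the question's loop, each result could then be written with out_file.write(decrement_position(line) + '\n').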