2016-12-02 64 views

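Python: parsing between two lines with the same keyword

I know how to parse between two lines when the starting "target word" and the ending "target word" are different.

For example, if I want to parse between X and Y: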

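import sys

# X and Y are the start and end marker strings
parse = False
for line in open(sys.argv[1]):
    if Y in line:
        parse = False   # end marker: stop printing
    if parse:
        print(line)
    if X in line:
        parse = True    # start marker: print from the next line on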

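I'm stuck on a slightly different problem, where the word I want to parse between is the same word at both ends. In this example there are 4 different homolog groups, and I want to extract the human/mouse pairs within each homolog group, so I want to turn this file: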

1:_HomoloGene:_141209.Gene_conserved_in_Mammals 
LOC102724657       Homo_sapiens 
Gm12569         Mus_musculus 
2:_HomoloGene:_141208.Gene_conserved_in_Euarchontoglires  
LOC102724737       Homo_sapiens 
LOC102636216       Mus_musculus 
3:_HomoloGene:_141152.Gene_conserved_in_Euarchontoglires  
LOC728763        Homo_sapiens 
E030010N07Rik       Mus_musculus 
E030010N09Rik       Mus_musculus 
E030010N010Rik       Mus_musculus 
E030010N08Rik       Mus_musculus 
LOC102551034       Rattus_norvegicus 
4:_HomoloGene:_141054.Gene_conserved_in_Boreoeutheria  
LOC102723572       Homo_sapiens 
LOC102157295       Canis_lupus_familiaris 
LOC102633228       Mus_musculus 

into a Homo_sapiens / Mus_musculus comparison like this:

Homo_sapiens Mus_musculus 
LOC102724657 Gm12569 
LOC102724737 LOC102636216 
LOC728763  E030010N07Rik 
LOC728763  E030010N09Rik 
LOC728763  E030010N010Rik 
LOC728763  E030010N08Rik 
LOC102723572 LOC102633228 

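I don't have any nearly working code to show. Here is one example of what I've tried (I have also tried a regular expression, and splitting on the lines containing the word "HomoloGene"):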

import sys 
ListOfLines = open(sys.argv[1]) 
for line in ListOfLines: 
     if "HomoloGene" in line: 
       if "HomoloGene" in ListOfLines.next(): 
         print line 
         print "**" 
       else: 
         print ListOfLines.next() 

Thanks.

Answers

3

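The commented code below produces the result for your example. To understand it, you may need to read up on regular expressions, collections.defaultdict, itertools.product, and unpacking argument lists.

Code: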

import sys 
import re 
from collections import defaultdict 
import itertools 

#define the pairs of words we want to compare 
compare = ['Homo_sapiens', 'Mus_musculus'] 

#define some regular expressions to split up the input data file 
#this searches for a digit, a colon, and matches the rest of the line 
group_re = re.compile(r"\n?\d+:.*\n") 
#this matches non-whitespace, followed by whitespace, and then non-whitespace, returning the two non-whitespace sections 
line_re = re.compile(r"(\S+)\s+(\S+)") 

#to store our resulting comparisons 
comparison = [] 

#open and read in the datafile 
with open(sys.argv[1]) as f: 
    datafile = f.read() 
#use our regular expression to split the datafile into homolog groups 
for dataset in group_re.split(datafile): 
    #ignore empty matches 
    if dataset.strip()=='': continue 
    #split our group into lines 
    dataset = dataset.split('\n') 
    #use our regular expression to match each line, pulling out the two bits of data 
    dataset = [line_re.match(line).groups() for line in dataset if line.strip()!=''] 
    #build a dictionary to store our words 
    words = defaultdict(list) 
    #loop through our group dataset, grouping each line by its word 
    for v, k in dataset: words[k].append(v) 
    #add the results to our output list. Note here we are unpacking an argument list 
    comparison+=itertools.product(*[words[w] for w in compare]) 

#print out the words we wanted to compare 
print('\t'.join(compare)) 
#loop through our output dataset 
for combination in comparison: 
    #print each comparison, spaced with a tab character 
    print('\t'.join(combination)) 
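
If the last step is the unclear part, here is a minimal sketch of just the grouping and pairing for a single homolog group. The rows are copied from group 3 of the example file; the gene name "LOC1" in the last line is made up purely for illustration.

```python
from collections import defaultdict
from itertools import product

# One homolog group as (gene, species) rows, taken from group 3 of the example
rows = [
    ("LOC728763", "Homo_sapiens"),
    ("E030010N07Rik", "Mus_musculus"),
    ("E030010N09Rik", "Mus_musculus"),
    ("LOC102551034", "Rattus_norvegicus"),
]
compare = ["Homo_sapiens", "Mus_musculus"]

# Group the gene names by species
words = defaultdict(list)
for gene, species in rows:
    words[species].append(gene)

# product(*lists) pairs every human gene with every mouse gene of the group
pairs = list(product(*[words[s] for s in compare]))

# A group that lacks one of the species contributes no pairs at all,
# because the product with an empty list is empty ("LOC1" is hypothetical)
no_mouse = list(product(["LOC1"], []))
```

Here pairs holds the two human/mouse combinations for the group, the rat gene is ignored because its species is not in compare, and no_mouse is an empty list.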
1

It's a two-part problem. First extract the homolog groups into a dictionary, then iterate over the groups and print the pairs.

#!/usr/bin/env python 
import re 

data = {} 
group = None 
# Opens the text file 
with open("genes.txt", "r") as f: 
    # reads the lines 
    for line in f: 
        # When there is a number followed by : at the line start -> new group 
        match = re.search(r"^([0-9]+):", line) 
        if match: 
            # extracts the group number and uses it as the dict key 
            group = match.group(1) 
            # adds the species as entries with empty lists as values 
            data[group] = {"Homo_sapiens": [], "Mus_musculus": []} 
        else: 
            # splits the line on whitespace (also removes the \n) 
            text = line.split() 
            # skips blank lines and lines before the first group; 
            # if the species is one we track, add the gene name to the list 
            if group is not None and len(text) == 2 and text[1] in data[group]: 
                data[group][text[1]].append(text[0]) 
# Here you go with your parsed data 
print(data) 
# Now we feed it into the text format you want 
print("Homo_sapiens\t\tMus_musculus") 
# go through groups in numeric order 
for gr in sorted(data, key=int): 
    # go through the Hs genes 
    for hs_gene in data[gr]["Homo_sapiens"]: 
        # get all the associated Ms genes 
        for ms_gene in data[gr]["Mus_musculus"]: 
            # print the pairs 
            print(hs_gene + "\t\t" + ms_gene) 
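
The two parts can also be folded into a single pass, since each header line both closes the previous group and opens the next one. Below is a rough sketch of that idea, not a drop-in replacement: the function name species_pairs and the shortened sample are my own, and it assumes every data line has exactly two whitespace-separated fields.

```python
import re
from itertools import product

HEADER = re.compile(r"^\d+:")  # a group header starts with digits and a colon

def species_pairs(lines, first="Homo_sapiens", second="Mus_musculus"):
    """Yield (first, second) gene pairs for each homolog group in lines."""
    genes = {first: [], second: []}
    for line in lines:
        if HEADER.match(line):
            # A header ends the previous group: emit its pairs, then reset
            yield from product(genes[first], genes[second])
            genes = {first: [], second: []}
            continue
        fields = line.split()
        if len(fields) == 2 and fields[1] in genes:
            genes[fields[1]].append(fields[0])
    # The last group has no header after it, so emit it here
    yield from product(genes[first], genes[second])

# Shortened sample built from groups 1 and 3 of the example file
sample = """\
1:_HomoloGene:_141209.Gene_conserved_in_Mammals
LOC102724657       Homo_sapiens
Gm12569         Mus_musculus
3:_HomoloGene:_141152.Gene_conserved_in_Euarchontoglires
LOC728763        Homo_sapiens
E030010N07Rik       Mus_musculus
E030010N09Rik       Mus_musculus
LOC102551034       Rattus_norvegicus
"""

result = list(species_pairs(sample.splitlines()))
for hs_gene, ms_gene in result:
    print(hs_gene + "\t" + ms_gene)
```

To run it on a real file, pass the open file object instead of sample.splitlines(). This version needs Python 3 because of yield from.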

Hope this helps.

Don't you think the number of groups will go above 9? – alexis

Good point. Fixed accordingly. – CDe

s/'if match != None:'/'if match:'/. You forgot to drop the old definition of 'group', so your code is still broken. – alexis