2017-03-31
0

Python string splitting with multiple split points

Okay, I'll get straight to the point. Here is my code:

import re

def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment

The input to this function is a list of strings (seqs), for example:

List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"] 

and the enzymes as input:

[["TC", 1],["GC",1]] 

(Note: more than one enzyme can be given, but they mostly consist of the letters A, T, C, and G.)

The function should return a list that, in this example, contains 2 lists:

Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]] 

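Now I'm having trouble splitting it twice and getting the correct output.

Some more information about the function: it looks through the string (seq) for the recognition site, in this case TC or GC, and splits the string at the index given as the second element of the enzyme. It should do this for both strings in the list, using both enzymes.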
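
The rule described above can be sketched in two steps: collect every cut position (site start plus the enzyme's offset), then slice the string at those positions. This is only a minimal sketch; the names `cut_positions` and `digest` are made up for illustration and do not come from the question.

```python
def cut_positions(seq, enzymes):
    """Return the sorted cut positions in seq for every (site, offset) pair."""
    cuts = set()
    for site, offset in enzymes:
        start = seq.find(site)
        while start != -1:
            cuts.add(start + offset)
            # advance by one character so overlapping sites are also found
            start = seq.find(site, start + 1)
    # cuts at position 0 or len(seq) would only produce empty fragments
    return sorted(c for c in cuts if 0 < c < len(seq))


def digest(seq, enzymes):
    """Slice seq at every cut position."""
    bounds = [0] + cut_positions(seq, enzymes) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


seqs = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
fragments = [digest(s, enzymes) for s in seqs]
print(fragments)
```

Because all cut positions are gathered before any slicing happens, the order of the enzymes does not affect the result.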

+0

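It would help to spell out what exactly the "correct output" is. If your program doesn't do what you want, it won't help our readers understand what exactly the relationship between the input sequences, the enzyme list, and the output list is. It is clearly not just a simple substring search. – Risadinha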

+0

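For starters, 'prog' is a regular expression that should operate on a single string, while 'seq' is a list of strings, so 'prog.finditer(seq)' is an error. You need to process one input string at a time. –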

+0

@AlexHall Yes, I tried "for seq in seqs" (and changed it in the parameters as well), but it didn't give me the correct output. –

Answers

1

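Assuming the idea is to split at each enzyme, at the index point within the enzyme's letters, so that the split essentially falls between two letters, there is no need for regular expressions.

You can do this by finding the occurrences and inserting a split marker at the correct index, then post-processing the result to actually split.

For example: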

def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1])  # So AATTC becomes AATT|C
        result.append(seq.split('|'))  # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))
+0

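No, the enzyme can be longer than 2 letters, and the index can be greater or less than 2. It can be anything from 0 to 5, and there is no minimum or maximum length for the letters. –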

+0

So for the enzyme ['AAT', 2], 'AATACCG' becomes 'AA', 'TACCG', but for ['AAT', 1] it is 'A', 'AATCCG'? – pbuck

+0

Yes, but ['AAT', 1] would become ['A', 'ATCCG'] –
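
To make the effect of the index concrete, here is a small sketch of the marker-insertion idea applied to one site with different offsets. The helper `split_at` and its `marker` parameter are hypothetical names, and the sketch assumes the marker character never occurs in the sequence.

```python
def split_at(seq, site, offset, marker="|"):
    # Assumption: marker never occurs in seq, so splitting on it is safe.
    marked = seq.replace(site, site[:offset] + marker + site[offset:])
    return marked.split(marker)


print(split_at("AATACCG", "AAT", 2))  # ['AA', 'TACCG']
print(split_at("AATACCG", "AAT", 1))  # ['A', 'ATACCG']
print(split_at("AATACCG", "AAT", 3))  # ['AAT', 'ACCG']
```

An offset of 0 on a site at the very start of the string produces an empty first fragment, which the caller may want to filter out.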

1

Here is my solution:

Replace TC with T C and GC with G C (this is done based on the given index), then split on the space character.

def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes:
            # insert a space at the cut index inside each recognition site
            li = li.replace(en[0], en[0][:en[1]] + " " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res

seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))

The result is:

([["TC", 1],["GC",1]])

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC'] 
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

([["AAT", 2],["GC",1]])

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC'] 
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]
0

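Here is something that should work using regular expressions. In this solution, I find all occurrences of your enzyme strings and split using their corresponding indices.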

import re

def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes)  # dictionary of enzyme indices
    enzstr = '|'.join(enz[0] for enz in enzymes)  # "TC|GC" in this case

    for seq in seqs:
        sub = []
        pos1 = 0
        for match in re.finditer('(' + enzstr + ')', seq):
            index = dic[match.group(0)]
            pos2 = match.start() + index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
    # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out
+0

I like yours, but is there any way to make it work with 1 enzyme instead of always needing 2 or more? Maybe with: if enzymes > 1: –

+0

@NathanWeesie As far as I can tell, it already works with 1 enzyme... why do you say the code needs 2 or more? –
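
A quick way to check the single-enzyme case is to run the same alternation-based idea on one sequence with a one-element enzyme list. This is a minimal sketch rather than the answer's exact code: `digest` is a hypothetical single-sequence variant, and `re.escape` is added here as a precaution in case a site ever contains regex metacharacters.

```python
import re


def digest(seq, enzymes):
    offsets = dict(enzymes)  # site -> cut offset
    # With one enzyme, the joined pattern is simply that site.
    pattern = "|".join(re.escape(site) for site in offsets)
    fragments, start = [], 0
    for match in re.finditer(pattern, seq):
        cut = match.start() + offsets[match.group(0)]
        fragments.append(seq[start:cut])
        start = cut
    fragments.append(seq[start:])  # remainder after the last cut
    return fragments


one = digest("AATTCCGGTCGGGGCTCGGGGG", [["TC", 1]])
print(one)  # ['AATT', 'CCGGT', 'CGGGGCT', 'CGGGGG']
```

Joining a single site with '|' yields just that site, so no special case for one enzyme appears to be needed.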

0

Using a regex search with positive lookbehind and lookahead:

import re


def digest_fragment_with_enzyme(sequences, enzymes):
    # each alternative is a zero-width match located exactly at the cut point
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern)  # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start:end] for start, end in zip(indices, indices[1:])]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))

Output:

[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], 
['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']] 
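
One practical difference between a zero-width lookaround pattern and a pattern that consumes the recognition site shows up when sites overlap. The sketch below compares the cut positions each style finds; `cuts_consuming` and `cuts_lookaround` are hypothetical helper names, and which behaviour is the desired one depends on the problem.

```python
import re


def cuts_consuming(seq, site, offset):
    # finditer consumes each match, so overlapping sites are skipped
    return [m.start() + offset for m in re.finditer(re.escape(site), seq)]


def cuts_lookaround(seq, site, offset):
    # zero-width match at the cut point, so overlapping sites are all seen
    pattern = "(?<={})(?={})".format(re.escape(site[:offset]),
                                     re.escape(site[offset:]))
    return [m.start() for m in re.finditer(pattern, seq)]


print(cuts_consuming("AAAA", "AA", 1))   # [1, 3]
print(cuts_lookaround("AAAA", "AA", 1))  # [1, 2, 3]
```

For sites that cannot overlap, such as TC and GC in the example above, both styles give the same cut positions.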
0

The simplest answer I can think of:

input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1, len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)
0

Throwing my hat into the ring here.

  • Use a dictionary instead of a list of lists.
  • Join the patterns like others did, to avoid fancy regexes.

import re 

sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"] 
patterns = { 'TC': 1, 'GC': 1 } 

def intervals(patterns, text):
    pattern = '|'.join(patterns.keys())
    start = 0
    for match in re.finditer(pattern, text):
        index = match.start() + patterns.get(match.group())
        yield text[start:index]
        start = index
    yield text[start:]  # remainder after the last cut (or the whole text if nothing matched)

print([list(intervals(patterns, s)) for s in sequences])

# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]