2017-03-03 78 views
0

我需要一个函数来给出一个字符串列表最好与较大的字符串对齐的索引。对字符串列表的字符串列表的对齐索引

例如:

给出字符串:

text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.' 

和字符串列表:

tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.'] 

能函数来创建,以产生:

indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105] 


这里是一个脚本,我创建说明这一点:

from re import split 
from numpy import vstack, zeros 
import numpy as np 

# I need a function which takes a string and the tokenized list 
# and returns the indices for which the tokens were split at 
def index_of_split(text_str, list_of_strings): 
    #????? 
    return indices 

# The text string, string token list, and character binary annotations 
# are all given 
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.' 
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.'] 
# (This binary array labels the following terms ['Kir4.3', 'Dextran-sulfate', 'glucose']) 
bin_ann = [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] 

# Here we would apply our function 
indices = index_of_split(text, tok) 
# This list is the desired output 
#indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105] 

# We could now split the binary array based on these indices 
bin_ann_toked = np.split(bin_ann, indices) 
# and combine with the tokenized list 
tokenized_strings = np.vstack((tok, bin_ann_toked)).T 

# Then we can remove the trailing zeros, 
# which are likely caused from spaces, 
# or other non tokenized text 
for i, el in enumerate(tokenized_strings): 
    tokenized_strings[i][1] = el[1][:len(el[0])] 
print(tokenized_strings) 

提供以下输出,因为所描述的功能的工作:

[['Kir4.3' array([1, 1, 1, 1, 1, 1])] 
['is' array([0, 0])] 
['a' array([0])] 
['inwardly-rectifying' 
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])] 
['potassium' array([0, 0, 0, 0, 0, 0, 0, 0, 0])] 
['channel' array([0, 0, 0, 0, 0, 0, 0])] 
['.' array([0])] 
['Dextran-sulfate' array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])] 
['is' array([0, 0])] 
['useful' array([0, 0, 0, 0, 0, 0])] 
['in' array([0, 0])] 
['glucose' array([1, 1, 1, 1, 1, 1, 1])] 
['-' array([0])] 
['mediated' array([0, 0, 0, 0, 0, 0, 0, 0])] 
['channels' array([0, 0, 0, 0, 0, 0, 0, 0])] 
['.' array([0])]] 
+0

为什么“葡萄糖硫酸盐”在“葡萄糖”,“ - ”,“介导”不存在时保留连字符?如果我们不知道何时发生,那么函数将不得不被硬编码。 –

回答

1
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.' 

tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.'] 


ind = [0] 
for i,substring in enumerate(tok): 
    ind.append(text.find(substring,ind[i],len(text))) 

print ind[2:] 

结果

[7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105] 
+0

谢谢你的帮助,这是一个非常pythonic和简单的解决方案的问题。然而,使用暴力方法的解决方案非常有趣,值得研究其列表中字符串结构较少的任何人。 – chase

1

这是一个蛮力numpy方法:它找到所有单词匹配,然后评分所有组合惩罚偏移。

import numpy as np 
from scipy import signal 

def pen(l, r): 
    return (r-l)*(1-4*(l>r)) 

class template: 
    def __init__(self, template): 
     self.template = np.frombuffer(template.encode('utf32'), offset=4, 
             dtype=np.int32) 
     self.normalise = self.template*self.template 
    def match(self, other): 
     other = np.frombuffer(other.encode('utf32'), offset=4, dtype=np.int32)[::-1] 
     m = signal.convolve(self.template, other, 'valid') 
     t = signal.convolve(self.normalise, np.ones_like(other), 'valid') 
     delta = np.absolute(m - t) 
     md = min(delta) 
     return np.where(delta == md)[0], md 
    def brute(self, tok): 
     ms, md = self.match(tok[0]) 
     matches = [[-md, (tok[0], s, s+len(tok[0]))] for s in ms] 
     for t in tok[1:]: 
      ms, md = self.match(t) 
      matches = [[mo[0] - md - pen(mo[-1][-1], mn)] + mo[1:] 
         + [(t, mn, mn + len(t))] for mn in ms for mo in matches] 
     return sorted(matches, key=lambda x: x[0]) 
#   for t in tok[1:]: 
#    ms, md = self.match(t) 
#    matches = [[mo[0] - md] + mo[1:] 
#       + [(t, mn, mn + len(t))] for mn in ms for mo in matches 
#       if mo[-1][-1] <= mn] 
#   return sorted(matches, key=lambda x: x[0]) 

text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.' 
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.'] 
tx = template(text) 
matches = tx.brute(tok) 
print(matches[-1]) 

# [-11, ('Kir4.3', 0, 6), ('is', 7, 9), ('a', 10, 11), ('inwardly-rectifying', 12, 31), ('potassium', 32, 41), ('channel', 42, 49), ('.', 49, 50), ('Dextran-sulfate', 51, 66), ('is', 67, 69), ('useful', 70, 76), ('in', 77, 79), ('glucose', 80, 87), ('-', 87, 88), ('mediated', 88, 96), ('channels', 97, 105), ('.', 105, 106)] 
+0

这是非常有用的,但我还没有完全理解到底发生了什么。我注意到匹配的输出非常大。我将会执行这个任务数十万次,所以我希望它会很快。知道这些问题的顺序相同,它是否会解决问题?换句话说,序列应该进行类似的处理,'tok'和'text'字符串之间的连接文本之间只有少数字符丢失。 – chase

+0

@chase当然,我已经添加了一个乐观的版本(注释掉),只给出了一个结果,你将不得不试验你为真正的问题需要多少灵活性。 –

+0

非常感谢!这真是一个了不起的算法,我将来可能会使用它来理解DNA比对算法等。然而,对于这个特定的项目,简单的'text.find()'解决方案可以工作,并且它的速度比一个匹配解决方案它看起来像得分大致如下: 蛮力所有匹配:〜0.15秒 蛮力一个匹配:〜0.001秒 text.find:〜0.00001秒(比蛮力所有匹配快〜10000倍) – chase