2016-11-11 54 views
14

Insert a newline every 10 characters in a string using Julia

I want to insert a newline every 10 characters in a protein sequence:

seq="MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ" 

In Perl, this is easy:

$seq=~s/(.{10})/$1\n/g ; # does the job! 

perl -e '$seq="MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ"; $seq=~s/(.{10})/$1\n/g; print $seq' 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 

In Julia,

replace(seq, r"(.{10})" , "\n") 

does not work, because I don't know a way to get the capture group (.{10}) and substitute it with itself + "\n":

julia> replace(seq, r"(.{10})" , "\n") 
"\n\n\n\n\n\n" 

So, to do this, I need two steps:

julia> a=matchall(r"(.{1,10})" ,seq) 
    6-element Array{SubString{UTF8String},1}: 
    "MSKNKSPLLN" 
    "ESEKMMSEML" 
    "PMKVSQSKLN" 
    "YEEKVYIPTT" 
    "IRNRKQHCFR" 
    "RFFPYIALFQ" 

    julia> b=join(a, "\n") 
    "MSKNKSPLLN\nESEKMMSEML\nPMKVSQSKLN\nYEEKVYIPTT\nIRNRKQHCFR\nRFFPYIALFQ" 

    julia> println(b) 
    MSKNKSPLLN 
    ESEKMMSEML 
    PMKVSQSKLN 
    YEEKVYIPTT 
    IRNRKQHCFR 
    RFFPYIALFQ 

# Caution :  
a=matchall(r"(.{10})" ,seq) # wrong if seq is not exactly a multiple of 10 ! 

julia> seq 
"MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIAL" 

julia> matchall(r"(.{10})" ,seq) 
5-element Array{SubString{UTF8String},1}: 
"MSKNKSPLLN" 
"ESEKMMSEML" 
"PMKVSQSKLN" 
"YEEKVYIPTT" 
"IRNRKQHCFR" 

julia> matchall(r"(.{1,10})" ,seq) 
6-element Array{SubString{UTF8String},1}: 
"MSKNKSPLLN" 
"ESEKMMSEML" 
"PMKVSQSKLN" 
"YEEKVYIPTT" 
"IRNRKQHCFR" 
"RFFPYIAL" 

Is there a one-step solution, or a better (faster) way?
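The two-step approach above can also be written as a single expression. The following is only a sketch, assuming a current Julia 1.x release, where matchall no longer exists and eachmatch is used instead; wrap_regex is a hypothetical helper name that does not appear in the thread.

```julia
# Sketch for Julia 1.x (an assumption; the thread itself targets 0.4/0.5).
# eachmatch yields RegexMatch objects; .match is the matched SubString.
# The pattern .{1,n} also captures a final chunk shorter than n.
wrap_regex(seq::AbstractString, n::Integer=10) =
    join((m.match for m in eachmatch(Regex(".{1,$n}"), seq)), '\n')

# 58 residues, so the last line is shorter than 10
seq = "MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIAL"
println(wrap_regex(seq))
```

Because the quantifier is {1,n} rather than {n}, the trailing partial chunk is kept, which is the pitfall the caution above points out.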

Just for fun, here is a benchmark of all these interesting answers! (Updated for Julia 0.5)

function loop(a) 
    last = 0 
    # chunk size, in your case 10 
    salt = 10 
    # walk the string starting at index 10 (Julia indexes from 1); the step is 
    # salt+1 because each inserted '\n' shifts the rest of the string by one 
    for i in salt:salt+1:length(a) 
        # rebuild the string with '\n' inserted after position i 
        a = string(a[1:i], '\n', a[i+1:length(a)]) 
        last = Int64(i) 
    end 
    # break the remainder (for the 60-character test sequence this splits 
    # the last full line, as the output below shows) 
    a = string(a[1:length(a) - last % salt + 1], '\n', a[length(a) - last % salt + 2:length(a)]) 
    println(a) 
end 

function regex1(seq) 
    a=matchall(r"(.{1,10})" ,seq) 
    b=join(a, "\n") 
    println(b) 
end 

function regex2(seq) 
    a=join(split(replace(seq, r"(.{10})", s"\1 ")), "\n") 
    println(a) 
end 

function regex3(seq) 
    a=replace(seq, r"(.{10})", Base.SubstitutionString("\\1\n")) 
    a= chomp(a) # because there is a new line at the end 
    println(a) 
end 

function intrapad(seq::String) 
    buf = IOBuffer((length(seq)*11)>>3) # big enough buffer 
    for i=1:10:length(seq) 
    write(buf,SubString(seq,i,i+9),'\n') 
    end 
    #return 
    print(takebuf_string(buf)) 
end 

function join_substring(seq) 
    a=join((SubString(seq,i,i+9) for i=1:10:length(seq)),'\n') 
    println(a) 
end 

seq="MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ" 

for i = 1:5 
    println("loop :") 
    @time loop(seq) 
    println("regex1 :") 
    @time regex1(seq) 
    println("regex2 :") 
    @time regex2(seq) 
    println("regex3 :") 
    @time regex3(seq) 
    println("intrapad :") 
    @time intrapad(seq) 
    println("join substring :") 
    @time join_substring(seq) 
end 

I changed the benchmark to run @time 5 times, and I am posting here the results after the 5 executions of @time:

loop : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIA 
LFQ 
    0.000013 seconds (53 allocations: 3.359 KB) 
regex1 : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000013 seconds (49 allocations: 1.344 KB) 
regex2 : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000017 seconds (47 allocations: 1.703 KB) 
regex3 : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000013 seconds (31 allocations: 976 bytes) 
intrapad : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000007 seconds (9 allocations: 608 bytes) 
join substring : 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
    0.000012 seconds (21 allocations: 800 bytes) 

intrapad is now first ;)

+3

I don't know of another solution, but the two steps can be turned into a one-liner like this: seq = "MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ"; println(join(matchall(r"(.{10})", seq), "\n")); – AbhiNickz

+1

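So I checked the documentation again: println(replace("ABHISHEKBHASKERMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ", r".{10}", s"\g<0>sss")); If I replace sss with \n here, this should work, but according to the documentation you "refer to the nth capture group by using \n", and that is the problem here. – AbhiNickz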

+0

Yes, @AbhiNickz, replace(seq, r"(.{10})", s"\g<0>\n") produces an error, but inserting a blank is a good solution: replace(seq, r"(.{10})", s"\g<0> ") works. – Fred

Answers

10

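As @daycaster suggested, you can use s"\1" as a replacement string that supports capture groups. The problem is that the special s"" string syntax does not support special characters such as \n. You can work around this by constructing the SubstitutionString object manually, but then you need to escape the backslash, writing \\1: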

julia> replace(seq, r"(.{10})", Base.SubstitutionString("\\1\n")) 
"MSKNKSPLLN\nESEKMMSEML\nPMKVSQSKLN\nYEEKVYIPTT\nIRNRKQHCFR\nRFFPYIALFQ\n" 
3

Something like:

julia> split(replace(seq, r"(.{10})", s"\1 ")) 
6-element Array{SubString{String},1}: 
"MSKNKSPLLN" 
"ESEKMMSEML" 
"PMKVSQSKLN" 
"YEEKVYIPTT" 
"IRNRKQHCFR" 
"RFFPYIALFQ" 

If you want the result as a single string, use join():

julia> join(split(replace(seq, r"(.{10})", s"\1 ")), "\n") 
"MSKNKSPLLN\nESEKMMSEML\nPMKVSQSKLN\nYEEKVYIPTT\nIRNRKQHCFR\nRFFPYIALFQ" 

julia> println(ans) 
MSKNKSPLLN 
ESEKMMSEML 
PMKVSQSKLN 
YEEKVYIPTT 
IRNRKQHCFR 
RFFPYIALFQ 
+0

The result is an array, just like with: matchall(r"(.{10})", seq) – Fred

+0

This will fail for strings that contain spaces. –

+1

@MattB. My protein sequences contain spaces? So that's why I'm always hungry...! – daycaster
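One way to sidestep the whitespace problem raised above is to skip the intermediate space and put a real newline into the replacement. This is a sketch only, assuming Julia 1.x, where replace takes a pattern => replacement pair; wrap_replace is a hypothetical name, and the substitution string is built from an ordinary string literal so that the newline is a real character.

```julia
# Julia 1.x sketch (an assumption; the thread uses the older 3-argument replace).
# "\\1\n" is an ordinary string: a backslash, the digit 1, and a real newline.
# chomp drops the single trailing newline that appears when the length is a
# multiple of 10.
wrap_replace(seq::AbstractString) =
    chomp(replace(seq, r"(.{10})" => Base.SubstitutionString("\\1\n")))

# Input containing a space: the space stays inside its chunk
println(wrap_replace("MSKNK PLLNESEKMMSEMLPMK"))
```

Note that chomp would also strip a newline that was already at the end of the input, which may or may not be what is wanted.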

3

I don't know how you could do it with a regex, but I think this solves your problem:

a = "oiaoueaoeuaoeuaoeuaoeuaoteuhasonetuhaonetuahounsaothunsaotuaosu" 
# chunk size, in your case 10 
salt = 10 
# position after which the next '\n' goes (Julia indexes from 1) 
i = salt 
# re-check length(a) on every pass, because the string grows by one per insert 
while i < length(a) 
    # rebuild the string with '\n' inserted after position i 
    a = string(a[1:i], '\n', a[i+1:end]) 
    # skip the next chunk plus the newline just inserted 
    i += salt + 1 
end 
println(a) 
+2

More readable than the Perl version :) – daycaster

+0

This will fail for strings that contain non-ASCII characters. –

+0

@MattB. How can I correct it? – pmargreff
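One possible correction is to chunk by characters instead of by byte index, so multi-byte UTF-8 characters are never cut in half. The sketch below assumes Julia 1.x; wrap_chars is a hypothetical name, not something proposed in the thread.

```julia
# Julia 1.x sketch (an assumption): split by characters, not by byte index.
function wrap_chars(s::AbstractString, n::Integer=10)
    chars = collect(s)   # Vector{Char}, one entry per character
    # Slice the character vector in steps of n; the last slice may be shorter
    chunks = (String(chars[i:min(i + n - 1, end)]) for i in 1:n:length(chars))
    return join(chunks, '\n')
end

# 15 Greek letters, each two bytes in UTF-8
println(wrap_chars("αβγδεζηθικλμνξο", 10))
```

Collecting the characters costs an extra allocation, so for pure ASCII data such as protein sequences the byte-indexed versions are likely to remain faster.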

7

If speed is a concern, it may be best to avoid heavier tools such as regular expressions and try something like this:

function intrapad(seq::String) 
    buf = IOBuffer((length(seq)*11)>>3) # big enough buffer 
    for i=1:10:length(seq) 
        write(buf,SubString(seq,i,i+9),'\n') 
    end 
    return takebuf_string(buf) 
end 

The speed comes from using an IOBuffer and SubStrings, which minimizes allocations. Using the BenchmarkTools package we have:

julia> @benchmark intrapad(seq) 
BenchmarkTools.Trial: 
    memory estimate: 624.00 bytes 
    allocs estimate: 10 
    minimum time:  729.00 ns (0.00% GC) 
    median time:  767.00 ns (0.00% GC) 
    mean time:  862.99 ns (7.84% GC) 
    maximum time:  26.86 μs (96.21% GC) 

julia> @benchmark replace(seq, r"(.{10})", Base.SubstitutionString("\\1\n")) 
BenchmarkTools.Trial: 
    memory estimate: 720.00 bytes 
    allocs estimate: 26 
    minimum time:  2.18 μs (0.00% GC) 
    median time:  2.29 μs (0.00% GC) 
    mean time:  2.43 μs (3.85% GC) 
    maximum time:  531.31 μs (98.95% GC) 

Only about a 3x speedup (767 ns vs. 2.29 μs median). The replace function is well implemented!
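For readers on a newer Julia, the same IOBuffer technique might look like the sketch below. It assumes Julia 1.x, where takebuf_string has been replaced by String(take!(buf)) and SubString is bounds-checked; intrapad1 is a hypothetical name, and like the original it assumes ASCII input because it indexes by byte.

```julia
# Julia 1.x sketch (an assumption): IOBuffer + SubString, ASCII input only.
function intrapad1(seq::AbstractString, n::Integer=10)
    nbytes = ncodeunits(seq)
    # Reserve room for the sequence plus one newline per chunk
    buf = IOBuffer(sizehint = nbytes + cld(nbytes, n))
    stop = lastindex(seq)
    for i in 1:n:stop
        # Clamp the end index so the last, shorter chunk stays in bounds
        write(buf, SubString(seq, i, min(i + n - 1, stop)), '\n')
    end
    return String(take!(buf))
end

print(intrapad1("MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ"))
```

As in the original, every chunk is followed by a newline, so the result ends with '\n'.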

Another way to go without regular expressions is

join((SubString(seq,i,i+9) for i=1:10:length(seq)),'\n') 

This is not as fast (10x slower on my machine, though with no memory-allocation penalty), but it is very readable.

+1

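These functions work only for ASCII strings, because they rely on byte-per-character indexing. But that should be fine for things like genomic sequences (or, in future versions of Julia, checked strings). –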

+0

The last join example gives the following error: ERROR: LoadError: syntax: missing separator in tuple –

+0

That is a version issue. The example works on 0.5. Are you on 0.4? –
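If the parse error does come from the generator expression, which was introduced in Julia 0.5, an array comprehension is one possible workaround. The sketch below is written for Julia 1.x and has not been verified on 0.4; wrap_comprehension is a hypothetical name, and the min clamp is there because newer Julia versions bounds-check SubString.

```julia
# Sketch (an assumption): array comprehension instead of a generator expression.
# ASCII input only, since length(seq) is used as a byte index.
function wrap_comprehension(seq, n=10)
    len = length(seq)   # equals the byte length only for ASCII input
    # Build the vector of chunks first, then join them with newlines
    return join([SubString(seq, i, min(i + n - 1, len)) for i = 1:n:len], '\n')
end

println(wrap_comprehension("MSKNKSPLLNESEKMMSEMLPMKVSQSKLNYEEKVYIPTTIRNRKQHCFRRFFPYIALFQ"))
```

The comprehension materializes the array of chunks before joining, so it allocates slightly more than the generator form.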