2014-09-27 32 views
0

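How do I remove x entries with identical strings and keep only one, with a modified header?

I have a question for all you awk/sed/perl experts. I have a file in the following format: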

>GALHOMG00000016026_1 GALHOMT00000016026_1 GALHOMP00000016026_1 JH556633.1:35740-45316 1 
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS 
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC 
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR 
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG 
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF 
SNLNLGRTYASGYAHYEEPED 

>HUMHOMG00000262990_1 HUMHOMT00000262990_1 HUMHOMP00000262990_1 JH556633.1:35740-45316 1 
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS 
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC 
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR 
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG 
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF 
SNLNLGRTYASGYAHYEEPED 

>TGUHOMG00000002432_1 TGUHOMT00000002432_1 TGUHOMP00000002432_1 JH556633.1:35740-45316 1 
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS 
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC 
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR 
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG 
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF 
SNLNLGRTYASGYAHYEEPED 

I want to transform the file into the following:

>JH556633.1:35740-45316 
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS 
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC 
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR 
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG 
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF 
SNLNLGRTYASGYAHYEEPED 

I know I can change the header lines (the lines starting with >) like this:

awk 'NF > 1{$0=">"$4}; {print $0}' file.fa > file2.fa 

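My question is: how do I remove the other two entries? There may be cases in the file where the sequences (that is, everything except the header line) are not identical. In that case, I would like to append a suffix based on the number of entries that share the same identifier (in this example, something like JH556633.1-1:35740-45316 and JH556633.1-2:35740-45316, or similar). The point is to make identical headers (the lines starting with >) distinct, while keeping the original sequences when they differ.

If anyone has an idea for solving this, I would greatly appreciate the help. Thanks!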

+0

OK! Got it! – BashN3wb 2014-09-27 05:17:13

+0

Do you mean the lines that start with a greater-than sign, or the lines that follow them? – 2014-09-27 07:02:00

+0

Please show us your attempt at solving the problem (not just the awk command you posted, which only handles the first line). – 2014-09-27 07:28:59

Answers

1

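This should work for you. It does not rely on blank lines between sequences, since not all FASTA files have them. It appends _N to each ID, where N is the number of times that ID has been seen. An ID associated with only a single sequence gets _1. If an ID is associated with several different sequences, all of those sequences are printed.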

#!/usr/bin/env perl 
use strict; 
use warnings; 

## The field of the ID line you want to keep. 
## Since we start counting from 0, to get the 4th 
## field, set this to 3. 
my $want=3; 

my (@fields,%seqs,%seen,$seq); 
## Read the input file 
while (<>) { 
    ## Skip blank lines 
    next if /^\s*$/; 
    ## remove trailing newlines 
    chomp; 
    ## Is this an ID line? 
    if (/^\s*>(.*)/) {
        ## Save the previous sequence (if any). The %seqs
        ## hash has the sequence as a key and the desired
        ## ID as a value.
        if ($fields[0]) {
            $seqs{$seq}=$fields[$want];
            ## Clear the previous sequence and IDs
            $seq="";
            @fields=();
        }
        ## Split the ID fields into @fields.
        @fields=split(/\s+/);
    }
    ## If this is a sequence, add to $seq
    else {
        $seq.=$_;
    }
} 
## Get the last sequence 
$seqs{$seq}=$fields[$want];     

foreach my $sequence (sort keys(%seqs)) { 
    ## Add an identifier. 
    $seen{$seqs{$sequence}}++; 
    print ">$seqs{$sequence}_$seen{$seqs{$sequence}}\n"; 
    ## Convert the sequence back to FASTA 
    $sequence=~s/(.{60})/$1\n/g; 
    print "$sequence\n"; 
} 

Save the script as foo.pl, make it executable with chmod 744 foo.pl, and run it:

$ ./foo.pl file.fa 
>JH556633.1:35740-45316_1 
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS 
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC 
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR 
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG 
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF 
SNLNLGRTYASGYAHYEEPED 
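
For readers who prefer awk, the same idea (collapse identical sequences, number the ID only when it maps to several different sequences) can be sketched as below. This is a minimal sketch, not the script above: the function name dedupe_fasta is hypothetical, it assumes the wanted ID is always the 4th whitespace-separated header field, and unlike the Perl script it keeps input order and leaves a unique ID without a suffix.

```shell
# Hypothetical helper; reads FASTA from stdin or from the files given as arguments.
dedupe_fasta() {
    awk '
    # Store the finished record unless this exact ID + sequence pair was already stored.
    function save_record() {
        if (id != "" && !((id, seq) in seen)) {
            seen[id, seq] = 1
            n++
            ids[n] = id
            seqs[n] = seq
            total[id]++
        }
        seq = ""
    }
    /^[ \t]*$/ { next }                        # skip blank lines
    /^>/ { save_record(); id = $4; next }      # header: field 4 becomes the ID (assumption)
    { gsub(/[ \t\r]/, ""); seq = seq $0 }      # sequence line: strip whitespace, append
    END {
        save_record()
        for (i = 1; i <= n; i++) {
            hdr = ids[i]
            # Number the ID only when it has more than one distinct sequence.
            if (total[hdr] > 1) hdr = hdr "_" (++used[ids[i]])
            print ">" hdr
            # Re-wrap the sequence at 60 characters per line.
            s = seqs[i]
            while (length(s) > 60) { print substr(s, 1, 60); s = substr(s, 61) }
            if (s != "") print s
        }
    }' "$@"
}
```

It would be called as dedupe_fasta file.fa > file2.fa. Everything is held in memory until the end of input, because whether an ID needs a suffix is only known once all of its sequences have been read.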

0

Assuming, based on the input you posted, that $4 cannot contain & or \<digit> (if it can, it's a simple tweak):

$ awk -v RS= '!seen[$4]++{sub(/[^\n]+/,">"$4);print}' file 
>JH556633.1:35740-45316 
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS 
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC 
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR 
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG 
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF 
SNLNLGRTYASGYAHYEEPED 
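
The caveat about & exists because sub() treats & in the replacement string as "the matched text". One possible form of the tweak, offered as a sketch only, is to build the header by plain concatenation instead of calling sub(). The function name first_per_id is hypothetical, and the sketch still assumes blank-line-separated entries whose ID is the 4th header field.

```shell
# Hypothetical helper; keeps the first entry per ID without calling sub().
first_per_id() {
    awk -v RS= '
    !seen[$4]++ {
        nl = index($0, "\n")                     # position where the header line ends
        print ">" $4                             # new header; & has no special meaning here
        if (nl > 0) print substr($0, nl + 1)     # sequence lines, unchanged
    }' "$@"
}
```

It would be called as first_per_id file. As in the one-liner, only the first entry for each ID survives, so later entries with the same ID but a different sequence are dropped.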

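It looks like you have another question as well, so post a new question with some representative input and the expected output for that problem.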

0
sed -n 's/^>\([^ ]\{1,\} \)\{3\}/>/;/^ *$/q;p' YourFile 

Based on your sample (POSIX version, so use --posix with GNU sed).
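
Read piece by piece, the command strips the first three header fields, quits at the first blank line, and prints everything before that, so it simply keeps the first entry. As written it leaves whatever follows the 4th field (the trailing " 1" in the sample) on the header. A hedged variant that captures only the 4th field might look like the sketch below. The function name first_entry_sed is hypothetical, and it assumes space-separated header fields and a blank line after the first entry.

```shell
# Hypothetical helper; prints only the first entry, with the 4th field as its header.
first_entry_sed() {
    # 1st expression: on a header line, skip three fields, keep the 4th, drop the rest.
    # 2nd expression: quit at the first blank line, which ends the first entry.
    # 3rd expression: print every line that reaches this point.
    sed -n \
        -e 's/^>\([^ ]\{1,\} \)\{3\}\([^ ]*\).*/>\2/' \
        -e '/^ *$/q' \
        -e 'p' "$@"
}
```

It would be called as first_entry_sed YourFile. Because it stops at the first blank line, it is only suitable when all entries in the file are known to be duplicates of the first.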
