2014-09-19 55 views
-1

我写了一个PERL程序,该程序需要一个Excel工作表(通过将.xls扩展名改为.txt来转换为文本文件)以及一个用于输入的序列文件。 Excel工作表包含序列文件中某个区域的起始点和结束点(以及匹配区域任一侧的70个侧翼值),这些文件需要剪切并提取到第三个输出文件中。有300个值。程序读入每次需要切割的序列的起始点和结束点,但它重复告诉我,如果显然不是输入文件的长度,则该值超出了该长度。我只是不能似乎得到这个固定Perl程序错误

这是程序

use strict; 
use warnings; 

my $blast; 
my $i; 
my $idline; 
my $sequence; 
print "Enter Your BLAST result file name:\t"; 
chomp($blast = <STDIN>); # BLAST result file name 
print "\n"; 

my $database; 
print "Enter Your Gene list file name:\t"; 
chomp($database = <STDIN>); # sequence file 
print "\n"; 

open IN, "$blast" or die "Can not open file $blast: $!"; 

my @ids  =(); 
my @seq_start =(); 
my @seq_end =(); 

while (<IN>) { 

    #spliting the result file based on each tab 
    my @feilds = split("\t", $_); 
    push(@ids, $feilds[0]); #copying the name of sequence 
     #coping the 6th tab value of the result which is the start point of from where a value should be cut. 
    push(@seq_start, $feilds[6]); 
    #coping the 7th tab value of the result file which is the end point of a value should be cut. 
    push(@seq_end, $feilds[7]); 
} 
close IN; 

open OUT, ">Result.fasta" or die "Can not open file $database: $!"; 

for ($i = 0; $i <= $#ids; $i++) { 

    ($sequence) = &block($ids[$i]); 

    ($idline, $sequence) = split("\n", $sequence); 

    #extracting the sequence from the start point to the end point 
    my $seqlen = $seq_end[$i] - $seq_start[$i] - 1; 

    my $Nucleotides = substr($sequence, $seq_start[$i], $seqlen); #storing the extracted substring into $sequence 

    $Nucleotides =~ s/(.{1,60})/$1\n/gs; 

    print OUT "$idline\n"; 
    print OUT "$Nucleotides\n"; 
} 
print "\nExtraction Completed..."; 

sub block { 
    #block for id storage which is the first tab in the Blast output file. 
    my $id1 = shift; 
    print "$id1\n"; 
    my $start =(); 

    open IN3, "$database" or die "Can not open file $database: $!"; 

    my $blockseq = ""; 
    while (<IN3>) { 

     if (($_ =~ /^>/) && ($start)) { 

      last; 
     } 

     if (($_ !~ /^>/) && ($start)) { 

      chomp; 
      $blockseq .= $_; 
     } 

     if (/^>$id1/) { 

      my $start = $. - 1; 
      my $blockseq .= $_; 
     } 
    } 
    close IN3; 

    return ($blockseq); 
} 

BLAST结果文件:http://www.fileswap.com/dl/Ws7ehftejp/

序列文件:http://www.fileswap.com/dl/lPwuGh2oKM/

错误

SUBSTR之外字符串在Nucleotide_Extractor.pl第39行。

0在 Nucleotide_Extractor.pl线在Nucleotide_Extractor.pl线44 41.

使用未初始化值$核苷酸的级联(。)或串

使用未初始化值$核苷酸的置换(一个或多个///)。

任何帮助是非常赞赏和查询总是被邀请

+0

什么是phytophthora文件?没有它,我无法处理块功能。你看起来像substr(“Hello”,45,4)那样带有字符串长度以外的起始索引的子字符串。由于它不返回$核苷酸也未初始化。我建议你检查substr的索引。 – xtreak 2014-09-19 05:42:51

+0

@Wordzilla这是我在问题中提供的链接所使用的序列文件名。我已经将两个输入文件上传到fileswap并提供了链接。请下载这两个文件并进行处理。该序列属于名为Phytophthora的生物体。我现在改了文件名。谢谢 – 2014-09-19 05:56:01

+0

您还应该在脚本中使用strict,并使用'my'声明所有变量 - 即'my $ sequence = ...'。 – 2014-09-19 07:04:22

回答

2

有几个问题与现有的代码,我结束了重写剧本,而修复这些错误。您的实施效率不高,因为它会打开,读取并关闭Excel工作表中每个ID的序列文件。更好的方法是从序列文件中读取和存储数据,或者如果内存有限,请遍历序列文件中的每个条目并从Excel文件中选取相应的数据。你也可以更好地使用散列,而不是数组;哈希将数据存储在键 - 值对中,因此查找您要查找的内容更加容易。我也一直使用引用,因为它们可以很容易地将数据传入和传出子例程。

如果您不熟悉perl数据结构,请查看perlfaq4perldscperlreftut有关于使用引用的信息。

现有代码的主要问题是从fasta文件获取序列的子例程没有返回任何内容。在你的代码中放置大量的调试语句以确保它正在做你认为正在做的事情是一个好主意。我留在我的调试声明中,但将它们评论出来。我也大量地评论了我改变的代码。

#!/usr/bin/perl 
use strict; 
use warnings; 
# enables 'say', which prints out your text and adds a carriage return 
use feature ':5.10'; 
# a very useful module for dumping out data structures 
use Data::Dumper; 

#my $blast = 'infesmall.txt'; 
print "Enter Your BLAST result file name:\t"; 
chomp($blast = <STDIN>);  # BLAST result file name 
print "\n"; 

#my $database = 'infe.fasta'; 
print "Enter Your Gene list file name:\t"; 
chomp($database = <STDIN>); # sequence file 
print "\n"; 

open IN,"$blast" or die "Can not open file $blast: $!"; 

# instead of using three arrays, let's use a hash reference! 
# for each ID, we want to store the start and the end point. To do that, 
# we'll use a hash of hashes. The start and end information will be in one 
# hash reference: 
# { start => $fields[6], end => $fields[7] } 
# and we will use that hashref as the value in another hash, where the key is 
# the ID, $fields[0]. This means we can access the start or end data using 
# code like this: 
# $info->{$id}{start} 
# $info->{$id}{end} 
my $info; 

while(<IN>){ 
    #splitting the result file based on each tab 
    my @fields = split("\t",$_); 
    # add the data to our $info hashref with the ID as the key: 
    $info->{ $fields[0] } = { start => $fields[6], end => $fields[7] }; 
} 
close IN; 

#say "info: " . Dumper($info); 

# now read the sequence info from the fasta file 
my $sequence = read_sequences($database); 
#say "data from read_sequences:\n" . Dumper($sequence); 

my $out = 'result.fasta'; 
open(OUT, ">" . $out) or die "Can not open file $out: $!"; 

foreach my $id (keys %$info) { 

    # check whether the sequence exists 
    if ($sequence->{$id}) { 
     #extracting the sequence from the start point to the end point 
     my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1; 

     #say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end}; 

     #storing the extracted substring into $sequence 
     my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen); 
     $nucleotides =~ s/(.{1,60})/$1\n/gs; 
     #say "nucleotides: $nucleotides"; 
     print OUT $sequence->{$id}{header} . "\n"; 
     print OUT "$nucleotides\n"; 
    } 
} 
print "\nExtraction Completed..."; 

sub read_sequences { 
    # fasta file 
    my $fasta_file = shift; 

    open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!"; 

    # initialise two variables. We will store our sequence data in $fasta 
    # and use $id to track the current sequence ID 
    # the $fasta hash will look like this: 
    # $fasta = { 
    # 'gi|7212472|ref|NC_002387.2' => { 
    #  header => '>gi|7212472|ref|NC_002387.2| Phytophthora...', 
    #  seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...', 
    # } 
    #} 

    my ($fasta, $id); 

    while(<IN3>){ 
     chomp; 
     if (/^>/) { 
      if (/^>(\S+) /){ 
       # the header line with the sequence info. 
       $id = $1; 
       # save the data to the $fasta hash, keyed by seq ID 
       # we're going to build up an entry as we go along 
       # set the header to the current line 
       $fasta->{ $id }{ header } = $_; 
      } 
      else { 
       # no ID found! Erk. Emit an error and undef $id. 
       warn "Formatting error: $_"; 
       undef $id; 
      } 
     } 
     ## ensure we're getting sequence lines... 
     elsif (/^[ATGC]/) { 
      # if $id is not defined, there's something weird going on, so 
      # don't save the sequence. In a correctly-formatted file, this 
      # should not be an issue. 
      if ($id) { 
       # if $id is set, add the line to the sequence. 
       $fasta->{ $id }{ seq } .= $_; 
      } 
     } 
    } 
    close IN3; 
    return $fasta; 
} 
+0

赞赏...谢谢你的时间和麻烦。 – 2014-09-20 04:12:43