2014-10-17 198 views
2

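Compare FILE1 values against FILE2 ranges and print matches

I am new to Perl and am working on a bioinformatics project at university. I have FILE1, which contains a list of positions, in the format: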

99269 
550 
100 
126477 
1700 

and FILE2, which has the format:

517 1878 forward 
700 2500 forward 
2156 3289 forward 
99000 100000 forward 
22000 23000 backward 

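I want to compare each position in FILE1 against every range in FILE2, and if a position falls within one of the ranges, print the position, the range, and the direction.

So my expected output would be: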

99269 99000 100000 forward 
550 517 1878 forward 
1700 517 1878 forward 

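At the moment it runs without errors but produces no output, so I am not sure where I am going wrong! When I split up the final 'if' rule it runs, but it only works when the position is on exactly the same line as the range.

My code is as follows: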

#!/usr/bin/perl 

use strict; 
use warnings; 

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt"; 

open FILE1, "/Users/edwardtickle/Documents/CC22positions.txt" 
    or die "cannot open > CC22: $!"; 

open FILE2, "/Users/edwardtickle/Documents/CDSpositions.txt" 
    or die "cannot open > CDS: $!"; 

open(OUTPUTFILE, ">$outputfile") or die "Could not open output file: $! \n"; 

while (<FILE1>) { 
    if (/^(\d+)/) { 
     my $CC22 = $1; 

     while (<FILE2>) { 
      if (/^(\d+)\s+(\d+)\s+(\S+)/) { 
       my $CDS1 = $1; 
       my $CDS2 = $2; 
       my $CDS3 = $3; 

       if ($CC22 > $CDS1 && $CC22 < $CDS2) { 
        print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n"; 
       } 
      } 
     } 
    } 
} 

close(FILE1); 
close(FILE2); 

I have also posted the same question on PerlMonks.

+1

[Crossposted at PerlMonks](http://www.perlmonks.org/?node_id=1104164). – choroba 2014-10-17 10:22:02

+2

1700 fits two ranges ('517 1878' and '700 2500'), but your expected output shows only one of them. What is your criterion for choosing? – TLP 2014-10-17 11:28:00

+1

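The data here is made up, and the ranges are actually segments of a genome, so it is fine if a position matches twice, as long as it finds a range! Thanks for pointing that out, though. – 2014-10-17 11:58:37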

Answers

2

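Because FILE2 is only read through once, it is only compared against the first line of FILE1.

For every subsequent line of FILE1, the FILE2 handle is already at end-of-file, so the inner loop reads nothing.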
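A minimal sketch of that behaviour, using in-memory filehandles in place of the real files. The sample strings and the `count_inner_reads` helper are assumptions made for illustration; the helper counts how many lines the inner loop manages to read for each line of the outer file:

```perl
use strict;
use warnings;

# Hypothetical helper: for each outer line, record how many inner lines were read.
sub count_inner_reads {
    my ($outer_text, $inner_text) = @_;

    open my $outer_fh, '<', \$outer_text or die "cannot open outer data: $!";
    open my $inner_fh, '<', \$inner_text or die "cannot open inner data: $!";

    my @counts;
    while (my $outer_line = <$outer_fh>) {
        my $read = 0;
        # After the first pass this handle sits at end-of-file,
        # so the loop body does not run again.
        while (my $inner_line = <$inner_fh>) {
            $read++;
        }
        push @counts, $read;
    }
    close $inner_fh;
    close $outer_fh;
    return @counts;
}

my @counts = count_inner_reads(
    "99269\n550\n1700\n",
    "517 1878 forward\n99000 100000 forward\n",
);
print "@counts\n";    # prints "2 0 0"
```

Only the first outer line sees any range lines, which matches the symptom described in the question.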

Stash the lines from FILE1 in an array, then compare each line in FILE2 against each array entry, as shown below:

#!/usr/bin/perl 

use strict; 
use warnings; 

my $outputfile = "out.txt"; 

open FILE1, "file1.txt" 
    or die "cannot open > CC22: $!"; 

open FILE2, "file2.txt" 
    or die "cannot open > CDS: $!"; 

open(OUTPUTFILE, ">$outputfile") or die "Could not open output file: $! \n"; 
my @file1list =(); 

while (<FILE1>) { 
    if (/^(\d+)/) { 
     push @file1list, $1; 
    } 
} 

while (<FILE2>) { 
    if (/^(\d+)\s+(\d+)\s+(\S+)/) { 
     my $CDS1 = $1; 
     my $CDS2 = $2; 
     my $CDS3 = $3; 

     for my $CC22 (@file1list) { 
      if ($CC22 > $CDS1 && $CC22 < $CDS2) { 
       print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n"; 
      } 
     } 
    } 
} 

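(There are also some programming style issues, such as the uppercase variable names, but I have ignored those. This is a pretty good program for a beginner.)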
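If the nested-loop structure is kept instead, an alternative (not the approach taken in the answer above) is to rewind the inner filehandle with Perl's built-in `seek` before each pass. A rough sketch, in which the in-memory sample data and the `match_with_rewind` helper are assumptions standing in for the real files:

```perl
use strict;
use warnings;

# Hypothetical helper: nested loops, rewinding the ranges handle for every position.
sub match_with_rewind {
    my ($positions_text, $ranges_text) = @_;

    open my $pos_fh,   '<', \$positions_text or die "cannot open positions: $!";
    open my $range_fh, '<', \$ranges_text    or die "cannot open ranges: $!";

    my @matches;
    while (my $line = <$pos_fh>) {
        next unless $line =~ /^(\d+)/;
        my $position = $1;

        # Go back to the start of the ranges before scanning them again.
        seek $range_fh, 0, 0 or die "cannot rewind ranges: $!";
        while (my $range_line = <$range_fh>) {
            next unless $range_line =~ /^(\d+)\s+(\d+)\s+(\S+)/;
            my ($start, $end, $dir) = ($1, $2, $3);
            push @matches, "$position $start $end $dir"
                if $position > $start && $position < $end;
        }
    }
    close $range_fh;
    close $pos_fh;
    return @matches;
}

print "$_\n" for match_with_rewind(
    "99269\n550\n100\n126477\n1700\n",
    "517 1878 forward\n700 2500 forward\n2156 3289 forward\n99000 100000 forward\n22000 23000 backward\n",
);
```

This re-reads the ranges once per position, so for large files, loading one file into memory first is usually the better trade. Note that it reports every matching range, so 1700 appears twice with the sample data.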

+0

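Works perfectly, thank you for the speedy reply! I don't have enough reputation to upvote your answer, but I will come back once I have enough. Out of interest, is it recommended not to use capital letters in variable names, to avoid case-sensitivity errors? – 2014-10-17 11:37:54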

+0

Have a quick look at http://perldoc.perl.org/perlstyle.html. All-uppercase variable names are usually constants or internal variables used by Perl itself. – Vorsprung 2014-10-17 13:01:41

0

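I thought I could simplify things a little by using split instead of regular expressions, but I think my code actually ended up longer and harder to read! In any case, keep in mind that split works great for this kind of problem: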

# User config area 
my $positions_file = 'input_positions.txt'; 
my $ranges_file = 'input_ranges.txt'; 
my $output_file = 'output_data.txt'; 

# Reading data 
open my $positions_fh, "<", $positions_file or die "Cannot open $positions_file: $!";
open my $ranges_fh, "<", $ranges_file or die "Cannot open $ranges_file: $!";
chomp(my @positions = <$positions_fh>); 
# Store the range data in an array containing hash tables 
my @range_data; 
# to be used like $range_data[0] = {start => $start, end => $end, dir => $dir} 
while (<$ranges_fh>) { 
    chomp; 
    my ($start, $end, $dir) = split; #splits $_ according to whitespace 
    push @range_data, { start => $start, end => $end, dir => $dir }; 
    #print "start: $start, end: $end, direction: $dir\n"; 
} #/while 
close $positions_fh; 
close $ranges_fh; 

# Data processing: 
open my $output_fh, ">", $output_file or die "Cannot open $output_file: $!";
#It feels like it should be more efficient to process one range at a time for all data points 
foreach my $range (@range_data) { #start one range at a time 
            #each $range = $range_data[#] = { hash table } 
    foreach my $position (@positions) { #check all positions 
     if (($range->{start} <= $position) and ($position <= $range->{end})) { 
      my $output_string = "$position " . $range->{start} . " " . $range->{end} . " " . $range->{dir} . "\n"; 
      print $output_fh $output_string; 
     }         #/if 
    } #/foreach position 
} #/foreach range 

close $output_fh; 

This code would probably run faster if the data processing were done inside the while loop, while the range data is being read.
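One way that suggestion might look, offered as a sketch rather than a benchmarked improvement: the positions are loaded once, and each range is checked as soon as its line is read, so the intermediate `@range_data` array is never built. The `stream_matches` helper and the in-memory sample data are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: checks every position against each range
# while the ranges are still being read. Returns the number of matches.
sub stream_matches {
    my ($positions_fh, $ranges_fh, $output_fh) = @_;

    # Keep only the leading digits of each line; skip lines without a number.
    my @positions = map { /^(\d+)/ ? $1 : () } <$positions_fh>;

    my $matched = 0;
    while (my $line = <$ranges_fh>) {
        my ($start, $end, $dir) = split ' ', $line;
        next unless defined $dir;    # skip blank or incomplete lines
        for my $position (@positions) {
            if ($start <= $position and $position <= $end) {
                print {$output_fh} "$position $start $end $dir\n";
                $matched++;
            }
        }
    }
    return $matched;
}

# Sample data in place of the real input files.
open my $positions_fh, '<', \"99269\n550\n100\n126477\n1700\n"
    or die "Cannot open positions: $!";
open my $ranges_fh, '<', \"517 1878 forward\n99000 100000 forward\n"
    or die "Cannot open ranges: $!";
stream_matches($positions_fh, $ranges_fh, \*STDOUT);
```

As in the code above, the bounds are inclusive and the output is grouped by range rather than by position.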

0

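Your bug occurs because you are nesting the file processing, so your inner loop can only go through the contents of its file once and then gets stuck at eof.

The simplest solution is to load the inner loop's file into memory first.

The following demonstrates this using more Modern Perl techniques: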

#!/usr/bin/perl 
use strict; 
use warnings; 
use autodie; 

my $cc22file = "/Users/edwardtickle/Documents/CC22positions.txt"; 
my $cdsfile = "/Users/edwardtickle/Documents/CDSpositions.txt"; 
my $outfile = "/Users/edwardtickle/Documents/CC22CDS.txt"; 

my @ranges = do { 
    # open my $fh, '<', $cdsfile; # Using Fake Data instead below 
    open my $fh, '<', \ "517 1878 forward\n700 2500 forward\n2156 3289 forward\n99000 100000 forward\n22000 23000 backward\n"; 
    map {[split]} <$fh>; 
}; 

# open my $infh, '<', $cc22file; # Using Fake Data instead below 
open my $infh, '<', \ "99269\n550\n100\n126477\n1700\n"; 

# open my $outfh, '>', $outfile; # Using STDOUT instead below 
my $outfh = \*STDOUT; 

CC22: 
while (my $cc22 = <$infh>) { 
    chomp $cc22; 

    for my $cds (@ranges) { 
     if ($cc22 > $cds->[0] && $cc22 < $cds->[1]) { 
      print $outfh "$cc22 @$cds\n"; 
      next CC22; 
     } 
    } 

    # warn "$cc22 No match found\n"; 
} 

Output:

99269 99000 100000 forward 
550 517 1878 forward 
1700 517 1878 forward 

