Compare FILE1 values against FILE2 ranges and print matches

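I am new to Perl and am working on a bioinformatics project at university. I have FILE1 containing a list of positions, in the format:

99269
550
100
126477
1700

and FILE2 in the format: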

517 1878 forward 
700 2500 forward 
2156 3289 forward 
99000 100000 forward 
22000 23000 backward

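I would like to compare each position in FILE1 against every range in FILE2, and if a position falls within one of the ranges, print the position, the range, and the direction.

So my expected output would be: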

99269 99000 100000 forward 
550 517 1878 forward 
1700 517 1878 forward

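At the moment it runs without errors but does not output anything, so I am not sure where I am going wrong! When I split up the final 'if' rule it runs, but it only works when the position is on exactly the same line as the range.

My code is as follows: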

#!/usr/bin/perl 

use strict; 
use warnings; 

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt"; 

open FILE1, "/Users/edwardtickle/Documents/CC22positions.txt" 
    or die "cannot open > CC22: $!"; 

open FILE2, "/Users/edwardtickle/Documents/CDSpositions.txt" 
    or die "cannot open > CDS: $!"; 

open(OUTPUTFILE, ">$outputfile") or die "Could not open output file: $! \n"; 

while (<FILE1>) {
    if (/^(\d+)/) {
        my $CC22 = $1;

        while (<FILE2>) {
            if (/^(\d+)\s+(\d+)\s+(\S+)/) {
                my $CDS1 = $1;
                my $CDS2 = $2;
                my $CDS3 = $3;

                if ($CC22 > $CDS1 && $CC22 < $CDS2) {
                    print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
                }
            }
        }
    }
}

close(FILE1); 
close(FILE2);

I have posted the same question on Perlmonks.

Source

2014-10-17 Ed Tickle

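[Crossposted at PerlMonks](http://www.perlmonks.org/?node_id=1104164). – choroba 2014-10-17 10:22:02

1700 fits two ranges ('517 1878' and '700 2500'), but you only want one of them. What is your criterion for choosing that one? – TLP 2014-10-17 11:28:00

The data here is made up, and the ranges are actually segments of a genome, so it is fine if it matches twice, as long as it finds a range! Thanks for pointing that out, though. – 2014-10-17 11:58:37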

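Because FILE2 is only read once, it is only compared against the first line of FILE1.

Subsequent lines of FILE1 are compared against a filehandle that is already at end-of-file, so the inner loop never runs again.

Store the lines from FILE1 in an array, and then compare each line in FILE2 against each array entry, as shown below: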

#!/usr/bin/perl 

use strict; 
use warnings; 

my $outputfile = "out.txt"; 

open FILE1, "file1.txt" 
    or die "cannot open > CC22: $!"; 

open FILE2, "file2.txt" 
    or die "cannot open > CDS: $!"; 

open(OUTPUTFILE, ">$outputfile") or die "Could not open output file: $! \n"; 
my @file1list =(); 

while (<FILE1>) {
    if (/^(\d+)/) {
        push @file1list, $1;
    }
}

while (<FILE2>) {
    if (/^(\d+)\s+(\d+)\s+(\S+)/) {
        my $CDS1 = $1;
        my $CDS2 = $2;
        my $CDS3 = $3;

        for my $CC22 (@file1list) {
            if ($CC22 > $CDS1 && $CC22 < $CDS2) {
                print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
            }
        }
    }
}

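(There are also some programming style issues, such as the upper-case variable names, but I have ignored these. This is a pretty good program for a beginner.)

Source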

2014-10-17 10:32:27 Vorsprung

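Works perfectly, thank you for the quick reply! I do not have enough reputation to upvote your answer, but I will come back once I have enough. Out of interest, is it advisable to avoid capital letters in variable names in order to prevent case-sensitivity errors? – 2014-10-17 11:37:54

Take a quick look at http://perldoc.perl.org/perlstyle.html. All-uppercase variable names are usually constants or internal variables used by perl itself. – Vorsprung 2014-10-17 13:01:41

I thought I could simplify things a bit by using split instead of regular expressions, but I think my code is actually longer and harder to read! In any case, keep in mind that split works great for this kind of problem: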

#!/usr/bin/perl
use strict;
use warnings;

# User config area
my $positions_file = 'input_positions.txt';
my $ranges_file = 'input_ranges.txt';
my $output_file = 'output_data.txt';

# Reading data
open my $positions_fh, "<", $positions_file or die "Cannot open $positions_file: $!";
open my $ranges_fh, "<", $ranges_file or die "Cannot open $ranges_file: $!";
chomp(my @positions = <$positions_fh>);
# Store the range data in an array containing hash tables 
my @range_data; 
# to be used like $range_data[0] = {start => $start, end => $end, dir => $dir} 
while (<$ranges_fh>) { 
    chomp; 
    my ($start, $end, $dir) = split; #splits $_ according to whitespace 
    push @range_data, { start => $start, end => $end, dir => $dir }; 
    #print "start: $start, end: $end, direction: $dir\n"; 
} #/while 
close $positions_fh; 
close $ranges_fh; 

# Data processing: 
open my $output_fh, ">", $output_file or die "Cannot open $output_file: $!";
#It feels like it should be more efficient to process one range at a time for all data points 
foreach my $range (@range_data) { #start one range at a time
    #each $range = $range_data[#] = { hash table }
    foreach my $position (@positions) { #check all positions
        if (($range->{start} <= $position) and ($position <= $range->{end})) {
            my $output_string = "$position " . $range->{start} . " " . $range->{end} . " " . $range->{dir} . "\n";
            print $output_fh $output_string;
        } #/if
    } #/foreach position
} #/foreach range

close $output_fh;

The code would probably run faster if the data processing were done inside the while loop, while the range data is being read.

Source
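A minimal sketch of that idea follows, assuming the sample data from the question is supplied through in-memory filehandles instead of real files. The names `$positions_text`, `$ranges_text`, and `@matches` are illustrative only. Each range is compared as soon as it is read, so no `@range_data` array is needed. The number of comparisons is unchanged, so any speed-up would come only from skipping the intermediate storage.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from the question, held in memory (assumption: real files
# would be opened by name instead).
my $positions_text = "99269\n550\n100\n126477\n1700\n";
my $ranges_text    = "517 1878 forward\n700 2500 forward\n2156 3289 forward\n"
                   . "99000 100000 forward\n22000 23000 backward\n";

# Load all positions once; they are reused for every range.
open my $positions_fh, '<', \$positions_text or die "Cannot open positions: $!";
chomp(my @positions = <$positions_fh>);
close $positions_fh;

open my $ranges_fh, '<', \$ranges_text or die "Cannot open ranges: $!";

my @matches;
while (my $line = <$ranges_fh>) {
    my ($start, $end, $dir) = split ' ', $line;
    next unless defined $dir;    # skip blank or malformed lines

    # Compare every stored position with the range that was just read.
    for my $position (@positions) {
        next unless $start <= $position && $position <= $end;
        push @matches, "$position $start $end $dir";
    }
}
close $ranges_fh;

print "$_\n" for @matches;
```

With this data, 1700 is reported twice because it falls inside both '517 1878' and '700 2500'. The bounds are inclusive, as in the code above.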

2014-10-17 16:59:45

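Your bug is because you are nesting the file processing, so your inner loop can only go through the contents of the file once and then gets stuck at eof.

The simplest solution is to just load the inner-loop file into memory first.

The following demonstrates this using more Modern Perl techniques: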
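To see the end-of-file behaviour in isolation, here is a minimal sketch. It assumes a two-line in-memory stand-in for the ranges file, and `count_lines` is a hypothetical helper written for this illustration. A second read loop over the same handle yields nothing until the handle is rewound with `seek`, which is an alternative fix and not the approach taken in this answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In-memory stand-in for the ranges file (two lines of the sample data).
my $ranges = "517 1878 forward\n700 2500 forward\n";
open my $fh, '<', \$ranges or die "Cannot open in-memory file: $!";

# Hypothetical helper: count how many lines a full read loop yields.
sub count_lines {
    my ($handle) = @_;
    my $count = 0;
    while (my $line = <$handle>) {
        $count++;
    }
    return $count;
}

my $first_pass  = count_lines($fh);    # reads every line
my $second_pass = count_lines($fh);    # handle is already at end-of-file

seek $fh, 0, 0 or die "Cannot rewind: $!";    # move back to the start
my $third_pass = count_lines($fh);     # the lines are readable again

print "first: $first_pass, second: $second_pass, after seek: $third_pass\n";
# prints "first: 2, second: 0, after seek: 2"
```

Rewinding inside the outer loop would re-read the ranges file once per position, which is why holding one of the files in memory is usually the simpler choice.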

#!/usr/bin/perl 
use strict; 
use warnings; 
use autodie; 

my $cc22file = "/Users/edwardtickle/Documents/CC22positions.txt"; 
my $cdsfile = "/Users/edwardtickle/Documents/CDSpositions.txt"; 
my $outfile = "/Users/edwardtickle/Documents/CC22CDS.txt"; 

my @ranges = do { 
    # open my $fh, '<', $cdsfile; # Using Fake Data instead below 
    open my $fh, '<', \ "517 1878 forward\n700 2500 forward\n2156 3289 forward\n99000 100000 forward\n22000 23000 backward\n"; 
    map {[split]} <$fh>; 
}; 

# open my $infh, '<', $cc22file; # Using Fake Data instead below 
open my $infh, '<', \ "99269\n550\n100\n126477\n1700\n"; 

# open my $outfh, '>', $outfile; # Using STDOUT instead below 
my $outfh = \*STDOUT; 

CC22: 
while (my $cc22 = <$infh>) {
    chomp $cc22;

    for my $cds (@ranges) {
        if ($cc22 > $cds->[0] && $cc22 < $cds->[1]) {
            print $outfh "$cc22 @$cds\n";
            next CC22;
        }
    }

    # warn "$cc22 No match found\n";
}

Outputs:

99269 99000 100000 forward 
550 517 1878 forward 
1700 517 1878 forward


Source

2014-10-17 23:23:27 Miller
