2014-03-07 37 views
0

我有一个文件表征的基因组区域,看起来像这样:提取重叠区域

chrom chromStart chromEnd PGB 
chr1 12874 28371 2 
chr1 15765 21765 1 
chr1 15795 28371 2 
chr1 18759 24759 1 
chr1 28370 34961 1 
chr3 233278 240325 1 
chr3 239279 440831 2 
chr3 356365 362365 1 

基本上PGB,其特征为它的染色体数目(CHROM)的基因组区域的类别,启动(chromStart)和结束( chromEnd)坐标。

我希望以折叠重叠区域,使得重叠PGB的区域= 1和2是在一个新的类别,PGB = 3输出端:

chrom chromStart chromEnd PGB 
chr1 12874 15764 2 
chr1 15765 24759 3 
chr1 24760 28369 2 
chr1 28370 28371 3 
chr1 28372 34961 1 
chr3 233278 239278 1 
chr3 239279 240325 3 
chr3 240326 356364 2 
chr3 356365 440831 3 

基本上我希望获得一个输出文件,其报告独特的地区。有两个标准。

首先,如果PGB(第4列)在行之间相同,则合并范围。例如。

chrom chromStart chromEnd PGB 
chr1 1 10 1 
chr1 5 15 1 

输出

chrom chromStart chromEnd PGB 
chr1 1 15 1 

其次,如果PGB是行之间不同,CHR(列1)是相同的,并且范围重叠(COL2和3)中,报告重叠范围为PGB = 3作为以及各个类别独有的范围。

例如。

chrom chromStart chromEnd PGB 
chr1 30 100 1 
chr1 50 150 2 

输出

chrom chromStart chromEnd PGB 
chr1 30 49 1 
chr1 50 100 3 
chr1 101 150 2 

我希望能说明问题更好。

+0

到目前为止你有尝试过什么吗? – chilemagic

+0

我对perl/unix非常陌生,所以我在excel上手动执行。不幸的是,我有60000多行,所以我希望能有更快的选择。 – user3222627

+0

@ user3222627你需要多解释一下你如何得到你想要的结果。 –

回答

1

我已经创建了一个脚本,我相信可以实现此目标。

use List::Util qw(max); 

use strict; 
use warnings; 

# Skip Header row 
<DATA>; 

# With only 60k rows, going to just slurp all the data 
my %chroms; 
while (<DATA>) { 
    chomp; 
    my ($chrom, $start, $end, $pgb) = split ' '; 

    # Basic Data Validation. 
    warn "chromStart is NaN, '$start'" if $start =~ /\D/; 
    warn "chromEnd is NaN, '$end'" if $end =~ /\D/; 
    warn "range not ordered, '$start' to '$end'" if $start > $end; 
    warn "PGB only can be 1 or 2, '$pgb'" if ($pgb ne '1' && $pgb ne '2'); 

    push @{$chroms{$chrom}{$pgb}}, [$start, $end]; 
} 

print "chrom chromStart chromEnd PGB\n"; 

# Process each Chrom 
for my $chrom (sort keys %chroms) { 
    for my $pgb (sort keys %{$chroms{$chrom}}) { 
     my $ranges = $chroms{$chrom}{$pgb}; 

     # Sort Ranges 
     @$ranges = sort {$a->[0] <=> $b->[0] || $a->[1] <=> $b->[1]} @$ranges; 

     # Combine overlapping and continguous ranges. 
     # - Note because we're dealing with integer ranges, 1-4 & 5-8 are contiguous 
     for my $i (0..$#$ranges-1) { 
      if ($ranges->[$i][1] >= $ranges->[$i+1][0] - 1) { 
       $ranges->[$i+1][0] = $ranges->[$i][0]; 
       $ranges->[$i+1][1] = max($ranges->[$i][1], $ranges->[$i+1][1]); 
       $ranges->[$i] = undef; 
      } 
     } 
     @$ranges = grep {$_} @$ranges; 
    } 

    # Create pgb=3 for overlaps. 
    # - Save old ranges into aliases, and then start fresh 
    my $pgb1array = $chroms{$chrom}{1}; 
    my $pgb2array = $chroms{$chrom}{2}; 
    my @ranges; 

    # Always working on the first range in each array, until one of the arrays is empty 
    while (@$pgb1array && @$pgb2array) { 
     # Aliases to first element 
     my $pgb1 = $pgb1array->[0]; 
     my $pgb2 = $pgb2array->[0]; 

     # PGB1 < PGB2 
     if ($pgb1->[1] < $pgb2->[0]) { 
      push @ranges, [@{shift @$pgb1array}, 1] 

     # PGB2 < PGB1 
     } elsif ($pgb2->[1] < $pgb1->[0]) { 
      push @ranges, [@{shift @$pgb2array}, 2] 

     # There's overlap for all rest 
     } else { 
      # PGB1start < PGB2start 
      if ($pgb1->[0] < $pgb2->[0]) { 
       push @ranges, [$pgb1->[0], $pgb2->[0]-1, 1]; 
       $pgb1->[0] = $pgb2->[0]; 

      # PGB2start < PGB1start 
      } elsif ($pgb2->[0] < $pgb1->[0]) { 
       push @ranges, [$pgb2->[0], $pgb1->[0]-1, 2]; 
       $pgb2->[0] = $pgb1->[0]; 
      } 
      # (Starts are equal now) 

      # PGB1end < PGB2end 
      if ($pgb1->[1] < $pgb2->[1]) { 
       $pgb2->[0] = $pgb1->[1] + 1; 
       push @ranges, [@{shift @$pgb1array}, 3]; 

      # PGB2end < PGB1end 
      } elsif ($pgb2->[1] < $pgb1->[1]) { 
       $pgb1->[0] = $pgb2->[1] + 1; 
       push @ranges, [@{shift @$pgb2array}, 3]; 

      # PGB2end = PGB1end 
      } else { 
       push @ranges, [@$pgb1, 3]; 
       shift @$pgb1array; 
       shift @$pgb2array; 
      } 
     } 
    } 

    # Append whichever is left over 
    push @ranges, map {$_->[2] = 1; $_} @$pgb1array; 
    push @ranges, map {$_->[2] = 2; $_} @$pgb2array; 

    printf "%-8s %-11s %-11s %s\n", $chrom, @$_ for (@ranges); 
} 

1; 

__DATA__ 
chrom chromStart chromEnd PGB 
chr1 12871 12873 2 
chr1 12874 28371 2 
chr1 15765 21765 1 
chr1 15795 28371 2 
chr1 18759 24759 1 
chr1 28370 34961 1 
chr3 233278 240325 1 
chr3 239279 440831 2 
chr3 356365 362365 1 

而结果:

chrom chromStart chromEnd PGB 
chr1  12871  15764  2 
chr1  15765  24759  3 
chr1  24760  28369  2 
chr1  28370  28371  3 
chr1  28372  34961  1 
chr3  233278  239278  1 
chr3  239279  240325  3 
chr3  240326  356364  2 
chr3  356365  362365  3 
chr3  362366  440831  2 

注:我创造了这个主要是为锻炼自己,但如果你以任何方式使用它,请让我知道。

+0

为连续范围添加了小修正。应该加入1-4和5-8,因为我们只处理整数范围。 – Miller