2012-07-11 53 views
1

我是新的perl只是尝试与小杂乱的代码。我怎样才能合并和处理多个行从一个文件使用Perl产生一个报告

猫input1.txt

##gff-version 2 
    ##source-version geneious 5.6.4 
    Xm_ABL1 Geneious CDS 1 168 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
    Xm_ABL1 Geneious CDS 169 334 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
    Xm_ABL1 Geneious CDS 335 628 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
    Xm_ABL1 Geneious CDS 629 901 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
    Xm_ABL1 Geneious CDS 902 985 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
    Xm_ABL1 Geneious CDS 986 1165 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
    Xm_ABL1 Geneious CDS 1166 1350 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
    Xm_ABL1 Geneious CDS 1351 1504 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
    Xm_ABL1 Geneious BLAST Hit 169 334 . + . 
    Xm_ABL1 Geneious extracted region 1 168 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="351297 -> 351464" 
    Xm_ABL1 Geneious extracted region 169 334 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="371785 -> 371950" 
    Xm_ABL1 Geneious extracted region 335 628 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="372554 -> 372847" 
    Xm_ABL1 Geneious extracted region 629 901 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="374760 -> 375032" 
    Xm_ABL1 Geneious extracted region 902 985 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="375230 -> 375313" 
    Xm_ABL1 Geneious extracted region 986 1165 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="375992 -> 376171" 
    Xm_ABL1 Geneious extracted region 1166 1350 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="376575 -> 376759" 
    Xm_ABL1 Geneious extracted region 1351 1504 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="376914 -> 377067" 

如果输入文件方含( - >)向前arrow.I想像 输出,如果($阵列[7] =〜/.*间隔= \“\ d + - > \ d + \“$/gm){$ array [5] =”+“; }

猫output1.txt

gi_371443098_gb_JH556762.1 gene 351297 377067 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 351297 351464 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 371785 371950 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 372554 372847 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 374760 375032 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 375230 375313 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 375992 376171 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 376575 376759 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 376914 377067 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
### 

猫output1.txt 如果输入文件方含(< - )反向箭头。 if($ array [7] =〜/.* interval = \“\ d + < - \ d + \”$/gm){$ array [5] =“ - ”; }

gi_371443098_gb_JH556762.1 gene 351297 377067 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 351297 351464 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 371785 371950 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 372554 372847 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 374760 375032 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 375230 375313 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 375992 376171 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 376575 376759 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
gi_371443098_gb_JH556762.1 CDS 376914 377067 . - . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4 
### 

我已经尝试了一些小杂乱的代码,因为我是初学者。

#usr/bin/perl; 
use strict; 
open(FH,"$ARGV[0]"); 

while(<FH>){ 
chomp $_; 
my @array=split("\t"); 
my $key="$array[2]-$array[0]-$array[1]-$array[2]-$array[3]"; 
if($array[1] eq "CDS"){ 
$cds_cnt{$key}++; 
$cds{$key}="$array[4]\t$array[5]\t$array[6]\t$array[7]"; 
    } 
if($array[1] eq "extracted region"){ 
    (my $pos1,my $pos2)=($array[7]=~/.*interval=\"(\d+) -> (\d+)\"$/gm); 
$extract_cnt{$key}++; 
$extract{$key}="$pos1\t$pos2"; 
      } 
} 

foreach $i ( sort {$a<=>$b} keys %cds){ 
my $a=$i; #print "$i\n"; 
$a=~s/CDS/extracted region/g; 
if($cds_cnt{$i} == $extract_cnt{$a}){ 
#print "$i\t$cds{$i}\n$a\t$extract{$a}\n"; 
my @array=split /\-/,$i; 
my @pos=split "\t",$extract{$a}; 
print "$array[1]\t$array[2]\t$pos[0]\t$pos[1]\t$cds{$i}\n"; 
    } 
    } 
print "###"; 

更新

我需要在我的代码修改什么

1.To从提取的区域(即阵列[7] =/GI的行获得价值| 371443098 | GB | JH556762.1 | /)它可以是任何值,为其添加下划线(即gi_371443098_gb_JH556762.1)并在output1.txt中的数组[0]中打印,如图所示。

2.添加新行作为第一行打印时(gi_371443098_gb_JH556762.1基因),第3列中得到起始CDS(即351297)的值,并获得在第4栏(即377067)结束CDS的值,并打印在第一行如ouput1.txt所示

3.如果/提取的区域/块的所有行为.egExtracted interval =“351297 - > 351464”(即向前箭头)打印数组[5]为“+”符号包括输出中的基因头。如果例如提取间隔=“351297 <-351464”(反向箭头)将阵列[5]打印为包括输出中基因标题的“ - ”符号。

+1

随时用pragma严格! – 2012-07-11 22:33:39

回答

2

它看起来像你想达到什么是从标线合并的细节CDS与匹配的行标提取的区域,然后用领先的总结头打印合并结果的基础上,一些最小值和最大值,按分组,名称。那是对的吗?

我打算假设你所说的$ array [0](Xm_ABL1 Geneious)和$ array [2](169,335等)就足以将它们结合在一起,但在你的例子中这不是很清楚。

你的第一个问题只是一个正则表达式,我认为你有一般的问题。我认为问题在于如何捕捉数据。

要做第二件事你问,在你的第一遍捕获喜和LO值,并将它们存储。

我没打算写一个完整的解决方案,但在这里它是...

use strict; 
use warnings; 

my $metadata = {}; # hashref to store CDS info in.. 
my $group = {}; # hashref to store summary/detail in.. 
my $arrow = { "->" => '+', "<-" => '-' }; # decode arrow to pos/neg 

open(FH,"$ARGV[0]"); 
while(<FH>){ 
    chomp; 
    next if /^#/; 
    my @array=split("\t"); 
    my $key = join(":", $array[0], $array[2]); 
    if ($array[1] =~ /CDS/){ 
     $metadata->{$key} = $array[7]; 
    } 
    if ($array[1] =~ /extracted region/){ 
     #assert CDS already processed.. 
     die "No CDS record for $key!\n" unless $metadata->{$key}; 
     (my $label = $array[7]) =~ s/.*region from (.*)\|;.*/$1/; 
     $label =~ s/\|/_/g; 
     $group->{$label} ||= { #seed summary if not exists 
       pos1 => 1e10, 
       pos2 => 0, 
       metadata => $metadata->{$key}, 
       sequences => [], 
     }; 
     (my $pos1, my $arr, my $pos2) = ($array[7]=~/.*interval=\"(\d+) (<?->?) (\d+)\"$/gm); 
     # capture hi/lo values for group 
     $group->{$label}->{pos1} = $pos1 if $pos1 < $group->{$label}->{pos1}; 
     $group->{$label}->{pos2} = $pos2 if $pos2 > $group->{$label}->{pos2}; 
     # push this sequence onto the group's array 
     push(@{ $group->{$label}->{sequences} }, [ $pos1, $pos2, $arrow->{$arr} ]); 
    } 
} 
for my $gene (sort keys %{ $group }){ 
    #write out header 
    printf "%s\t%s\t%d\t%d\t.\t%s\t.\t%s\n", 
     $gene, 'gene', 
     $group->{$gene}->{pos1}, $group->{$gene}->{pos2}, 
     $group->{$gene}->{sequences}->[0]->[2], 
     $group->{$gene}->{metadata}; 
    foreach my $sequence (@{ $group->{$gene}->{sequences} }){ 
     # write out details 
     printf "%s\t%s\t%d\t%d\t.\t%s\t.\t%s\n", 
      $gene, 'CDS', 
      $sequence->[0], $sequence->[1], $sequence->[2], 
      $group->{$gene}->{metadata}; 
    } 
} 
print "###\n"; 

我希望这是足够的评论是有道理的。像这样书写,如果你必须在六个月过去之后回来修改它,那么这将是一个lot更容易维护。

UPDATE

我修改这个代码第四评论如下。 $ array [7] regexp块现在捕获$ array [5]的值并将其存储在序列arrayref中。

更新#2

  • 注意使用$箭头hashref到-><-解码到您所需要的符号。 (第6行)
  • 基因头显示+或 - 基于第一个序列的第三个字段中的值。有一个假设,即一个基因的所有序列都有相同的方向。 (从末11线)

我认为我们已经偏离过将q &一条公路,并为自由软件发展局现。我写的不是复杂的代码,它有评论和结构。现在是您来对付它的逻辑的时候了。并且提出我的答案。

+0

我使用的是perl 5版本,所以添加到代码中:no warnings'uninitialized'; – jack79 2012-07-12 14:40:31

+0

它是用perl5来编写的。当我用示例数据运行它时,我没有收到任何警告,因此可能全部数据的不一致性都不明显。我会建议你应该“治愈原因,而不是治疗效果”。修改代码以允许仅在必要时使用未初始化的值,方法是使用类似于“if(($ possible_null_variable || 0)> 10)”的方法。另外,如果我的答案解决了您的问题,那么礼仪就是将其标记为“正确”答案和/或提升它,这样我就可以获得功劳。 – RET 2012-07-13 02:28:58

+0

只是为了强化之前的评论,'没有警告'未初始化';'是一个非常糟糕的主意。在写perl近20年的时间里,我只用了很多次这个编译指示,并且只用了代码块或子代的本地范围,从来没有作为全局开关。该计划告诉你“这里发生了一些意想不到的事情,或者你没有允许的事情”。所以注意它!如果汽车仪表盘上的油灯亮起,您会检查机油,还是只将灯泡取出?这是一样的原则。 – RET 2012-07-13 02:48:49

相关问题