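Modify a Perl script to use less memory and run faster

I have written a script that compares multiple files and reports how many times each paragraph occurs in each file. The script works fine on smaller files, but when it is applied to large files the program stops producing output. I need some help modifying the script so that it runs on all files, even very large ones. My script: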

    #!/usr/bin/env perl
    use strict;
    use warnings;
    no warnings qw(numeric);
    my %seen;
    $/ = "";
    while (<>) {
        chomp;
        my ($key, $value) = split ('\t', $_);
        my @lines = split /\n/, $key;
        my $key1 = $lines[1];
        $seen{$key1} //= [ $key ];
        push (@{$seen{$key1}}, $value);
    }
    my $tot;
    my $file_count = @ARGV;
    while (my ($key1, $aref) = each %seen) {
        $tot = 0;
        for my $val (@{ $aref }) {
            $tot += $val;
        }
        if (@{ $aref } >= $file_count) {
            print join "\t", @{ $aref };
            print "\tcount:". $tot."\n\n";
        }
    }

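I am providing sample files to make the situation clearer. data1.txt and data2.txt contain samples of the data I am working with. I need to sum the counts across all files for every read whose second line (the sequence) matches, i.e. for these two files the output should look like output.txt: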
**data1.txt**
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
=CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt
**data2.txt**
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE 1 :data2.txt
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE 2 :data2.txt
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE 2 :data2.txt
**output.txt**
@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
=CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt 1 :data2.txt count:2
@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt 2 :data2.txt count:5
@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt 2 :data2.txt count:4
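
The requirement illustrated by these sample files could be sketched roughly as follows. This is a minimal sketch rather than the asker's script, and it rests on assumptions: each record is a blank-line-separated paragraph, the four-line read is followed by a tab and then a value such as `3 :data1.txt`, and a sequence occurs at most once per file. The sub names `parse_record` and `summarize` are hypothetical. Note that `<>` shifts file names off `@ARGV` as it opens them, so the sketch takes the file count before any reading. Memory use still grows with the number of distinct sequences.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Split one paragraph into (record, sequence, count, label).
# Assumed layout: "4-line record<TAB>count :file".
sub parse_record {
    my ($para) = @_;
    $para =~ s/\n+\z//;                          # drop trailing newlines
    my ($record, $value) = split /\t/, $para, 2;
    return unless defined $value;
    my $seq = (split /\n/, $record)[1];          # second line is the sequence
    return unless defined $seq;
    my ($count) = $value =~ /^\s*(\d+)/;         # leading integer only
    return unless defined $count;
    return ($record, $seq, $count, $value);
}

# Read each file in paragraph mode and print, to the handle $out, the
# sequences seen in every file together with their summed count.
sub summarize {
    my ($out, @files) = @_;
    my $file_count = @files;                     # taken before reading
    my %seen;                # seq => [ output line so far, files seen, total ]
    for my $file (@files) {
        open my $in, '<', $file or die "Cannot open $file: $!";
        local $/ = "";                           # paragraph mode
        while (my $para = <$in>) {
            my ($record, $seq, $count, $label) = parse_record($para) or next;
            my $entry = $seen{$seq} //= [ $record, 0, 0 ];
            $entry->[0] .= "\t$label";           # append "count :file"
            $entry->[1]++;                       # one more file has this read
            $entry->[2] += $count;               # running total
        }
        close $in;
    }
    while (my ($seq, $entry) = each %seen) {
        next if $entry->[1] < $file_count;       # not present in every file
        print {$out} "$entry->[0]\tcount:$entry->[2]\n\n";
    }
    return;
}

summarize(\*STDOUT, @ARGV) if @ARGV;
```

Saved under a hypothetical name such as `sum_reads.pl`, it would be run as `perl sum_reads.pl data1.txt data2.txt > output.txt`. Output order follows `each` and is therefore not sorted.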
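I tried tying my hash to a file but could not understand the concept. If someone could explain the solution with a short example, that would be very helpful. Any help would be appreciated.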
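
As an illustration of what tying a hash to a file means, here is a minimal sketch under stated assumptions. It uses `SDBM_File`, which ships with Perl, so the hash contents live on disk rather than in RAM. Because SDBM limits key plus value to roughly 1008 bytes, the sketch stores only a short string `"files_seen total"` per sequence and prints sequence and total, not the full record; `DB_File` or `DBM::Deep` would be the candidates for larger values. The names `open_db`, `add_count` and `counts.db` are hypothetical, and the input layout (record, tab, count) is assumed from the samples. Expect it to be slower than an in-memory hash.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl;        # provides O_RDWR and O_CREAT
use SDBM_File;

# Tie a hash to files on disk and return a reference to it.
# SDBM stores the data in "$path.pag" and "$path.dir".
sub open_db {
    my ($path) = @_;
    tie my %db, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $path: $!";
    return \%db;
}

# Record one observation of $seq. The stored value is the string
# "files_seen total", kept short because of SDBM's size limit.
sub add_count {
    my ($db, $seq, $count) = @_;
    my ($files, $total) = split ' ', ($db->{$seq} // '0 0');
    $db->{$seq} = join ' ', $files + 1, $total + $count;
    return;
}

if (@ARGV) {
    my $file_count = @ARGV;              # before <> empties @ARGV
    my $db = open_db('counts.db');       # hypothetical name; delete old
                                         # counts.db.* files before a rerun
    local $/ = "";                       # paragraph mode
    while (my $para = <>) {
        my ($record, $value) = split /\t/, $para, 2;
        next unless defined $value && $value =~ /^\s*(\d+)/;
        my $count = $1;
        my $seq = (split /\n/, $record)[1];
        next unless defined $seq;
        add_count($db, $seq, $count);
    }
    while (my ($seq, $v) = each %$db) {
        my ($files, $total) = split ' ', $v;
        print "$seq\tcount:$total\n" if $files >= $file_count;
    }
    untie %$db;
}
```

Every read and write of `$db->{...}` goes through the DBM file, which is what keeps memory low and also what costs time.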
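Usually you cannot make your program run faster and use less memory at the same time. [Tying a hash](https://metacpan.org/pod/DBM::Deep::Cookbook#PERFORMANCE) will have a significant impact on the speed of the hash.
Why do you have `no warnings 'numeric'`? – Borodin
@Borodin Because the code gives (among other warnings) `Argument "@NS500278 AGAT ..." isn't numeric in addition (+) at x line 24, <> chunk 2.` – PerlDuck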
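
The warning arises because the first element of each array in `%seen` is the record text itself (`$seen{$key1} //= [ $key ]`), which then gets added to `$tot` like a number. One way to keep warnings enabled is to extract the leading integer explicitly before summing. The following is a small sketch; the sub names `leading_count` and `sum_counts` are hypothetical, and the value format `3 :data1.txt` is assumed from the sample files.

```perl
use strict;
use warnings;

# Return the leading unsigned integer of a value such as "3 :data1.txt",
# or 0 when the string does not start with digits (e.g. a record header).
sub leading_count {
    my ($value) = @_;
    return 0 unless defined $value;
    return $value =~ /^\s*(\d+)/ ? $1 : 0;
}

# Sum the counts in a list of mixed strings without numeric warnings.
sub sum_counts {
    my $total = 0;
    $total += leading_count($_) for @_;
    return $total;
}

# The record text contributes 0, the two values contribute 3 and 2.
print sum_counts("\@NS500278\nAGAT", "3 :data1.txt", "2 :data2.txt"), "\n";
```

With this in place the `no warnings qw(numeric);` line would no longer be needed, so genuine numeric problems in the data would still be reported.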