Faster data lookup in Perl

I use code like this to look up data values:

sub get_data { 
$x =0 if($_[1] eq "A"); #get column number by name 
$data{'A'}= [2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12]; 
return $data{$_[0]}[$x]; 
}

The data is stored like this in the Perl file. I plan on no more than 100 columns. Then, to get a value, I call:

get_data(column, row);

Now I realize that this is a very slow way to look up data in a table. How can I make it faster? SQL?

Asked 2013-05-13 by user2376969

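If you want to do this in SQL, have a look at the DBI module. – 2013-05-13 10:16:07

You call the function with 'column, row', but then test whether the second argument is the 'column'? – choroba 2013-05-13 10:18:38

I have answered the questions from the comments by editing the post. mpapec, oops, fixed. – user2376969 2013-05-13 11:39:55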

Looking at your code on GitHub, your main problem is that your big hash of arrays is initialized every time the function is called.

Your current code:

my @atom; 
# {'name'}= radius, depth, solvation_parameter, volume, covalent_radius, hydrophobic, H_acceptor, MW 
$atom{'C'}= [2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12]; 
$atom{'A'}= [2.00000, 0.15000, -0.00052, 33.51030, 0.77, 0, 0, '']; 
$atom{'N'}= [1.75000, 0.16000, -0.00162, 22.44930, 0.75, 0, 1, 14]; 
$atom{'O'}= [1.60000, 0.20000, -0.00251, 17.15730, 0.73, 0, 1, 16]; 
...

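Time taken for your test case on the slow netbook I'm typing this on: 6m24.400s.

The most important thing to do is to move this out of the function, so that it is initialized only once, when the module is loaded.

Time taken after this simple change: 1m20.714s.

But since I'm making suggestions, you could write it more legibly: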

my %atom = (
    C => [ 2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12 ], 
    A => [ 2.00000, 0.15000, -0.00052, 33.51030, 0.77, 0, 0, '' ], 
    ... 
);

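Note that %atom is a hash in both cases, so your code does not do what you imagine: it declares a lexically scoped array @atom, which is never used, and then proceeds to fill an unrelated global variable %atom. (Also, do you really want an empty string for the MW of A? And what sort of atom is A, anyway?)

Secondly, your name-to-array-index mapping is also slow. Current code: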

#take correct value from data table 
$x = 0 if($_[1] eq "radius"); 
$x = 1 if($_[1] eq "depth"); 
$x = 2 if($_[1] eq "solvation_parameter"); 
$x = 3 if($_[1] eq "volume"); 
$x = 4 if($_[1] eq "covalent_radius"); 
$x = 5 if($_[1] eq "hydrophobic"); 
$x = 6 if($_[1] eq "H_acceptor"); 
$x = 7 if($_[1] eq "MW");

This is much better done as a hash (again, initialized outside the function):

my %index = (
    radius              => 0,
    depth               => 1,
    solvation_parameter => 2,
    volume              => 3,
    covalent_radius     => 4,
    hydrophobic         => 5,
    H_acceptor          => 6,
    MW                  => 7,
);

Or, if you wanted, you could be snazzy:

my %index = map { [qw[radius depth solvation_parameter volume 
         covalent_radius hydrophobic H_acceptor MW 
        ]]->[$_] => $_ } 0..7;

Either way, the code inside the function is then simply:

$x = $index{$_[1]};

Time now: about 1m13s.
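Putting the two changes together, a minimal sketch of the lookup might look like the following. It assumes only the C and N rows quoted earlier (the remaining rows are elided in the source), and the argument checks with die are an addition for illustration, not part of the original code.

```perl
use strict;
use warnings;

# Both tables live at file scope, so they are built once when the
# file is loaded rather than on every call.
my %atom = (
    C => [ 2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12 ],
    N => [ 1.75000, 0.16000, -0.00162, 22.44930, 0.75, 0, 1, 14 ],
);

my %index = (
    radius => 0, depth => 1, solvation_parameter => 2, volume => 3,
    covalent_radius => 4, hydrophobic => 5, H_acceptor => 6, MW => 7,
);

sub get_atom_parameter {
    my ($type, $name) = @_;
    # Illustrative checks (not in the original): fail loudly on bad keys.
    my $row = $atom{$type} or die "unknown atom type '$type'";
    my $col = $index{$name};
    die "unknown parameter '$name'" unless defined $col;
    return $row->[$col];
}

print get_atom_parameter('C', 'radius'), "\n";
print get_atom_parameter('N', 'MW'), "\n";
```

Each call is now two hash lookups and one array access, with no per-call initialization.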

Another approach is to define your field numbers as constants.

use constant { RADIUS=>0, DEPTH=>1, ... };

Then the code in the function is

$x = $_[1];

and you then need to call the function using the constants instead of strings (constants are uppercase by convention):

get_atom_parameter('C', RADIUS);

I haven't tried this.

But stepping back a bit and looking at how you use this function, you call get_atom_parameter twice in a loop
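As a rough sketch of the constants approach: the hash-reference form of use constant defines one constant per key, whereas a flat list such as `use constant RADIUS=>0, DEPTH=>1` would define a single list constant named RADIUS. Only three of the fields are defined here, with index values taken from the %index table above, and only the C row is included.

```perl
use strict;
use warnings;

# Hash-reference form: one constant per key.
use constant {
    RADIUS => 0,
    DEPTH  => 1,
    MW     => 7,
};

my %atom = (
    C => [ 2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12 ],
);

sub get_atom_parameter {
    my ($type, $field) = @_;
    # $field is already a numeric index, so no name lookup is needed.
    return $atom{$type}[$field];
}

print get_atom_parameter('C', RADIUS), "\n";
print get_atom_parameter('C', MW), "\n";
```

The trade-off is that a misspelled constant becomes a compile-time error under strict, while a misspelled string key only shows up at run time.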

while($ligand_atom[$x]{'atom_type'}[0]) { 
print STDERR $ligand_atom[$x]{'atom_type'}[0]; 
$y=0; 
while($protein_atom[$y]) { 
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y])) 
- get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius')
- get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius'); 
$y++; 
} 
$x++; 
print STDERR "."; 
}

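each time retrieving the radius. But within the inner loop, one of the atoms stays constant. So by hoisting that call to get_atom_parameter out of the inner loop, you almost halve the number of calls: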

while($ligand_atom[$x]{'atom_type'}[0]) {          
print STDERR $ligand_atom[$x]{'atom_type'}[0];         
$y=0;                   
my $lig_radius = get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');                

while($protein_atom[$y]) {              
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))   
- $lig_radius 
- get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius'); 
$y++; 
} 
$x++; 
print STDERR "."; 
}

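But there's more. In your test case the ligand has 35 atoms and the protein has 4128 atoms. This means your initial code made 4128 * 35 * 2 = 288960 calls to get_atom_parameter. Now it makes only 4128 * 35 + 35 = 144515 calls, but it is easy to build a couple of arrays holding the radii up front, so that it makes only 4128 + 35 = 4163 calls: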

my $protein_size = $#protein_atom; 
my $ligand_size; 
{                    
    my $x=0;                  
    $x++ while($ligand_atom[$x]{'atom_type'}[0]);         
    $ligand_size = $x-1;               
} 
#print STDERR "protein_size = $protein_size, ligand_size = $ligand_size\n"; 
my @protein_radius; 
for my $y (0..$protein_size) { 
    $protein_radius[$y] = get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius'); 
}                    

my @lig_radius; 
for my $x (0..$ligand_size) { 
    $lig_radius[$x] = get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius'); 
}                    

for my $x (0..$ligand_size) { 
print STDERR $ligand_atom[$x]{'atom_type'}[0]; 
my $lig_radius = $lig_radius[$x]; 
for my $y (0..$protein_size) { 
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y])) 
- $lig_radius 
- $protein_radius[$y] 
} 
print STDERR "."; 
}
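The call counts quoted above can be checked with a small counting sketch. The lookup function here is a hypothetical stand-in for get_atom_parameter that only increments a counter; the sizes 35 and 4128 are the ones from the test case.

```perl
use strict;
use warnings;

my ($ligand_atoms, $protein_atoms) = (35, 4128);   # sizes from the test case
my $calls = 0;
sub lookup { $calls++; return 2.0 }   # hypothetical stand-in for get_atom_parameter

# 1. Original: two lookups for every ligand/protein pair.
$calls = 0;
for my $x (1 .. $ligand_atoms) {
    for my $y (1 .. $protein_atoms) { lookup(); lookup(); }
}
my $naive = $calls;

# 2. Ligand lookup hoisted out of the inner loop.
$calls = 0;
for my $x (1 .. $ligand_atoms) {
    lookup();
    for my $y (1 .. $protein_atoms) { lookup(); }
}
my $hoisted = $calls;

# 3. All radii precomputed: one lookup per atom.
$calls = 0;
lookup() for 1 .. $protein_atoms;
lookup() for 1 .. $ligand_atoms;
my $precomputed = $calls;

print "$naive $hoisted $precomputed\n";   # prints "288960 144515 4163"
```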

Finally, the call to distance_sqared [sic]:

#distance between atoms 
sub distance_sqared { 
my $dxs = ($_[0]{'x'}-$_[1]{'x'})**2; 
my $dys = ($_[0]{'y'}-$_[1]{'y'})**2; 
my $dzs = ($_[0]{'z'}-$_[1]{'z'})**2; 
return $dxs+$dys+$dzs; 
}

This function can usefully be replaced with the following, which uses multiplication instead of **.

sub distance_sqared {              
my $dxs = ($_[0]{'x'}-$_[1]{'x'});           
my $dys = ($_[0]{'y'}-$_[1]{'y'});           
my $dzs = ($_[0]{'z'}-$_[1]{'z'});           
return $dxs*$dxs+$dys*$dys+$dzs*$dzs;          
}

Time after all these modifications: about 0m53s.

More about **: elsewhere you declare
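To see whether the multiplication form helps on a particular machine, the two variants can be compared with the core Benchmark module. This is only a sketch: the names dist_pow and dist_mul and the coordinates are made up for illustration, and the relative speed will vary with the perl build and platform.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up coordinates, for illustration only.
my %p = (x => 1.5, y => -2.0, z => 0.25);
my %q = (x => 4.0, y => 1.0,  z => 3.75);

# Squared distance using the ** operator.
sub dist_pow {
    return ($_[0]{x} - $_[1]{x})**2
         + ($_[0]{y} - $_[1]{y})**2
         + ($_[0]{z} - $_[1]{z})**2;
}

# Squared distance using plain multiplication.
sub dist_mul {
    my $dx = $_[0]{x} - $_[1]{x};
    my $dy = $_[0]{y} - $_[1]{y};
    my $dz = $_[0]{z} - $_[1]{z};
    return $dx*$dx + $dy*$dy + $dz*$dz;
}

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    pow => sub { dist_pow(\%p, \%q) },
    mul => sub { dist_mul(\%p, \%q) },
});
```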

use constant e_math => 2.71828;

and then use it like this:

$Gauss1 += e_math ** (-(($d[$x][$y]*2)**2));

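The built-in function exp() calculates this very thing. (In fact, ** is commonly implemented as x**y = exp(log(x)*y), so each time you do this you are performing an unnecessary logarithm whose result is only slightly less than 1, because your constant is accurate to only six significant figures.) This change will alter the output very slightly. Again, **2 should be replaced by multiplication.

Anyway, this answer is probably long enough for now, and the calculation of d[] is no longer the bottleneck it was.

Summary: hoist constant values out of loops and functions! Calculating the same thing over and over is no fun at all.

Using any kind of database for this will not help your performance in the slightest. One thing that might help you, though, is Inline::C. Perl is not really built for this kind of intensive computation, and Inline::C would let you easily move the performance-critical bits into C while keeping your existing I/O in Perl.

I would be willing to take a shot at a partial C port. How stable is this code, and how fast do you want it to be? :)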
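A small sketch of the difference, using a hypothetical distance value of 0.8 in place of $d[$x][$y]. The first expression is the form quoted above; the second uses exp() and multiplication.

```perl
use strict;
use warnings;
use constant e_math => 2.71828;

my $d = 0.8;        # hypothetical distance, for illustration only
my $t = $d * 2;

my $with_constant = e_math ** (-($t ** 2));   # form used in the quoted code
my $with_exp      = exp(-($t * $t));          # built-in exp, no ** at all

printf "constant: %.10f\n", $with_constant;
printf "exp():    %.10f\n", $with_exp;
printf "relative difference: %.2e\n",
    abs($with_constant - $with_exp) / $with_exp;
```

The two results differ only slightly, because 2.71828 is a truncated value of e.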

Answered 2013-05-14 23:13:09 by hexwab

Wow, big thx ;) The code is still very experimental for now. – user2376969 2013-05-22 09:23:44

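Putting this in a DB will make it easier to maintain, scale up, and so on. Using a DB can also save you a lot of RAM: it fetches and keeps in RAM only the result you asked for, instead of storing all the values.

As for speed, it depends. With a text file, reading all the values into RAM takes a long time, but once they are loaded, retrieving a value is very fast, faster than querying a DB.

So it depends on how your program is written and what it is used for. Do you read all the values once and then run 1000 queries? The text-file way is probably faster. Do you re-read all the values on every query (to make sure you have the latest values)? Then a DB would be faster. Do you run one query per day? Use a DB. And so on.

Answered 2013-05-13 16:23:37 by pirhac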
