2014-10-10 72 views
0

My current script runs through the data in a PDB file and stores it in arrays, and those arrays are then used for the rest of the script. The script works perfectly well on a small PDB file, but when I use a real PDB file I end up using all of my computer's memory on just one file. I have 2000 PDB files to run these calculations on. The script fails because of the large arrays.

Here is my latest script, with some notes.

The full script:

#!/usr/bin/perl 

use warnings; 
use strict; 

#my $inputfile = $ARGV[0]; 
#my $inputfile = '8ns_emb_alt_test.pdb'; 
my $inputfile = '8ns_emb_alt_101.pdb'; 

open(INPUTFILE, "<", $inputfile) or die $!; 

my @array = <INPUTFILE>; 

### Protein 
my $protein = 'PROT'; 
my @protx; 
my @proty; 
my @protz; 

for (my $line = 0; $line <= $#array; ++$line) { 
    if (($array[$line] =~ m/\s+$protein\s+/)) { 
     chomp $array[$line]; 
     my @splitline = (split /\s+/, $array[$line]); 
     push @protx, $splitline[5]; #This has 2083 x-coordinates 
     push @proty, $splitline[6]; #this has 2083 y-coordinates 
     push @protz, $splitline[7]; #this has 2083 z-coordinates 
    } 
} 

### Lipid 
my $lipid1 = 'POPS'; 
my $lipid2 = 'POPC'; 
my @lipidx; 
my @lipidy; 
my @lipidz; 

for (my $line = 0; $line <= $#array; ++$line) { 
    if (($array[$line] =~ m/\s+$lipid1\s+/) || ($array[$line] =~ m/\s+$lipid2\s+/)) { 
     chomp $array[$line]; 
     my @splitline = (split /\s+/, $array[$line]); 
     push @lipidx, $splitline[5]; #this has approximately 35,000 x coordinates 
     push @lipidy, $splitline[6]; #same as above for y 
     push @lipidz, $splitline[7]; #same as above for z 
    } 
} 

### Calculation 
my @deltaX = map { 
    my $diff = $_; 
    map { $diff - $_ } @lipidx 
} @protx; #so this has 2083*35000 data x-coordinates 
my @squared_deltaX = map { $_ * $_ } @deltaX; #this has all the x-coordinates squared from @deltaX 

my @deltaY = map { 
    my $diff = $_; 
    map { $diff - $_ } @lipidy 
} @proty; 
my @squared_deltaY = map { $_ * $_ } @deltaY; 

my @deltaZ = map { 
    my $diff = $_; 
    map { $diff - $_ } @lipidz 
} @protz; 
my @squared_deltaZ = map { $_ * $_ } @deltaZ; 

my @distance; 
for (my $ticker = 0; $ticker <= $#array; ++$ticker) { 
    my $distance_calc = sqrt(($squared_deltaX[$ticker] + $squared_deltaY[$ticker] + $squared_deltaZ[$ticker])); 
    push @distance, $distance_calc; 
} #this runs the final calculation and computes all the distances between the atoms 

### The Hunt 
my $limit = 5; 
my @DistU50; 
my @resid_tagger; 

for (my $tracker = 0; $tracker <= $#array; ++$tracker) { 
    my $dist = $distance[$tracker]; 
    if (($dist < $limit) && ($array[$tracker] =~ m/\s+$protein\s+/)) { 
     my @splitline = (split /\s+/, $array[$tracker]); 
     my $LT50 = $dist; 
     push @resid_tagger, $splitline[4]; #stores the selected residue ID number 
     push @DistU50,  $LT50;   #stores the values within the $limit 
    } 
} #this 'for' loop searches '@distance' for values under the limit and pushes them, along with the matching residue IDs, into the final arrays. 

### Le'Finali 
print "@resid_tagger = resid \n"; 
print "5 > @DistU50 \n"; 

close INPUTFILE; 

Someone in a friend's lab said I could store some of the data in files so that it takes up less memory. I think that's a good idea, but I'm not sure where it would be most effective to do that, or how many times I would need to. I used arrays because that's the best way I know how to do this.

If someone could show me an example of writing an array out to a file and then using the data from that file again, that would be very helpful. Otherwise, if anyone has ideas I can look up, things to try, or just suggestions, that would at least get me started somewhere.
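For reference, a minimal sketch of the kind of thing I'm asking about, using the core Storable module to spill an array to disk and read it back later (the in-memory PDB snippet here is just for illustration, and the file is read line by line instead of slurped into one big array):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Tiny stand-in for a PDB file.
my $pdb = <<'END';
ATOM 1 N GLU 1 -19.992 -2.816 36.359 0.00 0.00 PROT
ATOM 5 CA GLU 1 -19.344 -3.944 35.652 0.00 0.00 PROT
END

# Read line by line instead of slurping everything into @array first.
open my $fh, '<', \$pdb or die $!;
my @protx;
while (my $line = <$fh>) {
    next unless $line =~ /\s+PROT\s*$/;
    my @f = split ' ', $line;
    push @protx, $f[5];          # x coordinate
}
close $fh;

# Spill the array to a file, free the memory, then read it back later.
my ($tmpfh, $tmpfile) = tempfile();
close $tmpfh;
store \@protx, $tmpfile;
@protx = ();                     # memory released here
my $back = retrieve($tmpfile);   # data read back from the file
unlink $tmpfile;
print "x coordinates: @$back\n";
```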

+3

I'm not a chemist... I don't know exactly what you're doing. Personally, if I were working with large data sets... I'd use a database. (But I'm sure you may well need some functions you can't emulate in a db...) – 2014-10-10 16:35:08

+0

It just finds the distances between coordinates in the files. I don't really know what you mean about using a database; the problem is handling the arrays. I build some large arrays and use them in various ways throughout the script. I'm just trying to find some information on making my computer use less memory. – 2014-10-10 16:39:14

+1

Yes. Personally, I would parse the files and load them into tables in a database, since databases are designed to handle large amounts of data and computation. [Is this what we're talking about?](http://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html) – 2014-10-10 16:40:55

Answers

4

Edit.....

Folks... we really should have checked first: there are Perl modules that already exist for parsing and manipulating PDB data, and you may want to look at those.


OK... so Perl is not my first language, and I don't know exactly what your data looks like.

Edit: some odd lines showed up in my test data, so there are two sets of code here... one splits on whitespace, and the other uses expected/known column positions and lengths to pull out the values.

#!/usr/bin/perl 
use strict; 
use warnings; 

use DBI; 

my $db = 'test'; 
my $host = 'localhost'; 
my $user = 'root'; 
my $pass = ''; 

my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass) 
      or die "Connection Error: $DBI::errstr\n"; 

my $localpath = 'C:\path\to\folder\with\datums'; 

my @filenames = ('glucagon'); # i am using this as my table name, too 
my $colnum = 12; # number of columns in data, I assumed this was fixed 

my @placeholders; 
for (1..$colnum) { push @placeholders, '?'; } 
my $placeholders = join(',',@placeholders); # builds a string like: ?, ?, ?, ?, ... 
              # for our query that uses binding 
foreach my $file (@filenames) { 
    my $filename = "$localpath\\$file.txt"; 

    if (open(my $fh, '<', $filename)) { 

     # the null at the start of the insert is because my first column is an 
     # auto_increment primary key that will be generated on insert by the db 
     my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)"); 

     while(my $line = <$fh>) { 
     $line =~ s/\s+$//; # trim whitespace 

     if ($line ne q{}) { # if not totally blank 
      my @row = split(/ +/,$line); # split on whitespace 

      for my $index (1..$colnum) { 
       # placeholders are 1-based but @row from split is 0-based 
       $stmt->bind_param($index, $row[$index - 1]); 
      } 
      $stmt->execute(); 
     } 
     } 
     close $fh; 
    } 
    else { print "$file not opened\n"; } 
} 
-- I didn't know appropriate names for any of it 
create table glucagon (
    row_id int unsigned auto_increment primary key, 
    name varchar(10), 
    seq  int, 
    code1 varchar(5), 
    code2 varchar(5), 
    code3 varchar(5), 
    code4 int, 
    val1 decimal(10,2), 
    val2 decimal(10,2), 
    val3 decimal(10,2), 
    val4 decimal(10,2), 
    val5 decimal(10,2), 
    code5 varchar(5) 
) 

The following goes in C:\path\to\folder\with\datums\glucagon.txt:

ATOM 1058 N ARG A 141  -6.466 12.036 -10.348 7.00 19.11   N 
ATOM 1059 CA ARG A 141  -7.922 12.248 -10.253 6.00 26.80   C 
ATOM 1060 C ARG A 141  -8.119 13.499 -9.393 6.00 28.93   C 
ATOM 1061 O ARG A 141  -7.112 13.967 -8.853 8.00 28.68   O 
ATOM 1062 CB ARG A 141  -8.639 11.005 -9.687 6.00 24.11   C 
ATOM 1063 CG ARG A 141  -8.153 10.551 -8.308 6.00 19.20   C 
ATOM 1064 CD ARG A 141  -8.914 9.319 -7.796 6.00 21.53   C 
ATOM 1065 NE ARG A 141  -8.517 9.076 -6.403 7.00 20.93   N 
ATOM 1066 CZ ARG A 141  -9.142 8.234 -5.593 6.00 23.56   C 
ATOM 1067 NH1 ARG A 141  -10.150 7.487 -6.019 7.00 19.04   N 
ATOM 1068 NH2 ARG A 141  -8.725 8.129 -4.343 7.00 25.11   N 
ATOM 1069 OXT ARG A 141  -9.233 14.024 -9.296 8.00 40.35   O 
TER 1070  ARG A 141 
HETATM 1071 FE HEM A 1  8.128 7.371 -15.022 24.00 16.74   FE 
HETATM 1072 CHA HEM A 1  8.617 7.879 -18.361 6.00 17.74   C 
HETATM 1073 CHB HEM A 1  10.356 10.005 -14.319 6.00 18.92   C 
HETATM 1074 CHC HEM A 1  8.307 6.456 -11.669 6.00 11.00   C 
HETATM 1075 CHD HEM A 1  6.928 4.145 -15.725 6.00 13.25   C 

And the end result is...

mysql> select * from glucagon; 
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+ 
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 | 
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+ 
|  1 | ATOM | 1058 | N  | ARG | A  | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N  | 
|  2 | ATOM | 1059 | CA | ARG | A  | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C  | 
|  3 | ATOM | 1060 | C  | ARG | A  | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C  | 
|  4 | ATOM | 1061 | O  | ARG | A  | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O  | 
|  5 | ATOM | 1062 | CB | ARG | A  | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C  | 
|  6 | ATOM | 1063 | CG | ARG | A  | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C  | 
|  7 | ATOM | 1064 | CD | ARG | A  | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C  | 
|  8 | ATOM | 1065 | NE | ARG | A  | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N  | 
|  9 | ATOM | 1066 | CZ | ARG | A  | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C  | 
|  10 | ATOM | 1067 | NH1 | ARG | A  | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N  | 
|  11 | ATOM | 1068 | NH2 | ARG | A  | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N  | 
|  12 | ATOM | 1069 | OXT | ARG | A  | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O  | 
|  13 | TER | 1070 | ARG | A  | 141 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 
|  14 | HETATM | 1071 | FE | HEM | A  |  1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE | 
|  15 | HETATM | 1072 | CHA | HEM | A  |  1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C  | 
|  16 | HETATM | 1073 | CHB | HEM | A  |  1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C  | 
|  17 | HETATM | 1074 | CHC | HEM | A  |  1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C  | 
|  18 | HETATM | 1075 | CHD | HEM | A  |  1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C  | 
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+ 
18 rows in set (0.00 sec) 

Oh... look... this row makes it messy... TER 1070 ARG A 141. I can fix that easily enough if you go my route, but I'm not going to keep updating this if you use the other answer/method.
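To make the "do the computation in the database" suggestion from the comments concrete: once the protein and lipid atoms are loaded into tables shaped like the one above, the distance filter from the original script could be pushed into a single query. This is only a sketch — the `prot` and `lipid` table names are hypothetical, and `val1`/`val2`/`val3` stand in for the x/y/z columns:

```sql
-- Hypothetical tables `prot` and `lipid`, loaded the same way as
-- `glucagon` above, with val1/val2/val3 holding the x/y/z coordinates.
SELECT p.code4 AS resid,
       SQRT(POW(p.val1 - l.val1, 2)
          + POW(p.val2 - l.val2, 2)
          + POW(p.val3 - l.val3, 2)) AS dist
FROM prot AS p
CROSS JOIN lipid AS l
HAVING dist < 5;
```

Note the cross join still visits every protein/lipid pair, but the database streams through them instead of materializing all of them in memory at once.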


OK... for the troublesome rows: I checked the starting position and length of each value in my test data set. I don't know whether those positions change when you load different files... so I set it up so they can be configured for each file you use.

#!/usr/bin/perl 
use strict; 
use warnings; 

use DBI; 

my $db = 'test'; 
my $host = 'localhost'; 
my $user = 'root'; 
my $pass = ''; 

my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass) 
      or die "Connection Error: $DBI::errstr\n"; 

my $localpath = 'C:\path\to\datums'; 

           # first num is starting pos, second is length 
my $fileinfo = { 'glucagon' => [[0,6], # 'name' 
           [7,4], # 'seq' 
           [12,4], # 'code1' 
           [17,3], # 'code2' 
           [21,1], # 'code3' 
           [23,3], # 'code4' 
           [27,12], # 'val1' 
           [39,7], # 'val2' 
           [47,7], # 'val3' 
           [55,5], # 'val4' 
           [61,5], # 'val5' 
           [69,10] # 'code5' 
           ] 
       # 'second_file' => [ [0,5], # col1 
       #      [6,5], # col2      
       #     ] 
       }; # i am using this as my table name, too 

foreach my $file (keys %$fileinfo) { 
    my $filename = "$localpath\\$file.txt"; 

    if (open(my $fh, '<', $filename)) { 
     my $colnum = scalar @{$fileinfo->{$file}}; 

     my @placeholders; 
     for (1..$colnum) { push @placeholders, '?'; } 
     my $placeholders = join(',',@placeholders); # builds a string like: ?, ?, ?, ?, ... 
                # for our query that uses binding 

     # the null at the start of the insert is because my first column is an 
     # auto_increment primary key that will be generated on insert by the db 
     my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)"); 

     while(my $line = <$fh>) { 
     $line =~ s/\s+$//; # trim trailing whitespace 
     if ($line ne q{}) { # if not totally blank 
      my @row; 
      my $index = 1; 
      # foreach value column position & length 
      foreach my $col (@{$fileinfo->{$file}}) { 
       my $value; 
       if ($col->[0] <= length($line)) { 
        $value = substr($line,$col->[0],$col->[1]); 
        $value =~ s/^\s+|\s+$//g; # trim trailing & leading whitespace 
        if ($value eq q{}) { undef $value; } # i like null values vs blank 
       } 
       $row[$index] = $value; 
       $index++; 
      } 

      for my $index (1..$colnum) { 
       $stmt->bind_param($index, $row[$index]); 
      } 
      $stmt->execute(); 
     } 
     } 
     close $fh; 
    } 
    else { print "$file not opened\n"; } 
} 

The new data:

mysql> select * from glucagon; 
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+ 
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 | 
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+ 
|  1 | ATOM | 1058 | N  | ARG | A  | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N  | 
|  2 | ATOM | 1059 | CA | ARG | A  | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C  | 
|  3 | ATOM | 1060 | C  | ARG | A  | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C  | 
|  4 | ATOM | 1061 | O  | ARG | A  | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O  | 
|  5 | ATOM | 1062 | CB | ARG | A  | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C  | 
|  6 | ATOM | 1063 | CG | ARG | A  | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C  | 
|  7 | ATOM | 1064 | CD | ARG | A  | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C  | 
|  8 | ATOM | 1065 | NE | ARG | A  | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N  | 
|  9 | ATOM | 1066 | CZ | ARG | A  | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C  | 
|  10 | ATOM | 1067 | NH1 | ARG | A  | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N  | 
|  11 | ATOM | 1068 | NH2 | ARG | A  | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N  | 
|  12 | ATOM | 1069 | OXT | ARG | A  | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O  | 
|  13 | TER | 1070 | NULL | ARG | A  | 141 | NULL | NULL | NULL | NULL | NULL | NULL | 
|  14 | HETATM | 1071 | FE | HEM | A  |  1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE | 
|  15 | HETATM | 1072 | CHA | HEM | A  |  1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C  | 
|  16 | HETATM | 1073 | CHB | HEM | A  |  1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C  | 
|  17 | HETATM | 1074 | CHC | HEM | A  |  1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C  | 
|  18 | HETATM | 1075 | CHD | HEM | A  |  1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C  | 
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+ 
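A side note on the fixed-width parsing: since PDB ATOM records have published column positions, Perl's built-in unpack can do the substr-and-trim work in one step. A small sketch — the column layout below is taken from the standard PDB format description, so double-check it against your own files:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# PDB ATOM columns (1-based): record name 1-6, serial 7-11, atom name
# 13-16, residue name 18-20, chain 22, residue number 23-26,
# x 31-38, y 39-46, z 47-54.
my $line = "ATOM   1058  N   ARG A 141      -6.466  12.036 -10.348";

my ($rec, $serial, $name, $res, $chain, $seq, $x, $y, $z) =
    unpack 'A6 A5 x A4 x A3 x A1 A4 x4 A8 A8 A8', $line;

# 'A' strips trailing spaces for us; strip the leading ones too.
s/^\s+// for $serial, $name, $seq, $x, $y, $z;
print "$rec $serial $name $res $chain $seq ($x, $y, $z)\n";
```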
5

You're trying to store 66 million results in arrays, which, as you've noticed, is both slow and memory-intensive. Perl arrays aren't good for massive calculations like this, but PDL is.

The core of your problem is computing the distances between sets of 3D coordinates. Let's first do this with a simple data set, just to prove we can:

Start   End 
--------- --------- 
(0, 0, 0) (1, 2, 3) 
(1, 1, 1) (1, 1, 1) 
(4, 5, 6) (7, 8, 9) 

We can represent this data set in PDL like this:

use PDL; 
        # x  # y  # z  
my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ]; 
my $end = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ]; 

We now have two sets of 3D coordinates. To compute the distances, first we subtract our start coordinates from our end coordinates:

my $diff = $end - $start; 
print $diff; 

This outputs:

[ 
[1 0 3] 
[2 0 3] 
[3 0 3] 
] 

where the differences in the x-coordinates are in the first row, the differences in the y-coordinates are in the second row, and the differences in the z-coordinates are in the third row.

Next we square the differences:

my $squared = $diff**2; 
print $squared; 

which gives us:

[ 
[1 0 9] 
[4 0 9] 
[9 0 9] 
] 

Finally, for each pair of points, we sum the squared differences and take the square root:

foreach my $i (0 .. $squared->dim(0) - 1) { 
    say sqrt sum $squared($i,:); 
} 

(There may be a better way to do this, but I haven't used PDL much.)

This prints:

3.74165738677394 
0 
5.19615242270663 

These are our distances.

Putting it all together:

use strict; 
use warnings; 
use 5.010; 

use PDL; 
use PDL::NiceSlice; 

my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ]; 
my $end = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ]; 

my $diff = $end - $start; 
my $squared = $diff**2; 

foreach my $i (0 .. $squared->dim(0) - 1) { 
    say sqrt sum $squared($i,:); 
} 

It takes ~35 seconds on my desktop to compute the distances between one million pairs of coordinates and write the results to a file. When I tried ten million pairs I ran out of memory, so you may have to split your data set into chunks.

Reading the data from files

Here is an example that reads the data from two files, using the sample input you included in your earlier question:

use strict; 
use warnings; 
use 5.010; 

use PDL; 
use PDL::IO::Misc; 
use PDL::NiceSlice; 

my $start_file = 'start.txt'; 
my $end_file = 'end.txt'; 

my $start = rcols $start_file, [ 5..7 ]; 
my $end = rcols $end_file, [ 5..7 ]; 

my $diff = $end - $start; 
my $squared = $diff**2; 

foreach my $i (0 .. $squared->dim(0) - 1) { 
    say sqrt sum $squared($i,:); 
} 

start.txt

ATOM  1 N GLU 1  -19.992 -2.816 36.359 0.00 0.00  PROT 
ATOM  2 HT1 GLU 1  -19.781 -1.880 35.958 0.00 0.00  PROT 
ATOM  3 HT2 GLU 1  -19.713 -2.740 37.358 0.00 0.00  PROT 
ATOM  4 HT3 GLU 1  -21.027 -2.910 36.393 0.00 0.00  PROT 
ATOM  5 CA GLU 1  -19.344 -3.944 35.652 0.00 0.00  PROT 
ATOM  6 HA GLU 1  -19.817 -4.852 35.998 0.00 0.00  PROT 
ATOM  7 CB GLU 1  -19.501 -3.795 34.119 0.00 0.00  PROT 

end.txt

ATOM 2084 N POPC 1  -44.763 27.962 20.983 0.00 0.00  MEM1 
ATOM 2085 C12 POPC 1  -46.144 27.379 20.551 0.00 0.00  MEM1 
ATOM 2086 C13 POPC 1  -44.923 28.611 22.367 0.00 0.00  MEM1 
ATOM 2087 C14 POPC 1  -43.713 26.889 21.099 0.00 0.00  MEM1 
ATOM 2088 C15 POPC 1  -44.302 29.004 20.059 0.00 0.00  MEM1 
ATOM 2089 H12A POPC 1  -46.939 28.110 20.555 0.00 0.00  MEM1 
ATOM 2090 H12B POPC 1  -46.486 26.769 21.374 0.00 0.00  MEM1 

Output:

42.3946824613654 
42.2903357636233 
42.9320321205507 
40.4541893133455 
44.1770768272415 
45.3936402704167 
42.7174829080553 

The rcols function comes from PDL::IO::Misc and can be used to read specific columns from a file into PDL objects (in this case, columns 5 through 7, zero-indexed).
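If installing PDL isn't an option, the memory problem in the original script can also be attacked in plain Perl by never materializing the 66-million-element arrays at all: compute each distance as you go and keep only the ones under the 5 Å cutoff. A rough sketch with made-up coordinates standing in for the parsed protein/lipid arrays:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $limit = 5;

# Made-up [x, y, z] triples standing in for @protx/@proty/@protz
# and @lipidx/@lipidy/@lipidz from the original script.
my @prot  = ( [0, 0, 0], [10, 10, 10] );
my @lipid = ( [1, 2, 2], [40, 40, 40], [9, 9, 9] );

my @close;   # only distances under $limit are ever stored
for my $p (@prot) {
    for my $l (@lipid) {
        my $d = sqrt( ($p->[0] - $l->[0])**2
                    + ($p->[1] - $l->[1])**2
                    + ($p->[2] - $l->[2])**2 );
        push @close, $d if $d < $limit;
    }
}
printf "%.3f\n", $_ for @close;
```

This is slower per pair than PDL, but its memory use stays constant no matter how many atoms are in the file, since nothing grows with the number of pairs except the (small) list of hits.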

+0

Isn't that an elegant solution.... – 2014-10-10 18:58:14