编辑.....
你们......我们应该先检查。您可能需要查看已存在的用于处理和操作PDB数据的perl模块。
好了...所以Perl是不是我的第一语言,我不知道究竟什么你的数据样子。
编辑:在我的测试数据中出现了奇怪的行。这里有两组代码......一个分割空白,另一个使用预期/已知列位置和长度来确定值。
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $db = 'test';
my $host = 'localhost';
my $user = 'root';
my $pass = '';
my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass)
or die "Connection Error: $DBI::errstr\n";
my $localpath = 'C:\path\to\folder\with\datums';
my @filenames = ('glucagon'); # i am using this as my table name, too
my $colnum = 12; # number of columns in data, I assumed this was fixed
my @placehoders;
for (1..$colnum) { push @placehoders, '?'; }
my $placeholders = join(',',@placehoders); # builds a string like: ?, ?, ?, ?, ...
# for our query that uses binding
foreach my $file (@filenames) {
my $filename = "$localpath\\$file.txt";
if (open(my $fh => $filename)) {
# the null at the start of the insert is because my first column is an
# auto_incrememnt primary key that will be generated on insert by the db
my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");
while(my $line = <$fh>) {
$line =~ s/\s+$//; # trim whitespace
if ($line ne q{}) { # if not totally blank
my @row = split(/ +/,$line); # split on whitespace
for my $index (1..$colnum) {
$stmt->bind_param($index, $row[$index]);
$index++;
}
$stmt->execute();
}
}
close $fh;
}
else { print "$file not opened\n"; }
}
-- i didn't know appropriate names for any of it
create table glucagon (
row_id int unsigned auto_increment primary key,
name varchar(10),
seq int,
code1 varchar(5),
code2 varchar(5),
code3 varchar(5),
code4 int,
val1 decimal(10,2),
val2 decimal(10,2),
val3 decimal(10,2),
val4 decimal(10,2),
val5 decimal(10,2),
code5 varchar(5)
)
以下的C:\path\to\folder\with\datums\glucagon.txt
ATOM 1058 N ARG A 141 -6.466 12.036 -10.348 7.00 19.11 N
ATOM 1059 CA ARG A 141 -7.922 12.248 -10.253 6.00 26.80 C
ATOM 1060 C ARG A 141 -8.119 13.499 -9.393 6.00 28.93 C
ATOM 1061 O ARG A 141 -7.112 13.967 -8.853 8.00 28.68 O
ATOM 1062 CB ARG A 141 -8.639 11.005 -9.687 6.00 24.11 C
ATOM 1063 CG ARG A 141 -8.153 10.551 -8.308 6.00 19.20 C
ATOM 1064 CD ARG A 141 -8.914 9.319 -7.796 6.00 21.53 C
ATOM 1065 NE ARG A 141 -8.517 9.076 -6.403 7.00 20.93 N
ATOM 1066 CZ ARG A 141 -9.142 8.234 -5.593 6.00 23.56 C
ATOM 1067 NH1 ARG A 141 -10.150 7.487 -6.019 7.00 19.04 N
ATOM 1068 NH2 ARG A 141 -8.725 8.129 -4.343 7.00 25.11 N
ATOM 1069 OXT ARG A 141 -9.233 14.024 -9.296 8.00 40.35 O
TER 1070 ARG A 141
HETATM 1071 FE HEM A 1 8.128 7.371 -15.022 24.00 16.74 FE
HETATM 1072 CHA HEM A 1 8.617 7.879 -18.361 6.00 17.74 C
HETATM 1073 CHB HEM A 1 10.356 10.005 -14.319 6.00 18.92 C
HETATM 1074 CHC HEM A 1 8.307 6.456 -11.669 6.00 11.00 C
HETATM 1075 CHD HEM A 1 6.928 4.145 -15.725 6.00 13.25 C
最终的结果是发现...
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| 1 | ATOM | 1058 | N | ARG | A | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N |
| 2 | ATOM | 1059 | CA | ARG | A | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C |
| 3 | ATOM | 1060 | C | ARG | A | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C |
| 4 | ATOM | 1061 | O | ARG | A | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O |
| 5 | ATOM | 1062 | CB | ARG | A | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C |
| 6 | ATOM | 1063 | CG | ARG | A | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C |
| 7 | ATOM | 1064 | CD | ARG | A | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C |
| 8 | ATOM | 1065 | NE | ARG | A | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N |
| 9 | ATOM | 1066 | CZ | ARG | A | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C |
| 10 | ATOM | 1067 | NH1 | ARG | A | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N |
| 11 | ATOM | 1068 | NH2 | ARG | A | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N |
| 12 | ATOM | 1069 | OXT | ARG | A | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O |
| 13 | TER | 1070 | ARG | A | 141 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 14 | HETATM | 1071 | FE | HEM | A | 1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE |
| 15 | HETATM | 1072 | CHA | HEM | A | 1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C |
| 16 | HETATM | 1073 | CHB | HEM | A | 1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C |
| 17 | HETATM | 1074 | CHC | HEM | A | 1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C |
| 18 | HETATM | 1075 | CHD | HEM | A | 1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
18 rows in set (0.00 sec)
哦......看......这排让它变脏...... TER 1070 ARG A 141
。我可以很容易地解决这个问题,如果你走我的路线,但如果你使用其他答案/方法,我不打算更新这个。
好的...对于愚蠢的行。我检查了我的测试数据集中每个值的起始位置和长度。当你加载不同的文件或者什么时,我不知道这些信息是否会随着你的变化而变化......所以我用它可以为你使用的每个文件设置它。
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $db = 'test';
my $host = 'localhost';
my $user = 'root';
my $pass = '';
my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass)
or die "Connection Error: $DBI::errstr\n";
my $localpath = 'C:\path\to\datums';
# first num is starting pos, second is length
my $fileinfo = { 'glucagon' => [[0,6], # 'name'
[7,4], # 'seq'
[12,4], # 'code1'
[17,3], # 'code2'
[21,1], # 'code3'
[23,3], # 'code4'
[27,12], # 'val1'
[39,7], # 'val2'
[47,7], # 'val3'
[55,5], # 'val4'
[61,5], # 'val5'
[69,10] # 'code5'
]
# 'second_file' => [ [0,5], # col1
# [6,5], # col2
# ]
}; # i am using this as my table name, too
foreach my $file (keys %$fileinfo) {
my $filename = "$localpath\\$file.txt";
if (open(my $fh => $filename)) {
my $colnum = scalar @{$fileinfo->{$file}};
my @placehoders;
for (1..$colnum) { push @placehoders, '?'; }
my $placeholders = join(',',@placehoders); # builds a string like: ?, ?, ?, ?, ...
# for our query that uses binding
# the null at the start of the insert is because my first column is an
# auto_incrememnt primary key that will be generated on insert by the db
my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");
while(my $line = <$fh>) {
$line =~ s/\s+$//; # trim trailing whitespace
if ($line ne q{}) { # if not totally blank
my @row;
my $index = 1;
# foreach value column position & length
foreach my $col (@{$fileinfo->{$file}}) {
my $value;
if ($col->[0] <= length($line)) {
$value = substr($line,$col->[0],$col->[1]);
$value =~ s/^\s+|\s+$//g; # trim trailing & leading whitespace
if ($value eq q{}) { undef $value; } # i like null values vs blank
}
$row[$index] = $value;
$index++;
}
for my $index (1..$colnum) {
$stmt->bind_param($index, $row[$index]);
}
$stmt->execute();
}
}
close $fh;
}
else { print "$file not opened\n"; }
}
新的数据:
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| 1 | ATOM | 1058 | N | ARG | A | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N |
| 2 | ATOM | 1059 | CA | ARG | A | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C |
| 3 | ATOM | 1060 | C | ARG | A | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C |
| 4 | ATOM | 1061 | O | ARG | A | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O |
| 5 | ATOM | 1062 | CB | ARG | A | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C |
| 6 | ATOM | 1063 | CG | ARG | A | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C |
| 7 | ATOM | 1064 | CD | ARG | A | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C |
| 8 | ATOM | 1065 | NE | ARG | A | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N |
| 9 | ATOM | 1066 | CZ | ARG | A | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C |
| 10 | ATOM | 1067 | NH1 | ARG | A | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N |
| 11 | ATOM | 1068 | NH2 | ARG | A | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N |
| 12 | ATOM | 1069 | OXT | ARG | A | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O |
| 13 | TER | 1070 | NULL | ARG | A | 141 | NULL | NULL | NULL | NULL | NULL | NULL |
| 14 | HETATM | 1071 | FE | HEM | A | 1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE |
| 15 | HETATM | 1072 | CHA | HEM | A | 1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C |
| 16 | HETATM | 1073 | CHB | HEM | A | 1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C |
| 17 | HETATM | 1074 | CHC | HEM | A | 1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C |
| 18 | HETATM | 1075 | CHD | HEM | A | 1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
我不是化学家......我不知道你在做什么究竟。就个人而言,如果我使用大型数据集......我会使用数据库。 (但是我相信你很可能需要一些你不能在db中模仿的函数......) – 2014-10-10 16:35:08
它只是找到文件中坐标之间的距离。我真的不知道你使用数据库是什么意思,问题在于处理数组。我制作了一些大型数组,并通过脚本以各种方式使用它们。我只是想找到一些信息,以便我的电脑内存使用较少的信息。 – 2014-10-10 16:39:14
是的。就我个人而言,我会解析这些文件并将它们加载到数据库的表中,因为它们旨在处理大量数据和计算。 [这是我们正在谈论的?](http://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html) – 2014-10-10 16:40:55