Fast alternative to grep -f

file.contain.query.txt

ENST001
ENST002
ENST003

file.to.search.in.txt

ENST001 90
ENST002 80
ENST004 50

Since ENST003 has no entry in the second file and ENST004 has no entry in the first file, the expected output is:

ENST001 90
ENST002 80

To run many queries against a particular file, we usually use grep like this:

grep -f file.contain.query <file.to.search.in >output.file

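Since I have about 10,000 queries and almost 100,000 rows in file.to.search.in, this takes a very long time to finish (around 5 hours). Is there a fast alternative to grep -f?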

2012-07-15 user1421408

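What exactly do you need? Do you want the lines of the second file filtered by the keys in the first? – 2012-07-15 06:54:16

I have edited in the expected result. – user1421408 2012-07-15 06:56:40

The input redirection is unnecessary. – 2012-07-15 11:02:49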

If you want a pure Perl option, read the keys from your query file into a hash table, then check standard input against those keys:

#!/usr/bin/env perl
use strict;
use warnings;

# build hash table of keys
my %keyring;
open my $keys, '<', 'file.contain.query.txt'
    or die "Cannot open file.contain.query.txt: $!";
while (my $key = <$keys>) {
    chomp $key;
    $keyring{$key} = 1;
}
close $keys;

# look up the key from each line of standard input
while (my $line = <STDIN>) {
    chomp $line;
    # assuming the search file is tab-delimited; replace the delimiter as needed
    my ($key, $value) = split /\t/, $line;
    print "$line\n" if defined $key && exists $keyring{$key};
}

You would use it like this:

lookup.pl < file.to.search.txt

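A hash table can use a fair amount of memory, but searches are much faster (hash table lookups take constant time), which is handy since you have ten times more keys to look up than to store.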
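For comparison, the same hash-lookup idea can be sketched with awk, whose arrays are associative. This is only an illustration, not part of the answer above: the scratch directory and the `workdir` and `result` variables are made up for the demo, and it assumes the key is the first whitespace-separated field of each line.

```shell
#!/bin/sh
# Hypothetical demo: recreate the question's two sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'ENST001\nENST002\nENST003\n' > "$workdir/file.contain.query.txt"
printf 'ENST001 90\nENST002 80\nENST004 50\n' > "$workdir/file.to.search.in.txt"

# While reading the first file (NR == FNR), store each key in the array "keys".
# While reading the second file, print a line only if its first field is a stored key.
result=$(awk 'NR == FNR { keys[$1]; next } $1 in keys' \
    "$workdir/file.contain.query.txt" "$workdir/file.to.search.in.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

With the sample data this prints the two lines the question expects, ENST001 90 and ENST002 80. Each data line costs one array lookup, so the run time grows with the size of the two files rather than with their product.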

2012-07-15 07:12:15

This is a Ferrari compared with grep -f. Thanks! – user1421408 2012-07-15 07:22:10

Perfect solution; +1 – 2012-07-15 10:18:56

This Perl code may help you:

use strict;
use warnings;

open my $file1, "<", "file.contain.query.txt" or die $!;
open my $file2, "<", "file.to.search.in.txt" or die $!;

# %KEYS marks the keys listed in "file.contain.query.txt"
my %KEYS = ();

while (my $line = <$file1>) {
    chomp $line;
    $KEYS{$line} = 1;
}

# Print each "key value" pair whose key appears in %KEYS
while (my $line = <$file2>) {
    if ($line =~ /^(\w+)\s+(\d+)/) {
        print "$1 $2\n" if $KEYS{$1};
    }
}

close $file1;
close $file2;

2012-07-15 07:07:13

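You forgot to check the return values of your system calls. – tchrist 2012-07-15 16:08:05

MySQL:

Importing the data into MySQL or similar software would give a huge improvement. Would that be feasible? You could see results within a few seconds.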

mysql -e 'select search.* from search join contains using (keyword)' > outfile.txt 

# but first you need to create the tables like this (only once off) 

create table contains (
    keyword varchar(255) 
    , primary key (keyword) 
); 

create table search (
    keyword varchar(255) 
    ,num bigint 
    ,key (keyword) 
); 

# and load the data in: 

load data infile 'file.contain.query.txt' 
    into table contains fields terminated by "add column separator here"; 
load data infile 'file.to.search.in.txt' 
    into table search fields terminated by "add column separator here";

2012-07-15 07:18:12

I haven't tested this, but it should work for your case with a little tweaking. It needs very little memory, unless you want it to be memory-based. – 2012-07-15 07:19:41

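use strict;
use warnings;

# Sort both files so they can be merged in a single pass.
# Assumes the IDs are zero-padded to equal width, so text order matches numeric order.
system("sort file.contain.query.txt > qsorted.txt") == 0 or die "sort failed: $?";
system("sort file.to.search.in.txt > dsorted.txt") == 0 or die "sort failed: $?";

open(my $qfile, '<', 'qsorted.txt') or die "Cannot open qsorted.txt: $!";
open(my $dfile, '<', 'dsorted.txt') or die "Cannot open dsorted.txt: $!";

my $dline = <$dfile>;
while (defined(my $qline = <$qfile>) && defined $dline) {
    my ($queryid) = ($qline =~ /ENST(\d+)/) or next;
    # Advance through the data until its ID passes the current query ID.
    while (defined $dline) {
        my ($dataid) = ($dline =~ /ENST(\d+)/);
        last if defined $dataid && $dataid > $queryid;  # keep this line for the next query
        print $dline if defined $dataid && $dataid == $queryid;
        $dline = <$dfile>;
    }
}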

2012-07-15 07:26:56 perreal

If you have fixed strings, use grep -F -f. This is much faster than a regex search.
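A minimal sketch of that suggestion on the question's sample data follows. The `-w` flag is an addition not mentioned in the answer (GNU and BSD grep both support it): fixed-string matching is substring matching, so without it the key ENST001 would also match the extra ENST0011 line added here for illustration. The scratch directory and the `workdir` and `result` variables are hypothetical.

```shell
#!/bin/sh
# Hypothetical demo files in a scratch directory.
workdir=$(mktemp -d)
printf 'ENST001\nENST002\nENST003\n' > "$workdir/file.contain.query.txt"
# ENST0011 is an extra row (not in the question) that shows why -w matters.
printf 'ENST001 90\nENST002 80\nENST004 50\nENST0011 70\n' > "$workdir/file.to.search.in.txt"

# -F: treat every pattern as a fixed string rather than a regular expression
# -w: accept a match only when it forms a whole word
# -f: read the patterns from a file, one per line
result=$(grep -F -w -f "$workdir/file.contain.query.txt" "$workdir/file.to.search.in.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Note that this still matches a key anywhere on the line, not only in the first column, which is fine as long as the other columns can never contain a key.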

2012-07-15 08:17:50 tripleee

If the files are already sorted:

join file1 file2

If not:

join <(sort file1) <(sort file2)
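Here is a worked sketch of the join approach on the question's sample data. It uses temporary sorted files instead of process substitution so that it also runs under a plain POSIX sh, and it pins the locale with LC_ALL=C so that sort and join agree on the ordering. Both choices, along with the `workdir` and `result` names, are assumptions of this sketch rather than part of the answer.

```shell
#!/bin/sh
# Hypothetical demo files in a scratch directory.
workdir=$(mktemp -d)
printf 'ENST003\nENST001\nENST002\n' > "$workdir/file1"
printf 'ENST004 50\nENST002 80\nENST001 90\n' > "$workdir/file2"

# join needs both inputs sorted on the join field (field 1 by default),
# using the same collation order that join itself uses.
LC_ALL=C sort "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort "$workdir/file2" > "$workdir/file2.sorted"

# Print the lines whose first field appears in both files.
result=$(LC_ALL=C join "$workdir/file1.sorted" "$workdir/file2.sorted")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Because join only pairs lines on the chosen field, a key that happens to appear in another column is never matched.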

2012-07-15 11:01:57

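If you are using Perl 5.10 or later, you can join the query terms into a single regular expression, with the terms separated by the pipe character (for example: ENST001|ENST002|ENST003). Perl builds a trie, which, like a hash, does lookups in constant time. It should run about as fast as the solution that uses a lookup hash. This is just to show another way to do it.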

#!/usr/bin/perl
use strict;
use warnings;
use Inline::Files;

my $query = join "|", map {chomp; $_} <QUERY>;

while (<RAW>) {
    print if /^(?:$query)\s/;
}

__QUERY__
ENST001
ENST002
ENST003
__RAW__
ENST001 90
ENST002 80
ENST004 50
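The same alternation idea can be sketched in shell by building the pattern with paste and handing it to grep -E. This is only an analogue of the Perl answer: it makes no claim about how grep handles the alternation internally, it assumes the keys contain no regex metacharacters, and with 10,000 keys the pattern may exceed the command-line length limit on some systems. The `workdir`, `pattern` and `result` names are made up for the demo.

```shell
#!/bin/sh
# Hypothetical demo files in a scratch directory.
workdir=$(mktemp -d)
printf 'ENST001\nENST002\nENST003\n' > "$workdir/file.contain.query.txt"
printf 'ENST001 90\nENST002 80\nENST004 50\n' > "$workdir/file.to.search.in.txt"

# Join all query lines into one alternation: ENST001|ENST002|ENST003
pattern=$(paste -s -d '|' "$workdir/file.contain.query.txt")

# Anchor at the start of the line and require whitespace after the key,
# mirroring /^(?:$query)\s/ in the Perl version.
result=$(grep -E "^($pattern)[[:space:]]" "$workdir/file.to.search.in.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```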

2012-07-15 15:13:27
