How do I find lines that contain an integer given in a file?

I have a file dict that contains one integer per line:

123 
456

I want to find the lines in the file file that contain exactly one of the integers listed in dict.

If I use

$ grep -w -f dict file

I get false matches such as

12345 foo 
23456 bar

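These are false matches, since 12345 != 123 and 23456 != 456. The problem is that the -w option also treats digits as word characters. The -x option will not work either, because the lines in file can contain other text. What is the best way to do this? It would be great if the solution could provide progress monitoring and perform well on a large dict and file.

Source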

2012-08-16 qazwsx

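Do you have to use 'grep', or are you open to other solutions? – 2012-08-16 22:16:29

No. Any command-line tool is fine. – qazwsx 2012-08-16 22:18:41

Your 'grep' command works for me, without the false positives you describe. – 2012-08-17 00:48:29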

A fairly general approach, using awk:

awk 'FNR==NR { array[$1]++; next } { for (i=1; i<=NF; i++) if ($i in array) print $0 }' dict file

Explanation:

FNR==NR { } ## FNR is number of records relative to the current input file. 
      ## NR is the total number of records. 
      ## So this statement simply means `while we're reading the 1st file 
      ## called dict; do ...` 

array[$1]++; ## Add the first column ($1) to an array called `array`. 
      ## I could use $0 (the whole line) here, but since you have said 
      ## that there will only be one integer per line, I decided to use 
      ## $1 (it strips leading and trailing whitespace, if any) 

next   ## process the next line in `dict` 

for (i=1; i<=NF; i++) ## loop through each column in `file` 

if ($i in array)  ## if one of these columns can be found in the array 

print $0    ## print the whole line out
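The two phases described above can also be sketched in Python. This is only an illustration of the same idea, not a replacement for the awk one-liner: the function name `matching_lines` and the sample data are assumptions made up for this example. Unlike the awk version, it emits a line once even when several of its fields match.

```python
def matching_lines(dict_lines, file_lines):
    """Yield each line of file_lines that has at least one
    whitespace-separated field equal to an entry of dict_lines."""
    # Phase 1 (the FNR==NR block): store the first field of every dict line.
    wanted = set()
    for line in dict_lines:
        fields = line.split()
        if fields:
            wanted.add(fields[0])
    # Phase 2: compare whole fields, so 12345 never matches 123.
    for line in file_lines:
        if any(field in wanted for field in line.split()):
            yield line


# Hypothetical sample data, modelled on the question.
sample_dict = ["123 ", "456"]
sample_file = ["12345 foo", "23456 bar", "123 baz", "qux 456"]
print(list(matching_lines(sample_dict, sample_file)))
# prints ['123 baz', 'qux 456']
```

Because whole fields are compared through a set lookup, the cost per line of file does not grow with the size of dict.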

To process multiple files using a bash loop:

## This will process files; like file, file1, file2, file3 ... 
## And create output files like, file.out, file1.out, file2.out, file3.out ... 

for j in file*; do awk -v FILE="$j.out" 'FNR==NR { array[$1]++; next } { for (i=1; i<=NF; i++) if ($i in array) print $0 > FILE }' dict "$j"; done

If you are interested in using tee with multiple files, you might want to try something like this:

for j in file*; do awk -v FILE="$j.out" 'FNR==NR { array[$1]++; next } { for (i=1; i<=NF; i++) if ($i in array) { print $0 > FILE; print FILENAME, $0 } }' dict "$j"; done 2>&1 | tee output

This shows the name of the file being processed together with each matching record found, and writes a 'log' to the file output.

Source

2012-08-16 23:23:51 Steve

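Could you explain how the program works? – qazwsx 2012-08-16 23:32:06

Also, can the awk approach monitor progress? – qazwsx 2012-08-16 23:34:10

@user001: If 'dict' is very large, you won't be able to tell how many numbers have been added to 'array'. While 'file' is being read, however, each matching line is printed as soon as a match is found. – Steve 2012-08-16 23:56:03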
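On the progress question, one possible approach is to report a running line count on a separate stream while matches go to the normal output. The sketch below is in Python rather than awk and rests on assumptions: the function name `filter_with_progress` and the parameters `report_every` and `log` are invented for this illustration.

```python
import io
import sys


def filter_with_progress(wanted, lines, report_every=100000, log=sys.stderr):
    """Yield lines that have a field in `wanted`, writing a running
    count of lines read to `log` every `report_every` lines."""
    count = 0
    for count, line in enumerate(lines, start=1):
        if any(field in wanted for field in line.split()):
            yield line
        if count % report_every == 0:
            print(f"{count} lines read", file=log)
    print(f"done: {count} lines read", file=log)


# Hypothetical usage with an in-memory log instead of stderr.
log = io.StringIO()
hits = list(filter_with_progress({"123"}, ["123 a", "12345 b", "x 123"],
                                 report_every=2, log=log))
print(hits)
print(log.getvalue(), end="")
```

Keeping the progress messages on stderr (the default here) means they do not mix with the matched lines when the output is redirected to a file.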

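You can do this quite easily with a Python script, for example:

import sys

# Read the dictionary; split() with no argument also drops stray whitespace and blank lines.
with open(sys.argv[1]) as dict_file:
    numbers = set(dict_file.read().split())

with open(sys.argv[2]) as inf:
    for s in inf:
        fields = s.split()
        # Only the first field is compared; blank lines are skipped.
        if fields and fields[0] in numbers:
            sys.stdout.write(s)

Error checking and recovery are left for the reader to implement.

Source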

2012-08-16 22:21:05

Well, ideally I would like to do this with GNU utilities on the Bash command line. If it takes many lines of code, I can create a script. Thanks. – qazwsx 2012-08-16 22:24:27

Add word boundaries to the entries in dict, like this:

\<123\> 
\<456\>

The -w option is then not needed. Just run:

grep -f dict file

Source

2012-08-16 22:25:32 oldmonk

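This works. However, when I try it with a 'file' containing 20000 lines, it is very slow. Any suggestions for improving performance? Also, while it is running slowly, is it possible to monitor which entry in the dictionary is being processed, so that I can get some idea of the program's progress? – qazwsx 2012-08-16 23:20:28

Put all the patterns on a single line: \<123\>\|\<456\> – oldmonk 2012-08-17 01:28:14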
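The single-line pattern suggested in the comment above can be illustrated in Python, with the caveat that this is an analogue rather than the grep command itself: Python's `re` module uses `\b` where GNU grep uses `\<` and `\>`, and the helper name `build_pattern` and the sample lines are assumptions made for this sketch.

```python
import re


def build_pattern(numbers):
    """Combine all numbers into one alternation wrapped in word boundaries."""
    alternatives = "|".join(re.escape(n) for n in numbers)
    return re.compile(r"\b(?:" + alternatives + r")\b")


# Hypothetical sample data.
pattern = build_pattern(["123", "456"])
lines = ["12345 foo", "23456 bar", "123 baz", "id=456;"]
print([line for line in lines if pattern.search(line)])
# prints ['123 baz', 'id=456;']
```

Note that a word boundary is looser than a whitespace-separated field: `id=456;` matches here, whereas a field-by-field comparison would reject it.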
