2017-01-23 68 views
-2

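One parameter for multiple patterns - grep

I am trying to search PDF files from the terminal, passing the search string on the command line. The search string can be a single word, multiple words (AND, OR), or an exact phrase. I want to keep just one parameter for all search queries. I will save the command below as a shell script and call that script through an alias defined in the .aliases file of zsh or bash.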

This follows sjr's answer here: search multiple pdf files

I used sjr's answer like this:

find ${1} -name '*.pdf' -exec sh -c 'pdftotext "{}" - | 
     grep -E -m'${2}' --line-buffered --label="{}" '"${3}"' '${4}'' \; 

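$1 takes the path

$2 limits the number of results

$3 is the context parameter (it accepts -A, -B, -C, alone or combined)

$4 takes the search string

The problem I am facing is with the value of $4. As I said earlier, I want this parameter to carry my search string, which can be a phrase, a single word, or multiple words in an AND/OR relationship.

I could not get the desired results. Phrase searches returned nothing at all until I followed Robin Green's comment, and even now the phrase results are not accurate.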

Edit: sample text from a judgment:

The original rule was that you could not claim for psychiatric injury in 
negligence. There was no liability for psychiatric injury unless there was also 
physical injury (Victorian Rly Commrs v Coultas [1888]). The courts were worried 
both about fraudulent claims and that if they allowed claims, the floodgates would 
open. 

The claimant was 15 metres away behind a tram and did not see the accident but 
later saw blood on the road. She suffered nervous shock and had a miscarriage. She 
sued for negligence. The court held that it was not reasonably foreseeable that 
someone so far away would suffer shock and no duty of care was owed. 

White v Chief Constable of South Yorkshire [1998] The claimants were police 
officers who all had some part in helping victims at Hillsborough and suffered 
psychiatric injury. The House of Lords held that rescuers did not have a special 
position and had to follow the normal rules for primary and secondary victims. 
They were not in physical danger and not therefore primary victims. Neither could 
they establish they had a close relationship with the injured so failed as 
secondary victims. It is necessary to define `nervous shock' which is the rather 
quaint term still sometimes used by lawyers for various kinds of 
psychiatric injury...rest of para 

word1 can be: shock, (nervous shock)

word2 can be: psychiatric

exact phrase: (nervous shock)

Commands:

alias s='sh /path/shell/script.sh' 
export p='path/pdf/files' 

In the terminal:

s "$p" 10 -5 "word1/|word2"   #for OR search 
s "$p" 10 -5 "word1.*word2.*word3" #for AND search 
s "$p" 10 -5 ""exact phrase""  #for phrase search 

Second test sample: an example PDF file to run the command on: Test-File. It has 4 pages (part of a 361-page file).

If we run the following command on it, as given in the solution:

s "$p" 10 -5 'doctrine of basic structure' > ~/desktop/BSD.txt && open ~/desktop/BSD.txt

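we get the relevant text and avoid going through the entire file. I thought this would be a cool way to read just what we want, rather than the traditional approach.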

+1

Why the downvote? I would like to know, so that I can take more care when asking questions. – lawsome

+2

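Single quotes prevent the quoted parameters from being expanded (assuming you are using bash or sh), which is not what you want. You should use double quotes to quote parameters in bash or sh. Or are you using some other shell? –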

+1

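I didn't downvote, and I too wish people would leave feedback when they do. That said, reducing the problem to an [MCVE (Minimal, Complete, and Verifiable Example)](http://stackoverflow.com/help/mcve) is always worthwhile. General tips on asking questions can be found [here](http://stackoverflow.com/help/how-to-ask). – mklement0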

Answer

1

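You need to:

  • Pass a double-quoted command string to sh -c so that the embedded shell variable references are expanded (which in turn requires escaping embedded " instances as \").

  • Quote the regex with printf %q so that it can be included safely in the command string - note that this requires bash, ksh, or zsh as the shell.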

dir=$1 
numMatches=$2 
context=$3 
regexQuoted=$(printf %q "$4") 

find "${dir}" -type f -name '*.pdf' -exec sh -c "pdftotext \"{}\" - | 
    grep -E -m${numMatches} --with-filename --label=\"{}\" ${context} ${regexQuoted}" \; 
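
The effect of the quoting style on what sh -c receives can be checked in isolation. The following is a minimal sketch, not part of the original answer: the variable names and the search string are made up for illustration, and it only uses POSIX shell features. It does not cover values that themselves contain ", $ or backticks, which is the case printf %q is meant to handle.

```shell
# Hypothetical search string; the embedded space is what makes quoting matter.
regex='nervous shock'

# Single-quoted command string: $regex is left for the inner sh to expand.
# The inner sh has no such variable (it was never exported), so it sees
# an empty string.
inner_single=$(sh -c 'printf "%s" "$regex"')

# Double-quoted command string: the outer shell expands $regex first, and
# the escaped \" keep the value together as one word in the inner sh.
inner_double=$(sh -c "printf '%s' \"$regex\"")

printf 'single: [%s]\ndouble: [%s]\n' "$inner_single" "$inner_double"
```

The single-quoted variant yields an empty value, while the double-quoted one passes the phrase through intact.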

The three invocation scenarios would then be:

s "$p" 10 -5 'word1|word2'   #for OR search 
s "$p" 10 -5 'word1.*word2.*word3' #for AND search 
s "$p" 10 -5 'exact phrase'   #for phrase search 

Note that there is no need to escape |, and no need for an extra layer of double quotes around exact phrase.
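
To see how the three kinds of pattern behave with grep -E, here is a small sketch that runs them against a few made-up lines loosely based on the judgment text in the question. The sample lines and variable names are assumptions for illustration only. One caveat worth knowing: a pattern such as 'word1.*word2' only matches when both words sit on the same line in that order. The last command shows one possible order-independent alternative (chaining two greps), which is not part of the answer above and does not fit into a single regex parameter.

```shell
# Made-up sample lines (illustration only).
sample='She suffered nervous shock and had a miscarriage.
The claimants suffered psychiatric injury.
It was not foreseeable that someone so far away would suffer shock.
shock is listed before nervous on this line'

# OR search: a line matches if it contains either alternative.
or_count=$(printf '%s\n' "$sample" | grep -E -c 'miscarriage|psychiatric')

# AND search: both words on one line, in this order.
and_count=$(printf '%s\n' "$sample" | grep -E -c 'nervous.*shock')

# Phrase search: the words adjacent, separated by a single space.
phrase_count=$(printf '%s\n' "$sample" | grep -E -c 'nervous shock')

# Order-independent AND (an alternative, not from the answer): chain greps.
any_order_count=$(printf '%s\n' "$sample" | grep -E 'nervous' | grep -E -c 'shock')

printf 'OR=%s AND=%s phrase=%s any-order=%s\n' \
  "$or_count" "$and_count" "$phrase_count" "$any_order_count"
```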

Also note that I have replaced --line-buffered with --with-filename, because I assume that is what you meant (matching lines prefixed with the PDF file path).
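
As a quick illustration of why --label and --with-filename go together: grep reads from a pipe here, so it has no file name of its own to print. The sketch below assumes GNU grep (both options are nonstandard), and case.pdf is a hypothetical name standing in for the PDF path.

```shell
# grep reads stdin, so --label supplies the name that --with-filename prints.
labeled=$(printf 'first line\nnervous shock\n' |
  grep -E --with-filename --label='case.pdf' 'nervous shock')

# With GNU grep this prints: case.pdf:nervous shock
printf '%s\n' "$labeled"
```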


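Note that the approach above has to spawn a shell instance for every input path, which is inefficient, so consider rewriting your command as follows, which also avoids the need for printf %q (assuming regex=$4):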

# Read one path per line; IFS= and -r keep whitespace and backslashes intact.
find "${dir}" -type f -name '*.pdf' |
    while IFS= read -r f; do
        pdftotext "$f" - |
            grep -E -m${numMatches} --with-filename --label="$f" ${context} "${regex}"
    done

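The above assumes that your file names contain no embedded newlines, which is rarely a real-world concern. If it is, there are ways to work around it.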
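
One possible workaround, sketched here as an assumption rather than as the answer's own method, is to let find hand the file names to a single inner shell as positional parameters via -exec ... {} +, so that names are never parsed from a text stream. The function name search_pdfs is made up for illustration, and the parameters mirror $1 to $4 from the question. The values are passed as arguments rather than spliced into the command string, so the inner script can stay single-quoted.

```shell
# Hypothetical wrapper: $1 = directory, $2 = max matches per file,
# $3 = context option(s), $4 = regex.
search_pdfs() {
  dir=$1 numMatches=$2 context=$3 regex=$4
  # The inner script is single-quoted on purpose: the values arrive as
  # positional parameters, followed by the file names that find appends.
  find "$dir" -type f -name '*.pdf' -exec sh -c '
    numMatches=$1 context=$2 regex=$3
    shift 3
    for f do
      # $context is left unquoted so that several options can be passed.
      pdftotext "$f" - |
        grep -E -m"$numMatches" --with-filename --label="$f" $context "$regex"
    done
  ' sh "$numMatches" "$context" "$regex" {} +
}
```

A call would look like search_pdfs "$p" 10 -5 'exact phrase'. Because find batches many paths into each inner shell, this also avoids starting one shell per file.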

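An additional advantage of this solution is that it uses only POSIX-compliant shell features, though note that the grep command uses nonstandard options.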