Parallelizing a command-line tool with a bash script that uses GNU semaphore

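I have to evaluate a command-line tool on a fairly large benchmark consisting of 50000 files.
Unfortunately, the tool is not parallelized, and running it sequentially on a benchmark of this size takes too long.
I read some posts about GNU parallel (or GNU semaphore), but I could not find a good example that shows how to combine the results of multiple background processes spawned by GNU semaphore.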

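The tool in question takes a single file as its input argument, so I had to find a way to collect all the results produced by running the tool multiple times in parallel.
In addition, I do not want to lose any results in case of a crash.
If the script is cancelled, it should not reprocess any file that has already been processed.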

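To make sure that the background process worker has enough work to do, the script below passes multiple files to a worker at once.
The bash script works quite well for my use case.

In case someone has a similar problem, I would like to share the script with you.
The script can be adapted to other use cases by modifying the worker function and the variables $JOBS and $WPSIZE.

I would be very happy if you could give me some feedback on how to make the script more efficient.

Many thanks, Julian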

#!/bin/bash 

# make variables available in function started by 
# gnu semaphore 
export FINALRES="result.log" 
export RESFIFO="/tmp/res.fifo" 
export FILFIFO="/tmp/fil.fifo" 
export FILELIST="/tmp/flist" 
export WPSIZE=5 
export JOBS=4 

PUTFPID="" 
WRITPID="" 

# find input files to process
find . -name "*.txt" > ${FILELIST} 

# setup fifos and files 
[ ! -e "${FINALRES}" ] && touch "${FINALRES}" 
[ ! -e "${RESFIFO}" ] && mkfifo "${RESFIFO}" 
[ ! -e "${FILFIFO}" ] && mkfifo "${FILFIFO}" 

FILES=$(diff ${FINALRES} ${FILELIST} | grep '>' | cut -d '>' -f2 | tr -d ' ') 
exec 4<> ${RESFIFO} 
exec 5<> ${FILFIFO} 

trap cleanup EXIT TERM 

function cleanup() {
    # stop the background helpers and remove the fifos
    echo "cleanup"
    [ -n "${PUTFPID}" ] && kill -9 "${PUTFPID}" > /dev/null 2>&1
    [ -n "${WRITPID}" ] && kill -9 "${WRITPID}" > /dev/null 2>&1
    rm -f "${RESFIFO}"
    rm -f "${FILFIFO}"
}

# this function always takes $WPSIZE (or fewer) files from the fifo
function readf() {
    local cnt=0 line
    while read -r -t 2 line; do
        echo "$line"
        cnt=$((cnt + 1))
        [ "${cnt}" -eq "${WPSIZE}" ] && break
    done <&5
}

# this function is called by gnu semaphore and executed in the background
function worker() {
    for fil in "${@}"; do
        # do something ...
        echo "result" > "${RESFIFO}"
    done
    exit 0
}

# this function is used (at the end) to write the computation results to a file
function writeresult() {
    while read -r line; do
        [ "${line}" = "quit" ] && break
        echo "${line}" >> "${FINALRES}"
    done < "${RESFIFO}"
}

# this simple helper puts all input files into a fifo
function putf() {
    for fil in $FILES; do
        echo "${fil}" > "${FILFIFO}"
    done
}

# make function worker known to gnu semaphore 
export -f worker 
# put file into fifo 
putf & 
PUTFPID=$! 
writeresult & 
WRITPID=$! 

while true; do 
    ARGS=$(readf) 
    [ -z "${ARGS}" ] && break 
    # word splitting is used on purpose here (call worker with multiple params)
    sem --bg --jobs "${JOBS}" worker ${ARGS} 
done 

sem --wait 

echo "quit" > ${RESFIFO} 
wait 

echo "all jobs are finished" 
exit 0


2016-09-30 Julian

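Please take a look at: http://www.shellcheck.net/ – Cyrus

Thanks, I changed the script according to the sanity checks from shellcheck.net, except for the word splitting in the line 'sem --bg --jobs "${JOBS}" worker ${ARGS}', which I did on purpose ;-). – Julian

You can move the '>> ${FINALRES}' outside the loop that contains it, so that the file does not have to be opened, seeked and appended to for every single result. –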
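
The suggestion in the last comment could look roughly like the sketch below. It is only an illustration, not the script's final form: `append_results` is a hypothetical name, and the output file is passed as an argument instead of being taken from $FINALRES.

```shell
#!/bin/bash

# Hypothetical variant of writeresult(): the output file is opened once
# for the whole loop instead of once per result line.
append_results() {
    local out="$1" line
    while read -r line; do
        [ "${line}" = "quit" ] && break
        echo "${line}"
    done >> "${out}"
}
```

In the script above it would be called as `append_results "${FINALRES}" < "${RESFIFO}"`.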

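Appending to a FIFO in parallel is usually a bad idea: you need to know a lot about how this version of the OS buffers FIFOs for it to be safe. This example shows why: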

#!/bin/bash 

size=3000 

myfifo=/tmp/myfifo$$ 
mkfifo $myfifo 

printone() { 
    a=$(perl -e 'print ((shift)x'$size')' $1) 
    # Print a single string 
    echo $a >> $myfifo 
} 
printone a & 
printone b & 
printone c & 
printone d & 

# Wait a little to get the printones started 
sleep .1 

cat $myfifo | perl -ne 'for(split//,$_){
    if($_ eq $l) {
        $c++
    } else {
        /\n/ and next;
        print $l,1+$c," "; $l=$_; $c=0;
    }
}'
echo

With size=10 you will always get:

1 a10 b10 c10

This means that 10 a's were read from the FIFO, followed by 10 b's, followed by 10 c's. I.e. no mixing.

But change it to size=100000 and you get something like:

1 d65536 b65536 c100000 d34256 b34256 a100000 d208

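65K d's were read, then 65K b's, then 100K c's, then 34K d's, 34K b's, then 100K a's, and finally 208 d's. I.e. the four outputs are mixed. Very bad.

For that reason I advise against appending to the same FIFO in parallel: there is a risk of a race condition, and it can usually be avoided.

In your case it seems you simply want to `# do something ...` to each of the 50000 files, and that is dead simple: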
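
Some background that the answer does not spell out: POSIX only guarantees that a write to a pipe or FIFO is atomic when it is at most PIPE_BUF bytes long, so larger writes may be interleaved with those of other writers. The sketch below checks a payload size against that limit; `fits_pipe_buf` is a hypothetical helper name, and using /tmp as the default path is an assumption.

```shell
#!/bin/bash

# Hypothetical helper: succeed if a payload of the given size (in bytes) is
# within the atomic-write limit (PIPE_BUF) for FIFOs below the given path.
fits_pipe_buf() {
    local size="$1" path="${2:-/tmp}" limit
    limit=$(getconf PIPE_BUF "${path}") || return 2
    [ "${size}" -le "${limit}" ]
}
```

For example, `fits_pipe_buf 10 && echo "single write is atomic"` prints the message, whereas a size such as 100000 is far above the limit on common systems.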
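
One way to avoid the shared FIFO, sketched under simplifying assumptions: every background job writes to a private file, and the files are concatenated only after all jobs have finished. `run_separately` is a hypothetical helper, the `printf` stands in for the real work, and unlike sem this sketch does not limit the number of simultaneous jobs.

```shell
#!/bin/bash

# Hypothetical helper: run one background job per argument, each writing
# to its own file, then concatenate the files once all jobs are done.
run_separately() {
    local outdir arg n i=0
    outdir=$(mktemp -d) || return 1
    for arg in "$@"; do
        # Placeholder for the real work; each job has a private output file.
        printf 'result for %s\n' "${arg}" > "${outdir}/${i}.out" &
        i=$((i + 1))
    done
    wait    # all writers have finished, so nothing can interleave
    for ((n = 0; n < i; n++)); do
        cat "${outdir}/${n}.out"
    done
    rm -rf "${outdir}"
}
```

Calling `run_separately a b c > results` writes one intact block per job, in argument order.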

do_something() {
    # do something ...
    echo "do something to $1"
    echo "result of $1 is foo"
}
export -f do_something
find . -name "*.txt" | parallel do_something > results

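Here GNU Parallel helps you by making sure that the output and errors of each job are not mixed.

To avoid reprocessing after a crash/cancellation, use --joblog and --resume.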
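
A minimal sketch of how --joblog and --resume might be combined with the example above. The file names `jobs.log` and `results` and the wrapper `process_all` are assumptions made for illustration. The input list is sorted because --resume identifies jobs by their sequence number in the job log, so the input should arrive in the same order on every run.

```shell
#!/bin/bash

# Placeholder for the real tool; prints one result line per input file.
do_something() {
    echo "result of $1 is foo"
}
export -f do_something

# Hypothetical wrapper: process every *.txt file below the current directory.
# Results are appended, so an interrupted run keeps what it already produced;
# --resume consults the job log and skips jobs that were already run.
process_all() {
    find . -name "*.txt" | sort |
        parallel --joblog jobs.log --resume do_something >> results
}
```

Running `process_all` again after an interruption should continue with the jobs that are not yet recorded in `jobs.log`, provided the set of input files has not changed in between.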


2016-10-01 10:07:05

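Thank you very much - I will add locking to the script above to make it thread-safe. – Julian

Before doing that, please consider reading the GNU Parallel tutorial: it may save you a lot of time. man parallel_tutorial –

Thanks Ole. I have changed the script according to your suggestions and the GNU Parallel documentation. I have made it available at https://gist.github.com/julianthome/161e6734c36611fcf03c91c9f76ebd5a – Julian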
