2013-03-01 105 views
2

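sed language-translation script: improving efficiency for long texts

Here's my problem. I'm a Spanish translator, and I have a very long Spanish-English glossary file of 50K entries. I also have a stoplist of more than 1K entries. I want to strip these entries out of the text I plan to translate. So I built a sed script that in turn builds two sed scripts from the glossaries; those scripts do the stripping and leave only the untranslated text (so I don't have to solve the same problem twice). This works well, but long texts take a very long time, sometimes as much as 15 minutes. Is that unavoidable, or is there a more efficient way to do this?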

Here's the main script:

#!/bin/sh 
before="$(date +%s)" 

#wordstxt=$(wc -w < $1) 
#mintime=$(expr "$wordstxt/200" |bc -l) 
#maxtime=$(expr "$wordstxt/175" |bc -l) 
#echo "Estimated time to process: between $mintime and $maxtime seconds." 
sed ' 
s/\,/\n/g   # strip all commas 
s/\?/\n/g  # strip question marks 
s/\*/\n/g  # strip asterisks 
s/\!/\n/g   # strip exclamation marks 
s/:/\n/g   # strip colons 
s/\-/\n/g   # strip hyphens 
s/\./\n/g   # strip periods 
s/«/\n/g   # strip left Euro-quotes 
s/»/\n/g   # strip right Euro-quotes 
s/”/\n/g   # strip slanted US quotes 
s/\"/\n/g  # strip left quotes 
s/(/\n/g   # strip left paren 
s/)/\n/g   # strip right paren 
s/\[/\n/g   # strip left bracket 
s/\]/\n/g   # strip right bracket 
s/¿/\n/g   # "¿" 
s/—/\n/g  # m-dash 
s/\ –\ /\n/g  # n-dash 
s/…/\n/g  # strip ellipsis as a single character, not three periods 
s/;/\n/g   # strip semicolon 
s/[0-9]/\n/g  # strip out all numbers, replace with returns 
' $1 > $1.z.tmp 
#echo "Punctuation eliminated." 

#cp ../../Spanish\ to\ English\ projects/glossary/stoplist.txt . 
sed ' 
s/^\ //g  # strip leading spaces 
s/\ $//   # strip trailing spaces 
/^$/d   # delete blank lines 
s/\./\n/g  # strip periods 
s/\ /\\ /g  # make spaces into literals 
s/^/s\//  # begins the substitution 
s/$/\/\\n\/g/ # concludes the substitution 

1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/ 

' stoplist.txt > stoplist.sed 
chmod +x stoplist.sed 
echo "Eliminating stopwords." 
./stoplist.sed $1.z.tmp > $1.0.tmp 

sed 's/\([A-Za-z\ ]*\t\).*/\1/' SpanishGlossary.utf8 > tempgloss.2.txt 
#echo "Target phrases stripped." 

sort -u tempgloss.2.txt > tempgloss.3.txt 

awk '{ print length(), $0 | "sort -rn" }' tempgloss.3.txt > tempgloss.4.txt 
#echo "List ordered by length." 

#echo "Now creating new sed script." # THIS AFFECTS THE SED SCRIPT, NOT THE OUTPUT FILE. 

sed ' 
s/[0-9]//g  # strip out all numbers 
s/^\ //g  # strip leading spaces -- all lines have them due to the sort 
/^$/d   # delete blank lines 
s/\//\\\//g  # make text slashes into literals 
s/"/\n/g   # strip quotes 
s/\t//g   # strip tabs 
s/\./\n/g  # strip periods 
s/'\''/\\'\''/g  # make straight apostrophes into literals 
s/'\’'/\\'\’'/g  # make curly apostrophes into literals 
s/\ /\\ /g  # make spaces into literals 
/^.\{0,5\}$/d  # delete lines with five or fewer characters 
s/^/s\/\\b/  # begins the substitution 
s/$/\\b\/\\n\/g/ # concludes the substitution 

1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/ 

' tempgloss.4.txt > glossy.sed 

#echo "glossy.sed created." 
chmod +x glossy.sed 

echo "Eliminating existing entries. This may take a while." 
./glossy.sed $1.0.tmp > $1.1.tmp 

echo "Now cleaning up lines." 
sed -e ' 
s/\ $//   # strip trailing spaces 
s/^\ *//g  # strip any and all leading spaces 
s/\ el$//g  # strip "el" from the end 
s/\ la$//g  # strip "la" from the end 
s/\ los$//g  # strip "los" from the end 
s/\ las$//g  # strip "las" from the end 
s/\ o$//g  # strip "o" from the end 
s/\ y$//g  # strip "y" from the end 
s/\ $//   # strip trailing spaces (yes, again) 
' $1.1.tmp > $1.2.tmp 

echo "Creating ngrams." 
./ngrams 5 < $1.2.tmp > $1.3.tmp 2> /dev/null 

linecount="$(wc -l < $1.3.tmp)" 
#echo $linecount "lines." 
if [ "$linecount" -gt "1000" ] 
then 
    echo "Eliminating single instances." 
    sed '/^1\t/d' $1.3.tmp > $1.4.tmp 
else 
    echo "Fewer than 1000 entries, so keeping all." 
    cp $1.3.tmp $1.4.tmp 
fi 

sed -e ' 
s/[0-9]//g  # strip out all numbers 
s/^\t//g   # strip leading tab 
s/^\ *//g  # strip any and all leading spaces 
/^.\{0,7\}$/d  # delete lines with seven or fewer characters 
s/\ $//   # strip trailing spaces (yes, again) 
#s/$/\t/   # add in the tab 
' $1.4.tmp > $1.csv 

echo "Looking for duplicates." 
sh ./dedupe $1.csv 

wordstxt=$(wc -w < $1) 
#echo $wordstxt 
wordslist=$(wc -w < $1.csv) 
#echo $wordslist 
wordspercent=$(echo "scale=4; $wordslist/$wordstxt" |bc -l) 
wordspercentage=$(echo "$wordspercent * 100" |bc -l) 


after="$(date +%s)" 
elapsed_seconds="$(expr $after - $before)" 
rate=$(echo "scale=3; $wordstxt/$elapsed_seconds" |bc -l) 
echo "Created "$1.csv", with $wordspercentage% left, in" $elapsed_seconds "seconds." #, for an effective rate of" $rate "words per second." 

rm tempgloss.*.txt 
rm *.tmp 
rm glossy.sed 
+0

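Interesting problem, but I don't have time to rewrite your script. Others may. You can combine word replacements like 's/\ el$|\ los|\ la$//'. Using '/g' on patterns that include the end-of-line marker '$' probably doesn't cost extra time, but it makes your code harder for others to understand. You can also split on many single characters at once, like 's/[,?*!:.-]/\n/g', but '[character-class]' ranges can get confusing. Good luck. – shellter 2013-03-02 02:10:44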

+0

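Thanks for the tips. Even after I posted this, I pulled the punctuation out of the top of the script and put it into the stoplist. Is there any advantage to the combining you're talking about? To having one super-gigantic line rather than thousands of small ones? – user1889034 2013-03-02 02:44:33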

+0

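Yes, each scan of a line costs you x time. Using a regex with, say, 5 ORed expressions (using '|') reduces that to roughly x/5. I wouldn't try to spell out every possible word on one 's/wd1|wd2/' line; you'll hit the point of diminishing returns in the time it takes to debug sed error messages. Group related words into each combined substitution so your code stays maintainable. There may be other tricks to reduce the overall run time. Sometimes more commands in a pipeline work out better, but I can't say for now. Good luck. – shellter 2013-03-02 02:53:11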
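The combining idea from the comment above can be sketched as follows. This is a minimal illustration, not the asker's script: `strip_trailing_words` is a hypothetical helper name, and it assumes a sed that accepts `-E` (extended regular expressions, where an unescaped `|` means "or"), which GNU and BSD sed both do.

```shell
#!/bin/sh
# Hypothetical helper: one substitution with alternation replaces the six
# separate "strip X from the end" substitutions in the cleanup step.
# -E selects extended regular expressions, so | and ( ) need no backslashes.
strip_trailing_words() {
    sed -E 's/ (el|la|los|las|o|y)$//'
}

# Demo: each input line loses at most one trailing article or conjunction.
printf '%s\n' 'casa el' 'perro los' 'gato y' 'mesa' | strip_trailing_words
```

The demo prints `casa`, `perro`, `gato` and `mesa`, one per line. Each line is scanned once for all six words instead of six times.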

Answers

0

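Rewrite your script in awk; it will run in seconds instead of minutes and will be simpler and clearer. sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk.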
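As a rough sketch of what an awk version might look like: load the stoplist into an array once, then test each word of the text with a single hash lookup, rather than running one substitution per stoplist entry over every line. Everything here is an assumption for illustration. `filter_stopwords` is a hypothetical name, the stoplist is assumed to hold one single-word entry per line (multi-word glossary phrases would need more work), and the separator list only covers ASCII punctuation.

```shell
#!/bin/sh
# Hypothetical helper: print every word of a text that is not in the stoplist.
#   $1 = stoplist file (one single-word entry per line; assumed non-empty)
#   $2 = text file
filter_stopwords() {
    awk '
        # First file: remember each stopword, lowercased.
        NR == FNR { stop[tolower($0)] = 1; next }
        # Second file: split on spaces, tabs, digits and ASCII punctuation,
        # then print the words that have no entry in the stop array.
        {
            n = split($0, words, /[ \t0-9,.;:!?()"-]+/)
            for (i = 1; i <= n; i++)
                if (words[i] != "" && !(tolower(words[i]) in stop))
                    print words[i]
        }
    ' "$1" "$2"
}

# Usage (input.txt is a placeholder name):
#   filter_stopwords stoplist.txt input.txt
```

Note that `tolower` may only fold ASCII letters in some awk implementations, so accented capitals could need separate handling.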

+0

I really like this idea, but I can't figure out how to apply awk to prose. I'm going to ask that as a separate question. Thanks! – user1889034 2013-03-03 00:54:51

0

You can combine many of these, perhaps for a speed gain:

s/[,?*!:.-]/\n/g
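A single bracket expression can stand in for most of the per-character substitutions at the top of the script. Inside a POSIX bracket expression a backslash is an ordinary character and a hyphen between two characters forms a range, so this sketch uses no backslashes and puts the hyphen last. `split_on_punct` is a hypothetical name. The non-ASCII marks («, », ¿, …) are left out here because matching them inside brackets depends on the locale.

```shell
#!/bin/sh
# Hypothetical helper: replace each listed punctuation mark or digit with a
# newline in one pass. The backslash-newline in the replacement is the
# portable spelling; GNU sed also accepts \n there.
split_on_punct() {
    sed 's/[,?*!:;.()"0-9-]/\
/g'
}

# Demo: three separators in one line become three line breaks.
printf '%s\n' 'hola, mundo. adios-ya' | split_on_punct
```

The demo prints four lines: `hola`, ` mundo`, ` adios` and `ya`. The leading spaces remain, which the later cleanup step in the script already strips.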