2017-08-25
0

Replacing special characters and extracting strings with awk/gsub

I have a file containing many lines like the following:

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">XX:The quick brown fox jumped over the lazy </a> -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png" alt="validate"> - user 

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">YY:Jack and Jill went up the hill </a> -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png" alt="validate"> - user 

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">ZZ: Mary had a little lamb </a> -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png" alt="validate"> - user 

I want to extract the following strings and discard everything else.

XX: The quick brown fox jumped over the lazy 
YY: Jack and Jill went up the hill 
ZZ: Mary had a little lamb 

So far I have tried the following awk command, but it seems to be limited to XX; it would have to be changed for YY and ZZ.

awk '{gsub(/^.*XX:/,"XX:"); gsub(/[<\a>].*$/,"[</a>].");print}' 

Can anyone suggest a way to do this with any other standard Linux tool? Thanks.

+0

How generic are XX/YY/ZZ? If it is just those, you can use '[XYZ]{2}' in most regex engines. – stevesliva

+0

@stevesliva, I think the problem is more (or also) that the OP has to change the replacement string as well as which letters the regex matches. – jas

+0

Hi, Jas is correct: the string before the ':' varies, and handling that is a requirement. Thanks for your replies. – niknak

Answers

0

^.XX means "any character followed by XX at the start of a line" - it will not match XX in the middle of a line. [<\a>] means "any of the characters <, \, a, or >" - it will not match the string <\a>. Find a regexp tutorial...
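To make the bracket-expression point concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration, and it uses the forward-slash form of the closing tag, </a>, rather than the <\a> typed in the question. A bracket expression matches one character from the set at a time, whereas the plain sequence matches only the whole tag.

```shell
# Illustrative input only; not taken from the question's file.
sample='a lamb </a> tail'

# Bracket expression: every single '<', '/', 'a' or '>' is replaced on its own.
class_out=$(printf '%s\n' "$sample" | awk '{ gsub("[</a>]", "_"); print }')
printf '%s\n' "$class_out"     # prints: _ l_mb ____ t_il

# Literal sequence: only the exact text "</a>" is replaced.
literal_out=$(printf '%s\n' "$sample" | awk '{ gsub("</a>", "_"); print }')
printf '%s\n' "$literal_out"   # prints: a lamb _ tail
```

The regexes are passed as strings here so that the '/' does not need escaping inside an awk /.../ literal.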

Your question isn't clear, but maybe this is what you're trying to do?

$ awk '{sub(/<\/a>.*/,""); sub(/.*>/,"")} NF' file 
XX:The quick brown fox jumped over the lazy 
YY:Jack and Jill went up the hill 
ZZ: Mary had a little lamb 

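Or, with GNU awk for the third argument to match(), to print whatever is between <a...> and </a> (assuming one per line), regardless of what it is: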

$ awk 'match($0,/.*<a[^>]*>(.*)<\/a>.*/,a){print a[1]}' file 
XX:The quick brown fox jumped over the lazy 
YY:Jack and Jill went up the hill 
ZZ: Mary had a little lamb 
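The three-argument match() is a GNU awk extension. Where only a POSIX awk is available, roughly the same extraction can be sketched with the two-argument match(), which sets RSTART and RLENGTH. The sample input below is a shortened, made-up version of the question's lines, and the sketch assumes one <a ...>...</a> element per line with no nested tags inside it.

```shell
# Two shortened sample lines in the shape shown in the question, plus a blank line.
input='<li><img src="img/a.png" alt="x"> - <a href "tcc_1111.html">XX:The quick brown fox</a> -<img src="img/b.png" alt="y"> - user

<li><img src="img/a.png" alt="x"> - <a href "tcc_1111.html">ZZ: Mary had a little lamb</a> -<img src="img/b.png" alt="y"> - user'

extracted=$(printf '%s\n' "$input" | awk '
match($0, /<a[^>]*>[^<]*<\/a>/) {
    s = substr($0, RSTART, RLENGTH)   # the whole <a ...>text</a> element
    sub(/^<a[^>]*>/, "", s)           # strip the opening tag
    sub(/<\/a>$/, "", s)              # strip the closing tag
    print s
}')
printf '%s\n' "$extracted"
```

Lines without a match, including blank lines, print nothing because match() returns 0 for them.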

And the same extraction in any sed:

$ sed -n 's/.*<a[^>]*>\(.*\)<\/a>.*/\1/p' file 
XX:The quick brown fox jumped over the lazy 
YY:Jack and Jill went up the hill 
ZZ: Mary had a little lamb 

0

I guess this Perl one-liner will do (it looks like you are on Linux):

perl -lne 'print $1 if m{>((XX|YY|ZZ):[^<]*)}' 
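The comments on the question note that the prefix before the colon varies, so hard-coding XX|YY|ZZ may be too narrow. As a hedged sketch of the same pattern idea using grep instead of Perl (a swapped-in tool, not the answerer's method), the following assumes the prefix is always exactly two uppercase letters followed by a colon, and that the text contains no '<'. The -o option is supported by GNU and BSD grep but is not required by POSIX. The sample input is a shortened, made-up version of the question's lines.

```shell
# Two shortened sample lines in the shape shown in the question.
input='<li><img src="img/a.png" alt="x"> - dep[(0:0)]srch[?] - <a href "tcc_1111.html">XX:The quick brown fox</a> -<img src="img/b.png" alt="y"> - user
<li><img src="img/a.png" alt="x"> - dep[(0:0)]srch[?] - <a href "tcc_1111.html">ZZ: Mary had a little lamb</a> -<img src="img/b.png" alt="y"> - user'

# -o prints only the matched text; -E enables the {2} interval syntax.
# Assumption: the prefix is exactly two uppercase letters followed by a colon.
matches=$(printf '%s\n' "$input" | grep -oE '[[:upper:]]{2}:[^<]*')
printf '%s\n' "$matches"
```

If the prefix can have a different length or character set, the bracket expression and the interval would need to be adjusted accordingly.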

+0

Thanks for all the replies. I will try all of them and post an update. Many thanks. – niknak

1

If your Input_file is the same as the sample shown, the following may also help you.

awk -F"\">|</a>" 'NF{print $4}' Input_file 

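Explanation: make "> or </a> the field separator (which evidently yields what the OP needs :)). NF ensures that empty lines are skipped. With these two field separators set, the 4th field is the one the OP asked for. The following command prints every field with its number, which shows why the 4th column is the one to pick.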

awk -F"\">|</a>" '{for(i=1;i<=NF;i++){print i,$i}}' Input_file 
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive 
2 <img src="img/in-event-40x40.png" alt="event 
3 - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html 
4 XX:The quick brown fox jumped over the lazy 
5 -<img src= "img/config-40x40.png" alt="config 
6 <img src="img/validate-40x50.png" alt="validate 
7 - user 
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive 
2 <img src="img/in-event-40x40.png" alt="event 
3 - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html 
4 YY:Jack and Jill went up the hill 
5 -<img src= "img/config-40x40.png" alt="config 
6 <img src="img/validate-40x50.png" alt="validate 
7 - user 
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive 
2 <img src="img/in-event-40x40.png" alt="event 
3 - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html 
4 ZZ: Mary had a little lamb 
5 -<img src= "img/config-40x40.png" alt="config 
6 <img src="img/validate-40x50.png" alt="validate 
7 - user 

I hope this helps.

+3

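In shell, you should use single quotes around strings unless you need double quotes (e.g. to let a variable expand). If you follow that rule, a nice side benefit in this case is that you don't need to escape the double quote when setting FS: instead of -F"\">|</a>" you would write -F'">|</a>'. –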
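A minimal sketch of the single-quoted form suggested in this comment, run against one shortened, made-up line in the shape of the question's input (it assumes, as the answer does, that three "> sequences precede the link text):

```shell
# One shortened sample line; each attribute before the link ends in ">.
line='<li><img src="img/a.png" alt="x"> <img src="img/b.png" alt="y"> - dep[?] - <a href "tcc_1111.html">YY:Jack and Jill</a> -<img src="img/c.png" alt="z"> - user'

# Single quotes around the separator: the double quote needs no backslash.
field=$(printf '%s\n' "$line" | awk -F'">|</a>' 'NF{print $4}')
printf '%s\n' "$field"   # prints: YY:Jack and Jill
```

The separator regex is the same as in the answer; only the shell quoting changes.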