
grep/sed/awk - extract from HTML code

I want to extract a value like this from an HTML code string:

<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind: 

As the result, I only need the value "53".

How can this be done with Linux command-line tools like grep, awk, or sed? I want to use it on a Raspberry Pi.

I tried this, but it does not work:

root@raspberrypi:/home/pi# echo "<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:" >> test.txt
root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' test.txt
root@raspberrypi:/home/pi#

Would you be open to a solution that uses a proper HTML parser? This can be done with regular expressions, but it would be much better to learn to solve problems like this with something like perl/python. – 2015-04-04 17:45:58


Obligatory [don't parse (x)html with regular expressions](http://stackoverflow.com/a/1732454/7552) link. – 2015-04-04 17:57:14

Answer


Because HTML is not a flat text format, handling it with flat-text tools such as grep, sed, and awk is inadvisable. If the HTML changes slightly in format (for example, if the span node gets another attribute or a newline is inserted somewhere), anything built that way is likely to break.
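
As a rough illustration (a minimal sketch, with the second input made up for the example), a regular expression tailored to the markup quoted in the question stops matching as soon as the span gains a single extra attribute:

import re

# pattern tailored to the exact markup quoted in the question
pattern = re.compile(r'<span id="wob_hm">(\d+)%</span>')

# matches the markup as quoted: prints 53
m = pattern.search('<div>Luftfeuchte: <span id="wob_hm">53%</span></div>')
print(m.group(1) if m else None)

# after a harmless change (a hypothetical extra class attribute): prints None
m = pattern.search('<div>Luftfeuchte: <span class="c" id="wob_hm">53%</span></div>')
print(m.group(1) if m else None)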

It is more robust (if more laborious) to use something built to parse HTML. In this case, I would consider using Python, because its standard library contains a (basic) HTML parser. It could look roughly like this:

#!/usr/bin/python3 

import html.parser 
import re 
import sys 

# html.parser.HTMLParser provides the parsing functionality. It tokenizes 
# the HTML into tags and what comes between them, and we handle them in the 
# order they appear. With XML we would have nicer facilities, but HTML is not 
# a very good format, so we're stuck with this. 
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting and assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but will make this easier to adapt for other things and
    # is not more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're in the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so as 
# binary to get bytes from f.read() instead of a string (which requires 
# the data to be UTF-8-encoded) 
with open(sys.argv[1], "rb") as f: 
    # instantiate our parser 
    p = my_parser() 

    # then feed it the file. If the file is not UTF-8, it is necessary to 
    # convert the file contents to UTF-8. I'm assuming latin1-encoded 
    # data here; since the example looks German, "latin9" might also be 
    # appropriate. Use the encoding in which your data is encoded. 
    p.feed(f.read().decode("latin1")) 

    # trim (in case of newlines/spaces around the data), remove % at the end, 
    # then print 
    print(re.compile('%$').sub('', p.data.strip())) 
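
If this is saved as (say) test.py, made executable, and given the HTML file as its first argument, e.g. ./test.py test.html, it should print just 53.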

Addendum: here is a backport to Python 2 that bulldozes right over the encoding problem. For this case that is arguably better, because the encoding does not matter for the data we want to extract, and you do not have to know the encoding of the input file in advance. The changes are trivial, and it works exactly the same way:

#!/usr/bin/python 

from HTMLParser import HTMLParser 
import re 
import sys 

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f: 
    p = my_parser() 
    p.feed(f.read()) 
    print(re.compile('%$').sub('', p.data.strip())) 
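
Usage is the same as for the Python 3 version. Because the file is opened in text mode and Python 2's HTMLParser works on byte strings directly, no explicit decoding step is needed here.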

Thx for the answer, but trying this I get: root@raspberrypi:/home/pi/grep google weather# ./test.py test.html Traceback (most recent call last): File "./test.py", line 46, in <module> p.feed(f.read()) File "/usr/lib/python3.2/codecs.py", line 300, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 16022: invalid start byte – fhammer 2015-04-05 01:04:28


iso8859-encoded data, eh? See the edit. It's a small change that converts the file contents before passing them to html.parser.HTMLParser, which apparently wants UTF-8 in Python 3. I may come back later and port it to Python 2, which I think handles this more gracefully, but I need sleep before that. – Wintermute 2015-04-05 01:25:19


Eh, I did the Python 2 backport right away after all. It turned out to require almost no changes, and Python 2's `HTMLParser` has the nice property (for this case) of not caring about encodings. Honestly, I'm a bit annoyed that it was removed from Python 3 without a replacement. – Wintermute 2015-04-05 01:35:14