grep/sed/awk - extract a string from HTML code


I want to extract a value from HTML code like this:

<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:

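As the result, I only need the value: "53"

How can this be done with Linux command-line tools such as grep, awk or sed? I want to use it on a Raspberry Pi...

I tried this, but it does not work: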

root@raspberrypi:/home/pi# echo "<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:" >> test.txt
root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' test.txt
root@raspberrypi:/home/pi#

Source

2015-04-04 fhammer

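Are you open to a solution that uses a proper HTML parser? This can be done with regular expressions, but learning to use something like perl/python to solve this kind of problem would be much better. – 2015-04-04 17:45:58

Obligatory [don't parse (X)HTML with regular expressions](http://stackoverflow.com/a/1732454/7552) link. – 2015-04-04 17:57:14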

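Because HTML is not a flat text format, processing it with flat-text tools such as grep, sed or awk is not advisable. If the format of the HTML changes slightly (for example, if the span node gets another attribute or a newline is inserted somewhere), anything built this way is liable to break.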
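To illustrate how fragile this is, here is a minimal sketch in Python that applies the regex from the question to the original line and to two slightly altered variants. The variants and the helper name `extract` are made up for illustration, and a capture group stands in for `\K`, which Python's `re` module does not support.

```python
import re

# The pattern from the question, with a capture group instead of \K.
pattern = re.compile(r'<span id="wob_hm">([0-9]+)(?=%</span>)')

original = '<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:'
# Hypothetical variants of the same markup (not taken from the real page).
extra_attribute = '<div>Luftfeuchte: <span class="x" id="wob_hm">53%</span></div>'
line_break = '<div>Luftfeuchte: <span\nid="wob_hm">53%</span></div>'


def extract(html):
    """Return the captured digits, or None if the pattern does not match."""
    match = pattern.search(html)
    return match.group(1) if match else None


print(extract(original))         # matches the exact markup
print(extract(extra_attribute))  # an added attribute breaks the match
print(extract(line_break))       # so does a newline inside the tag
```

Both variants are still the same element as far as a browser or an HTML parser is concerned, yet the regex no longer finds the value.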

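It is more robust (if more laborious) to use something that is built to parse HTML. In this case, I would consider using Python because it has a (rudimentary) HTML parser in its standard library. It could look roughly like this: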

#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting and assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but will make this easier to adapt for other things and
    # is not more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if self.depth > 0:
                # a span nested inside the one we want
                self.depth += 1
            elif ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 1

    # handle end tags. Only count them while inside the wanted span, so the
    # depth counter is positive only as long as we're in the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so as
# binary to get bytes from f.read() instead of a string (text mode would
# decode with the default encoding and fail on data that does not fit it)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()

    # then feed it the file. The parser wants a string, so the bytes have to
    # be decoded first. I'm assuming latin1-encoded data here; since the
    # example looks German, "latin9" might also be appropriate. Use the
    # encoding in which your data is encoded.
    p.feed(f.read().decode("latin1"))

    # trim (in case of newlines/spaces around the data), remove % at the end,
    # then print
    print(re.compile('%$').sub('', p.data.strip()))

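Addendum: Here is a backport to Python 2 that bulldozes right over the encoding problems. For this case that is arguably nicer, because the encoding does not matter for the data we want to extract, and you don't have to know the input file's encoding in advance. The changes are trivial, and it works exactly the same way: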

#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if self.depth > 0:
                # a span nested inside the one we want
                self.depth += 1
            elif ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 1

    def handle_endtag(self, tag):
        # only count end tags while inside the wanted span
        if tag == 'span' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())
    print(re.compile('%$').sub('', p.data.strip()))
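For Python 3, a similar "don't care about the encoding" effect can probably be approximated by decoding with an error handler, so that undecodable bytes are replaced instead of raising an exception. This is only a sketch of that alternative, not part of the scripts above, and the sample bytes are hypothetical.

```python
# Hypothetical latin1-encoded snippet: "für" contains the byte 0xfc,
# which is not valid UTF-8.
raw = b'<p>f\xfcr</p><span id="wob_hm">53%</span>'

# errors="replace" turns each undecodable byte into U+FFFD instead of raising
# UnicodeDecodeError. The ASCII parts, including the span we care about,
# come through unchanged.
text = raw.decode("utf-8", errors="replace")

print(text)
```

The resulting string could then be passed to `p.feed()`. This only makes sense when, as here, the wanted value is plain ASCII; any non-ASCII text that is decoded this way is lost.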

Source

2015-04-04 18:56:17 Wintermute

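Thx for the answer, but when I try this I get:

root@raspberrypi:/home/pi/grep google weather# ./test.py test.html
Traceback (most recent call last):
  File "./test.py", line 46, in <module>
    p.feed(f.read())
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 16022: invalid start byte

– fhammer 2015-04-05 01:04:28

iso8859-encoded data, eh? See the edit. It is a small change: the file contents are decoded before they are passed to html.parser.HTMLParser, which apparently requires UTF-8 in Python 3. I might come back later and port it to Python 2, which I think would handle this more gracefully, but I need sleep before that. – Wintermute 2015-04-05 01:25:19

Eh, I did the Python 2 backport right away. It turned out to need almost no changes, and Python 2's `HTMLParser` has the (for this case) nice property of not caring about encodings. Honestly, I'm a bit annoyed that this was removed in Python 3 without a replacement. – Wintermute 2015-04-05 01:35:14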
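The error reported in the comments can be reproduced in isolation. The sketch below assumes a short, made-up byte string in place of the downloaded page; the byte 0xfc is "ü" in ISO 8859-1 (latin1) but cannot start a UTF-8 sequence.

```python
# Hypothetical sample bytes standing in for the downloaded page.
raw = b"f\xfcr"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("utf-8 failed:", err.reason)

# latin1 maps every byte value to a character, so this decode cannot fail.
text = raw.decode("latin1")
print(text)
```

Because latin1 never raises, it is a safe default when only ASCII content is of interest, although non-ASCII characters will be wrong if the file is actually in a different encoding.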
