因为HTML是不平坦的文本格式,包括平面的文字工具,如grep
处理它, sed
或awk
是不可取的。如果HTML的格式略有变化(例如:如果span
节点获得另一个属性或者在某处插入换行符),那么以这种方式构建的任何内容都有可能中断。
它更健壮(如果更费力)使用构建解析HTML的东西。在这种情况下,我会考虑使用Python,因为它的标准库中有一个(基本的)HTML解析器。它可能看起来大致是这样的:
#!/usr/bin/python3
import html.parser
import re
import sys
# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
def __init__(self):
super(my_parser, self).__init__(self)
self.data = ''
self.depth = 0
# handle opening tags. Start counting, assembling content when a
# span tag begins whose id is "wob_hm". A depth counter is maintained
# largely to handle nested span tags, which is not strictly necessary
# in your case (but will make this easier to adapt for other things and
# is not more complicated to implement than a flag)
def handle_starttag(self, tag, attrs):
if tag == 'span':
if ('id', 'wob_hm') in attrs:
self.data = ''
self.depth = 0
self.depth += 1
# handle end tags. Make sure the depth counter is only positive
# as long as we're in the span tag we want
def handle_endtag(self, tag):
if tag == 'span':
self.depth -= 1
# when data comes, assemble it in a string. Note that nested tags would
# not be recorded by this if they existed. It would be more work to
# implement that, and you don't need it for this.
def handle_data(self, data):
if self.depth > 0:
self.data += data
# open the file whose name is the first command line argument. Do so as
# binary to get bytes from f.read() instead of a string (which requires
# the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
# instantiate our parser
p = my_parser()
# then feed it the file. If the file is not UTF-8, it is necessary to
# convert the file contents to UTF-8. I'm assuming latin1-encoded
# data here; since the example looks German, "latin9" might also be
# appropriate. Use the encoding in which your data is encoded.
p.feed(f.read().decode("latin1"))
# trim (in case of newlines/spaces around the data), remove % at the end,
# then print
print(re.compile('%$').sub('', p.data.strip()))
附录:这里有一个反向移植到Python 2中bulldozes就在编码问题。对于这种情况,这可以说是更好,因为编码对于我们想要提取的数据无关紧要,并且您不必事先知道输入文件的编码。这些变化是微不足道的,它的工作方式是完全相同的:
#!/usr/bin/python
from HTMLParser import HTMLParser
import re
import sys
class my_parser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self)
self.data = ''
self.depth = 0
def handle_starttag(self, tag, attrs):
if tag == 'span':
if ('id', 'wob_hm') in attrs:
self.data = ''
self.depth = 0
self.depth += 1
def handle_endtag(self, tag):
if tag == 'span':
self.depth -= 1
def handle_data(self, data):
if self.depth > 0:
self.data += data
with open(sys.argv[1], "r") as f:
p = my_parser()
p.feed(f.read())
print(re.compile('%$').sub('', p.data.strip()))
您是否愿意使用适当的HTML解析器解决方案?这可以使用正则表达式,但是学习使用类似perl/python的东西来解决这些问题会好得多。 – 2015-04-04 17:45:58
Obligatory [不要使用正则表达式解析(x)html](http://stackoverflow.com/a/1732454/7552)链接。 – 2015-04-04 17:57:14