Extracting data from an HTML table

I am looking for a way to get certain information from HTML in a Linux shell environment.

This is the bit that I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
    <th>Tests</th> 
    <th>Failures</th> 
    <th>Success Rate</th> 
    <th>Average Time</th> 
    <th>Min Time</th> 
    <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
</table>

And I want to store these in shell variables, or echo them as key-value pairs extracted from the HTML above. For example:

Tests   : 103 
Failures  : 24 
Success Rate : 76.70 % 
and so on..

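What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.

But using Java here seems like overhead, since the runnable jar would have to be included in the "wrapper" script you want to execute.

I'm sure there must be "shell" languages out there that can do the same, e.g. Perl, Python, bash, etc.

My problem is that I have zero experience with these. Can somebody help me resolve this "fairly easy" issue?

Quick update:

I forgot to mention that my HTML document has got more tables and more rows. Sorry about that (early morning).

Update #2:

Tried to install Bsoup like this, since I don't have root access: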

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz 
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz 
$ cp -r beautifulsoup4-4.1.0/bs4 . 
$ vi htmlParse.py # (paste code from) Tichodromas' answer, just in case this (http://pastebin.com/4Je11Y9q) is what I pasted 
$ run file (python htmlParse.py)

Error:

$ python htmlParse.py 
Traceback (most recent call last): 
    File "htmlParse.py", line 1, in ? 
    from bs4 import BeautifulSoup 
    File "/home/gdd/setup/py/bs4/__init__.py", line 29 
    from .builder import builder_registry 
     ^
SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives this error:

Traceback (most recent call last): 
    File "test.py", line 27, in ? 
    headings = [th.get_text() for th in table.find("tr").find_all("th")] 
TypeError: 'NoneType' object is not callable

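Any ideas?

Source

2012-08-03 Gandalf StormCrow

There is a nice Python library that can help: BeautifulSoup -> http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – 2012-08-03 06:53:05

@Jakob S. Thanks for your comment. As I told you, I'm a newbie, so I downloaded the tarball and tried to install it with 'python setup.py install' and got this permission error: "error: could not create '/usr/lib/python2.4/site-packages/bs4': Permission denied". How can I specify the directory to install it in? Is there something similar to '-prefix', as when installing other commands? – 2012-08-03 07:06:28

I have to admit that I don't know how to achieve this if you don't have root access, and I don't have a Linux machine at hand right now. In principle, it should be possible to simply copy the package into the right directory relative to your source .py file so that the interpreter can find it. – 2012-08-03 07:14:36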
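
The idea in the last comment, placing the package where the interpreter can find it, can also be done explicitly by adding a user-writable directory to `sys.path` before the import. A minimal sketch, assuming a hypothetical directory `~/pylibs` that holds the copied `bs4` folder (the path and the `add_local_packages` helper are illustrative and not taken from the original posts):

```python
import os
import sys


def add_local_packages(directory):
    """Put `directory` at the front of sys.path so packages copied there can be imported."""
    directory = os.path.abspath(os.path.expanduser(directory))
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# Hypothetical location; no root access is needed for a directory under $HOME.
add_local_packages("~/pylibs")

# After this, `from bs4 import BeautifulSoup` would search ~/pylibs first,
# provided the copied bs4 version supports the running interpreter.
```

Setting the `PYTHONPATH` environment variable to the same directory has a similar effect without touching the script.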

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: using class="details" to select the table):

from bs4 import BeautifulSoup 

html = """ 
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
     <th>Tests</th> 
     <th>Failures</th> 
     <th>Success Rate</th> 
     <th>Average Time</th> 
     <th>Min Time</th> 
     <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
</table>""" 

soup = BeautifulSoup(html) 
table = soup.find("table", attrs={"class":"details"}) 

# The first tr contains the field names. 
headings = [th.get_text() for th in table.find("tr").find_all("th")] 

datasets = [] 
for row in table.find_all("tr")[1:]: 
    dataset = zip(headings, (td.get_text() for td in row.find_all("td"))) 
    datasets.append(dataset) 

print datasets

The result looks like this:

[[(u'Tests', u'103'), 
    (u'Failures', u'24'), 
    (u'Success Rate', u'76.70%'), 
    (u'Average Time', u'71 ms'), 
    (u'Min Time', u'0 ms'), 
    (u'Max Time', u'829 ms')]]

EDIT2: To produce the desired output, use something like this:

for dataset in datasets: 
    for field in dataset: 
        print "{0:<16}: {1}".format(field[0], field[1])

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms

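Source

2012-08-03 07:15:55

Thanks for your answer. Replying to your comment above: can I use the class as an identifier? I don't have an ID. The class would be 'details'. – 2012-08-03 07:41:00

@GandalfStormCrow Yes, that can be done. I have edited my answer. – 2012-08-03 07:46:26

Does this answer really work with Python 2.4? @Gandalf, you said in a comment that you had installed "an older version of bsoup" (BeautifulSoup 3, I assume). The line saying "I'm using Python 2.4.3" has disappeared. So this is a bit confusing. – mzjn 2012-08-03 11:18:12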

undef $/; 
$text = <DATA>; 

@tabs = $text =~ m!<table.*?>(.*?)</table>!gms; 
for (@tabs) { 
    @th = m!<th>(.*?)</th>!gms; 
    @td = m!<td>(.*?)</td>!gms; 
} 
for $i (0..$#th) { 
    printf "%-16s\t: %s\n", $th[$i], $td[$i]; 
} 

__DATA__ 
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
<tr valign="top"> 
<th>Tests</th> 
<th>Failures</th> 
<th>Success Rate</th> 
<th>Average Time</th> 
<th>Min Time</th> 
<th>Max Time</th> 
</tr> 
<tr valign="top" class="Failure"> 
<td>103</td> 
<td>24</td> 
<td>76.70%</td> 
<td>71 ms</td> 
<td>0 ms</td> 
<td>829 ms</td> 
</tr> 
</table>

Output as follows:

Tests                   : 103
Failures                : 24
Success Rate            : 76.70%
Average Time            : 71 ms
Min Time                : 0 ms
Max Time                : 829 ms

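Source

2012-08-03 06:56:47 cdtits

I suggest [using an XML parser](http://stackoverflow.com/a/1732454/647772). – 2012-08-03 06:57:29

@cdtits Thanks for your response. Will it work if my file contains multiple tables? – 2012-08-03 07:06:53

A Python solution that uses only the standard library (it takes advantage of the fact that the HTML happens to be well-formed XML). More than one row of data can be handled.

(Tested with Python 2.6 and 2.7. The question has been updated to say that the OP uses Python 2.4, so this answer may not be very useful in that case. ElementTree was added in Python 2.5.)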

from xml.etree.ElementTree import fromstring 

HTML = """ 
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
    <th>Tests</th> 
    <th>Failures</th> 
    <th>Success Rate</th> 
    <th>Average Time</th> 
    <th>Min Time</th> 
    <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
    <tr valign="top" class="whatever"> 
    <td>A</td> 
    <td>B</td> 
    <td>C</td> 
    <td>D</td> 
    <td>E</td> 
    <td>F</td> 
    </tr> 
</table>""" 

tree = fromstring(HTML) 
rows = tree.findall("tr") 
headrow = rows[0] 
datarows = rows[1:] 

for num, h in enumerate(headrow): 
    data = ", ".join([row[num].text for row in datarows]) 
    print "{0:<16}: {1}".format(h.text, data)

Output:

Tests           : 103, A
Failures        : 24, B
Success Rate    : 76.70%, C
Average Time    : 71 ms, D
Min Time        : 0 ms, E
Max Time        : 829 ms, F
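
Since the question mentions that the real document holds several tables, the same standard-library approach can be narrowed to one table by checking its `class` attribute. A minimal sketch for Python 3, assuming the document is well-formed XML; the `extract_table` helper and the sample document with its second `summary` table are made up for illustration:

```python
from xml.etree.ElementTree import fromstring

DOC = """
<html><body>
<table class="summary">
    <tr><th>Name</th></tr>
    <tr><td>ignored</td></tr>
</table>
<table class="details">
    <tr><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
    <tr><td>103</td><td>24</td><td>76.70%</td></tr>
</table>
</body></html>"""


def extract_table(xml_text, css_class):
    """Return one dict per data row of the first table whose class equals css_class."""
    root = fromstring(xml_text)
    for table in root.iter("table"):
        if table.get("class") != css_class:
            continue
        rows = table.findall("tr")
        # The first row is assumed to hold the headings.
        headings = [(th.text or "").strip() for th in rows[0].findall("th")]
        return [
            dict(zip(headings, [(td.text or "").strip() for td in row.findall("td")]))
            for row in rows[1:]
        ]
    return []  # no matching table found


for record in extract_table(DOC, "details"):
    for key, value in record.items():
        print("{0:<16}: {1}".format(key, value))
```

To read from a file instead of a string, `xml.etree.ElementTree.parse(path).getroot()` can take the place of `fromstring`.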

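Source

2012-08-03 07:39:27 mzjn

Thanks for your answer. Rather than reading from a specific HTML string, can I specify something like: from this HTML file, get the table that has 'class="details"' and do what you have just done? – 2012-08-03 07:42:30

This only works for *one* row containing 'td's. – 2012-08-03 07:49:26

Now it can handle multiple data rows. I have tested it with Python 2.6 and 2.7, but now I see that you use 2.4.3 (which I don't have). So it may not help you. Anyway, I wanted to show that it is possible to do this kind of thing without additional libraries. – mzjn 2012-08-03 08:56:13

Assuming your HTML code is stored in a file named mycode.html, here is a bash way: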

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

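Note: the output is not perfectly aligned.

Source

2012-08-03 07:53:37

Thanks for your answer. I need to get a specific table; there are multiple tables. – 2012-08-03 07:59:46

I've heard that parsing HTML or XML with regular expressions is broken by definition. – ychaouche 2014-01-12 14:36:44

Here is the top answer, adapted for Python 3 compatibility and improved by stripping whitespace in cells: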

from bs4 import BeautifulSoup 

html = """ 
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
     <th>Tests</th> 
     <th>Failures</th> 
     <th>Success Rate</th> 
     <th>Average Time</th> 
     <th>Min Time</th> 
     <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
</table>""" 

soup = BeautifulSoup(html, 'html.parser') 
table = soup.find("table") 

# The first tr contains the field names. 
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")] 

print(headings) 

datasets = [] 
for row in table.find_all("tr")[1:]: 
    dataset = dict(zip(headings, (td.get_text() for td in row.find_all("td")))) 
    datasets.append(dataset) 

print(datasets)
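
To get values like these into shell variables, as the question asks, one option is to have the Python script print shell-quoted assignments that the calling script then evaluates. A minimal sketch for Python 3, assuming a record shaped like the dictionaries built above; the `to_shell_assignments` helper and the variable-naming rule are illustrative choices, not part of the original answer:

```python
import re
import shlex


def to_shell_assignments(record):
    """Turn {'Min Time': '0 ms'} into lines like MIN_TIME='0 ms'."""
    lines = []
    for key, value in record.items():
        # Shell variable names allow only letters, digits and underscores.
        name = re.sub(r"\W+", "_", key.strip()).upper()
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines


record = {"Tests": "103", "Failures": "24", "Min Time": "0 ms"}
print("\n".join(to_shell_assignments(record)))
```

In the calling shell script, something like `eval "$(python3 htmlParse.py)"` would then define the variables, assuming the script prints nothing but these lines and no heading starts with a digit.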

Source

2017-05-31 04:07:55

Here is a Python regex-based solution that I have tested on Python 2.7. It doesn't rely on the xml module, so it also works when the XML is not well-formed.

import re 
# input args: html string 
# output: tables as a list, column max length 
def extract_html_tables(html): 
    tables=[] 
    maxlen=0 
    rex1=r'<table.*?/table>' 
    rex2=r'<tr.*?/tr>' 
    rex3=r'<(td|th).*?/(td|th)>' 
    s = re.search(rex1,html,re.DOTALL) 
    while s: 
        t = s.group() # the table 
        s2 = re.search(rex2,t,re.DOTALL) 
        table = [] 
        while s2: 
            r = s2.group() # the row 
            s3 = re.search(rex3,r,re.DOTALL) 
            row=[] 
            while s3: 
                d = s3.group() # the cell 
                #row.append(strip_tags(d).strip()) 
                row.append(d.strip()) 

                r = re.sub(rex3,'',r,1,re.DOTALL) 
                s3 = re.search(rex3,r,re.DOTALL) 

            table.append(row) 
            if maxlen<len(row): 
                maxlen = len(row) 

            t = re.sub(rex2,'',t,1,re.DOTALL) 
            s2 = re.search(rex2,t,re.DOTALL) 

        html = re.sub(rex1,'',html,1,re.DOTALL) 
        tables.append(table) 
        s = re.search(rex1,html,re.DOTALL) 
    return tables, maxlen 

html = """ 
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
     <th>Tests</th> 
     <th>Failures</th> 
     <th>Success Rate</th> 
     <th>Average Time</th> 
     <th>Min Time</th> 
     <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
</table>""" 
print extract_html_tables(html)
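
The commented-out `strip_tags` call is not defined in the answer, and without it each cell keeps its surrounding `<td>`/`<th>` markup. A minimal regex-based stand-in is sketched below; it is an assumption about what that helper was meant to do, not the author's code:

```python
import re

# Matches anything that looks like an opening or closing tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove tag-like substrings and trim surrounding whitespace."""
    return TAG_RE.sub("", fragment).strip()


print(strip_tags("<td>71 ms</td>"))
```

This does not decode entities such as `&amp;`; in Python 3, `html.unescape` could be applied to the result if that matters.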

Source

2017-10-05 03:35:53 paolov
