Extracting multi-line data from an HTML website using Python

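So I'm fine as long as what I'm matching doesn't span more than one line; as soon as it spans more than one line, I get heartburn (or so it seems)... Here is a snippet of the HTML data I get: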

<tr> 
<td width=20%>3 month 
<td width=1% class=bar> 
&nbsp; 
<td width=1% nowrap class="value chg">+10.03% 
<td width=54% class=bar> 
<table width=100% cellpadding=0 cellspacing=0 class=barChart> 
<tr>

I'm interested in the "+10.03%" number, and

<td width=20%>3 month

is the pattern that tells me this "+10.03%" is the one I want.

So this is what I have in Python so far:

percent = re.search('<td width=20%>3 month\r\n<td width=1% class=bar>\r\n&nbsp;\r\n<td width=1% nowrap class="value chg">(.*?)', content)

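where the variable content holds all of the HTML I'm searching through. This doesn't seem to work for me... any advice would be greatly appreciated! I've read a few other posts that talk about re.compile() and re.MULTILINE, but I haven't had any luck with them, mostly because I don't understand how they work, I guess...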
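A quick way to see why the attempt above fails is to run it against both line-ending styles. The sketch below rebuilds the first four lines of the snippet by hand (the names ORIGINAL, LINES, unix_content and windows_content are made up for this illustration). It suggests two separate problems: the hard-coded "\r\n" never matches text that uses plain "\n", and even when the line breaks do match, a lazy group (.*?) at the very end of a pattern is satisfied by zero characters, so it captures an empty string.

```python
import re

# The asker's pattern: hard-coded "\r\n" line breaks and a trailing lazy group.
ORIGINAL = (
    r'<td width=20%>3 month\r\n<td width=1% class=bar>\r\n&nbsp;\r\n'
    r'<td width=1% nowrap class="value chg">(.*?)'
)

# First four lines of the snippet, rebuilt by hand for this sketch.
LINES = [
    '<td width=20%>3 month',
    '<td width=1% class=bar>',
    '&nbsp;',
    '<td width=1% nowrap class="value chg">+10.03%',
]
unix_content = "\n".join(LINES)        # assumption: Unix line endings
windows_content = "\r\n".join(LINES)   # assumption: Windows line endings

# With "\n" line endings the literal "\r\n" in the pattern cannot match.
no_match = re.search(ORIGINAL, unix_content)

# With "\r\n" the pattern matches, but the lazy group stops immediately.
empty_match = re.search(ORIGINAL, windows_content)

print(no_match)                    # None
print(repr(empty_match.group(1)))  # ''
```

Replacing the lazy group with something that has to consume characters, such as (\S+), and the literal line breaks with \s* addresses both points.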

Source

2013-10-09 skbeez

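Don't use regular expressions to parse HTML. It always ends in heartache. – tehsockz

Don't use regular expressions to parse HTML. It's a bad idea because it can get complicated very quickly. Use something like HTMLParser (http://docs.python.org/2/library/htmlparser.html).

So I tried HTMLParser, but BeautifulSoup seems to work better... (HTMLParser returned a bad-tag error), but I'm a bit confused about how to get it to search for my 10.03% number.. I searched – skbeez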
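For readers who want to try the standard-library route suggested in the comments, here is a minimal sketch using Python 3's html.parser module (the successor of the Python 2 HTMLParser module linked above). The class name ValueChgCollector and the variable names are hypothetical, and the approach assumes, as in the snippet from the question, that the page never closes its <td> tags, so each new <td> is treated as the end of the previous cell.

```python
from html.parser import HTMLParser


class ValueChgCollector(HTMLParser):
    """Collect the text that follows each <td class="value chg"> tag."""

    def __init__(self):
        super().__init__()
        self.values = []       # one string per matching cell
        self._capture = False  # True while inside a matching cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Assumption: cells are never closed, so a new <td> ends the old one.
            self._capture = dict(attrs).get("class") == "value chg"
            if self._capture:
                self.values.append("")

    def handle_data(self, data):
        if self._capture:
            self.values[-1] += data


# The HTML snippet from the question, rebuilt with "\n" line endings.
html = (
    '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n'
    '<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'
)

parser = ValueChgCollector()
parser.feed(html)
parser.close()

values = [v.strip() for v in parser.values]
print(values)  # ['+10.03%']
```

On a full page this would return every "value chg" cell in document order, so the caller still has to decide which entries are the ones of interest.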

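Thanks everyone for the help! You pointed me in the right direction, and this is how I got my code working with BeautifulSoup. I noticed that all the data I wanted sits under a class called "value chg", and that my data is always the 3rd and 5th elements of the search results, so this is what I did: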

from BeautifulSoup import BeautifulSoup 
import urllib 

content = urllib.urlopen(url).read() 
soup = BeautifulSoup(''.join(content)) 

td_list = soup.findAll('td', {'class':'value chg'}) 

mon3 = td_list[2].text.encode('ascii','ignore') 
yr1 = td_list[4].text.encode('ascii','ignore')

Again, "content" is the HTML I downloaded.
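The code above targets Python 2 and BeautifulSoup 3. A rough equivalent for Python 3, assuming the bs4 package (Beautiful Soup 4) is installed, might look like the sketch below. It is not the answer's exact method: instead of relying on fixed positions in the result list, it looks each value up by its row label. The sample HTML is a simplified, well-formed stand-in for the real page, and the "1 year" row and its "+0.00%" value are placeholders invented for the illustration. The real page leaves its <td> tags unclosed, and different parsers may repair that differently, so the lookup would need checking against the actual download.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed stand-in for the downloaded page.
# The "1 year" row and its value are placeholders, not real data.
html = """
<table>
<tr><td>3 month</td><td class="value chg">+10.03%</td></tr>
<tr><td>1 year</td><td class="value chg">+0.00%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each row label (first cell) to the text of its "value chg" cell.
changes = {}
for row in soup.find_all("tr"):
    label = row.find("td")
    value = row.find("td", class_="value chg")
    if label is not None and value is not None:
        changes[label.get_text(strip=True)] = value.get_text(strip=True)

print(changes["3 month"])  # +10.03%
```

Looking values up by label keeps working if the site adds or reorders rows, which index-based access such as td_list[2] would not.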

Source

2013-10-09 07:15:12 skbeez

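You need to add the "multiline" regex switch (?m). You can extract the target content directly with findall(regex, content)[0], that is, by calling findall and taking the first element of the matches: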

percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]

By using \s* to match the line breaks, the regex is compatible with both UNIX and Windows style line terminators.

See the live demo of the following test code:

import re 
content = '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'   
percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0] 
print(percent)

Output:

+10.03%
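The line-terminator claim can be checked with a small sketch that runs the same pattern against "\n" and "\r\n" versions of the snippet. The helper extract_percent and the other names are made up for this illustration; it uses re.search and returns None when nothing matches, rather than indexing into findall, which would raise IndexError on a miss. One detail worth knowing: in Python, (?m) only changes how ^ and $ behave, and this pattern uses neither, so the sketch gets the same result with and without the flag. It is \s* that bridges the line breaks.

```python
import re

# Same pattern as in the answer, without the (?m) prefix.
PATTERN = (
    r'<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*'
    r'<td width=1% nowrap class="value chg">(\S+)'
)

LINES = [
    '<tr>',
    '<td width=20%>3 month',
    '<td width=1% class=bar>',
    '&nbsp;',
    '<td width=1% nowrap class="value chg">+10.03%',
    '<td width=54% class=bar>',
]


def extract_percent(content, pattern=PATTERN):
    """Return the captured value, or None if the pattern does not match."""
    match = re.search(pattern, content)
    return match.group(1) if match else None


unix_result = extract_percent("\n".join(LINES))
windows_result = extract_percent("\r\n".join(LINES))
flagged_result = extract_percent("\n".join(LINES), "(?m)" + PATTERN)
missing_result = extract_percent("<tr><td>no such row")

print(unix_result, windows_result, flagged_result, missing_result)
# +10.03% +10.03% +10.03% None
```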

Source

2013-10-09 13:38:19 Bohemian
