2017-02-24 39 views
1

Getting data from an HTML page into a Python array

I am trying to get some data from a web page like this:

[ . . . ] 

<p class="special-large">Lorem Ipsum 01</p> 
<p class="special-large">Lorem Ipsum 02</p> 
<p class="special-large">Lorem Ipsum 03</p> 
<p class="special-large">Lorem Ipsum 04</p> 
<p class="special-large">Lorem Ipsum 05</p> 

[ . . . ] 

I would like to end up with a Python array like the following:

myArrayWebPage = ["Lorem Ipsum 01","Lorem Ipsum 02","Lorem Ipsum 03","Lorem Ipsum 04","Lorem Ipsum 05"] 

This is my Python script:

import urllib.request 

urlAddress = "http:// ... /" # my url address 
getPage = urllib.request.urlopen(urlAddress) 
outputPage = getPage.read() 
print(outputPage) 

How can I get that array from "outputPage"?

Answers

1

This seems to do what you want:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32 
Type "copyright", "credits" or "license()" for more information. 
>>> html = '''<p class="special-large">Lorem Ipsum 01</p> 
<p class="special-large">Lorem Ipsum 02</p> 
<p class="special-large">Lorem Ipsum 03</p> 
<p class="special-large">Lorem Ipsum 04</p> 
<p class="special-large">Lorem Ipsum 05</p>''' 
>>> import re 
>>> re.findall('<p class="special-large">([^<]+)</p>', html) 
['Lorem Ipsum 01', 'Lorem Ipsum 02', 'Lorem Ipsum 03', 'Lorem Ipsum 04', 'Lorem Ipsum 05'] 
>>> 
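To apply the same pattern to the page fetched in the question, keep in mind that `read()` returns `bytes` in Python 3, so the content has to be decoded before a string pattern can match it. A minimal sketch, assuming the page is UTF-8 encoded; the helper name `extract_paragraphs` is hypothetical, and the `outputPage` value is a stand-in for the real download:

```python
import re

# Same pattern as in the session above: captures the text of each matching <p>.
PATTERN = re.compile(r'<p class="special-large">([^<]+)</p>')


def extract_paragraphs(raw_page, encoding="utf-8"):
    """Decode the bytes returned by urlopen().read() and return all matches."""
    # Assumption: the page is UTF-8; pass another encoding if it is not.
    text = raw_page.decode(encoding)
    return PATTERN.findall(text)


# Stand-in for getPage.read() from the question's script (bytes, not str).
outputPage = (b'<p class="special-large">Lorem Ipsum 01</p>\n'
              b'<p class="special-large">Lorem Ipsum 02</p>')

myArrayWebPage = extract_paragraphs(outputPage)
print(myArrayWebPage)
```

In the original script, `outputPage = getPage.read()` can be passed to the helper directly.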

Note that regular expressions are generally not the preferred tool for this kind of task. You should use a library such as Beautiful Soup instead.
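To illustrate the parser-based approach without installing anything, here is a rough sketch that uses the standard library's `html.parser` module rather than Beautiful Soup itself. The class name `SpecialLargeParser` and the sample markup are made up for the example, and it assumes the target paragraphs are not nested inside one another:

```python
from html.parser import HTMLParser


class SpecialLargeParser(HTMLParser):
    """Collects the text of every <p class="special-large"> element."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished paragraph texts
        self._buffer = None   # text chunks while inside a target <p>, else None

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "special-large" in classes:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a target paragraph.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None


html = '''<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem <b>Ipsum</b> 02</p>
<p class="other">skip me</p>'''

parser = SpecialLargeParser()
parser.feed(html)
parser.close()
print(parser.items)
```

Unlike the `[^<]+` pattern, this also copes with inline tags inside the paragraph. With Beautiful Soup installed, the equivalent should be roughly `[p.get_text() for p in BeautifulSoup(html, "html.parser").find_all("p", class_="special-large")]`.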

+0

Thanks! May I ask what you mean by "regular expressions"?

+0

You can click on the term now and the Wikipedia article will come up. Next time, try searching Google for terms you are not familiar with.

+0

@JoeHunter Please take this opportunity to read the insanely funny answer about why regular expressions are not sufficient for parsing HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags