Python: grab all the links from an HTML page and display only the links

I am trying to grab the title of a web page using the following statement:

titl1 = re.findall(r'<title>(.*?)</title>',the_webpage)

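Using this, I get ['random webpage example1']. How do I remove the quotes and brackets?

I also want to grab a set of links that change every hour (which is why I need the wildcard): links = re.findall(r'(file=(.*?).mp3)',the_webpage).

I get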

[('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
    'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
    'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
    'http://media.kickstatic.com/kickapps/images/3380/audios/944521')]

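How do I get the MP3 links without the file= prefix?

I also want to download the MP3 files and name them after the page title, so that a file would show up as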

random webpage example1.mp3

How would I do this? I am still learning Python and regular expressions, so this is a bit confusing to me.

Source

2012-08-01 jokajinx

[Regex is generally not a good candidate for parsing XML/HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). You may find [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) useful - grabbing all the links is as simple as "soup.find_all('a')". Take a look at the [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). – 2012-08-01 20:59:18

You should look at BeautifulSoup, which is better suited to URL parsing. – xbb 2012-08-01 20:59:50

Oh.. you may find this helpful for formatting your question: http://stackoverflow.com/editing-help – 2012-08-01 21:02:09
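The parser-based approach suggested in the comments above can be sketched without installing anything. This is a minimal Python 3 illustration that uses the standard library's html.parser instead of BeautifulSoup, so it is a stand-in for the idea rather than the BeautifulSoup API itself. The class name LinkCollector and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical sample page, not the asker's real HTML.
sample_html = """<html><head><title>random webpage example1</title></head>
<body><a href="http://example.com/a.mp3">one</a>
<a href="http://example.com/b.mp3">two</a></body></html>"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.title)
print(collector.links)
```

A parser like this only sees links that appear as real <a href="..."> attributes. The file=... values in the question may live inside a script or a query string, in which case a regex over the raw text may still be needed.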

At least for part 1, you can do

>>> mytitle = titl1[0] 
>>> print mytitle 
random webpage example1

re.findall returns a list of the matched strings, so you just need to grab the first item in the list.
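As a small Python 3 sketch of that point: the_webpage below is a made-up one-line page, not the real download. It shows the list returned by re.findall next to the re.search alternative, which returns a single match object or None.

```python
import re

# Hypothetical stand-in for the downloaded page source.
the_webpage = "<html><head><title>random webpage example1</title></head></html>"

# findall always returns a list, even when there is exactly one match.
titles = re.findall(r"<title>(.*?)</title>", the_webpage)
print(titles)
print(titles[0])

# search returns a match object, or None when nothing matched,
# so check the result before asking for the group.
match = re.search(r"<title>(.*?)</title>", the_webpage)
page_title = match.group(1) if match else None
print(page_title)
```

Indexing with [0] raises IndexError on a page without a title, which is one reason the re.search form with an explicit None check can be the safer choice.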

Likewise, for the second part, the regex returns a list of tuples. You can do this:

>>> download_links = [href for (discard, href) in links] 
>>> print download_links 
['http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521']
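Why tuples appear at all comes down to the number of capture groups in the pattern. The following Python 3 sketch uses an invented sample string (the example.com URLs are placeholders) to compare the two-group pattern from the question with a one-group version.

```python
import re

# Hypothetical text containing two file= parameters.
sample = "file=http://example.com/audios/1.mp3&x=1 file=http://example.com/audios/2.mp3"

# Two groups in the pattern: findall yields one 2-tuple per match.
two_groups = re.findall(r"(file=(.*?)\.mp3)", sample)
print(two_groups)

# Keep only the second element of each tuple, as in the answer above.
download_links = [href for (_whole, href) in two_groups]
print(download_links)

# One group in the pattern: findall yields plain strings, no unpacking needed.
one_group = re.findall(r"file=(.*?)\.mp3", sample)
print(one_group)
```

Both routes end with the same list of URLs without the .mp3 suffix, because the suffix sits outside the inner group.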

As for downloading the files, use urllib2 (at least for Python 2.x; I am not sure about Python 3.x). See this question for details.

Source

2012-08-01 21:09:52 Michael0x2a

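For the first part, titl1 = re.findall(r'<title>(.*?)</title>',the_webpage) returns a list, and when you print a list, Python prints the brackets and quotes. So if you are sure there is always exactly one match, try print titl1[0]. (You could also try re.search instead.)

For the second part, if you change your regex pattern from "(file=(.*?)\.mp3)" to "file=(.*?)\.mp3", you will get only the 'http://linkInThisPart/path/etc/etc' part. You will need to add the .mp3 extension back, though.

i.e.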

audio_links = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',web_page)]
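A possible variation, sketched in Python 3 on invented sample text: moving the escaped .mp3 inside the capture group avoids re-adding the extension, and dict.fromkeys drops repeated URLs while keeping their order (the output in the question shows the same link three times). The example.com URLs and the variable unique_links are assumptions for illustration.

```python
import re

# Hypothetical page fragment with one link repeated, as in the question.
web_page = (
    "file=http://example.com/audios/944521.mp3&a=1 "
    "file=http://example.com/audios/944521.mp3&a=2 "
    "file=http://example.com/audios/777.mp3"
)

# The group now ends after the escaped '.mp3', so the suffix is captured too.
audio_links = re.findall(r"file=(.*?\.mp3)", web_page)
print(audio_links)

# dict.fromkeys keeps the first occurrence of each URL, in original order.
unique_links = list(dict.fromkeys(audio_links))
print(unique_links)
```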

To download the files, you may want to look at urllib or urllib2:

import urllib2 

# Fetch one MP3 and save it under the page title (Python 2)
url='http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3' 
req=urllib2.Request(url) 
temp_file=open('random webpage example1.mp3','wb')  # 'wb' writes raw bytes
buff=urllib2.urlopen(req).read()  # read the whole response body
temp_file.write(buff) 
temp_file.close()
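For Python 3, where urllib2 no longer exists, roughly the same download can be sketched with urllib.request. The helper name download is made up for this example, and the demonstration serves a local temporary file through a file:// URL so that it runs without network access; with a real link you would pass the http:// URL instead.

```python
import pathlib
import shutil
import tempfile
import urllib.request


def download(url, destination):
    """Stream the resource at url into the file at destination."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in chunks instead of reading the whole file into memory.
        shutil.copyfileobj(response, out)
    return destination


# Offline demonstration: a local file behind a file:// URL stands in
# for the remote MP3, so nothing here touches the network.
with tempfile.TemporaryDirectory() as workdir:
    source = pathlib.Path(workdir) / "source.mp3"
    source.write_bytes(b"fake mp3 bytes")
    target = pathlib.Path(workdir) / "random webpage example1.mp3"
    download(source.as_uri(), target)
    copied = target.read_bytes()

print(copied)
```

The with statement closes both the response and the output file even if the copy fails partway, which the explicit open/close version does not guarantee.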

Source

2012-08-01 21:21:44 ffledgling

So when I use audio_links = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',web_page)], all I get back is ['', '', ''] – jokajinx 2012-08-03 13:15:49

The title works fine, thanks – jokajinx 2012-08-03 13:17:15

Try just '.' instead of '\.'? – ffledgling 2012-08-03 18:15:20

Code:

#!/usr/bin/env python 

import re,urllib,urllib2 

Url = "http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000" 
print Url 
print 'test .............' 
req = urllib2.Request(Url) 
print "1" 
response = urllib2.urlopen(req) 
print "2" 
the_webpage = response.read() 
print "3" 
titl1 = re.findall(r'<title>(.*?)</title>',the_webpage) 
print "4" 
a2 = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',the_webpage)] 
print "5" 
a2 = [x[0][5:] for x in a2] 
print "6" 
ti = titl1[0] 
print ti 
print "7" 
print a2 
print "8" 

print "9" 
#print the_page 
print "10" 

req=urllib2.Request(a2) 
print "11" 
temp_file=open(ti) 
print "12" 
buffer=urllib2.urlopen(req).read() 
print "13" 
temp_file.write(buff) 
print "14" 
temp_file.close() 
print "15" 
print "16"

Result

http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000 
test ............. 
1 
2 
3 
4 
5 
6 
Rick Ross - Sixteen (feat. Andre 3000) 
7 
['', '', ''] 
8 
9 
10 
Traceback (most recent call last): 
    File "grub.py", line 29, in <module> 
    req=urllib2.Request(a2) 
    File "/usr/lib/python2.7/urllib2.py", line 198, in __init__ 
    self.__original = unwrap(url) 
    File "/usr/lib/python2.7/urllib.py", line 1056, in unwrap 
    url = url.strip() 
AttributeError: 'list' object has no attribute 'strip'
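Judging only from the script as posted, the empty strings and the traceback seem to come from two separate lines. The line a2 = [x[0][5:] for x in a2] indexes into strings rather than tuples, and urllib2.Request(a2) receives the whole list instead of a single URL. The Python 3 sketch below reproduces the first effect; the list a2 is a hypothetical stand-in built from the link in the question, not the real result for this page.

```python
# Hypothetical value standing in for what the findall line plus '.mp3' produced.
a2 = ["http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3"] * 3

# Each x is already a string, so x[0] is its first character ('h'),
# and slicing that single character from index 5 gives an empty string.
broken = [x[0][5:] for x in a2]
print(broken)

# Without that line the URLs survive, and each one can be handled
# separately instead of passing the whole list to Request().
for number, url in enumerate(a2, start=1):
    print(number, url)
```

Two further details in the posted script would fail once the request itself works: open(ti) opens the file for reading, so writing needs open(ti, 'wb'), and the data is stored in buffer but written from buff.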

Source

2012-08-03 18:40:08 jokajinx

Try formatting your code. – ffledgling 2012-08-06 02:45:53

Python 3:

import requests 
import re 
from urllib.request import urlretrieve

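- First, get the HTML text

# 'url' is a placeholder for the address of the page
html_text=requests.get('url').text

- Use a regex to find the URLs

re.match('pattern', 'text', flags)

In the pattern, '()' groups the content you want. In this case we group 'http://*****.mp3', and on a match object you can refer to it with .group(1) or groups(). re.findall returns the content of the group directly.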

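# Capture every URL that follows 'file=' and ends in '.mp3'
url_matches = re.findall(r'file=(http://media.*?\.mp3)', html_text) 
index = 0 
for url_match in url_matches: 
    index += 1 
    print(url_match) 
    # The ./graber/mp3/ directory must already exist
    urlretrieve(url_match, './graber/mp3/user' + str(index) + '.mp3')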

This is how I got it done; I hope it helps. (There are several ways to download things; in this case I used urlretrieve.)
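The question asked for files named after the page title rather than numbered user files. One way that naming step might look in Python 3 is sketched below. The helper build_filenames and its replacement rule for awkward characters are assumptions made for this example, not part of any answer above.

```python
import re


def build_filenames(title, urls):
    """Return one .mp3 filename per URL, based on the page title."""
    # Replace characters that are not allowed in filenames on common systems.
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    if len(urls) == 1:
        return [safe_title + ".mp3"]
    # Several links on one page: add a running number to keep names unique.
    return ["{} {}.mp3".format(safe_title, i) for i, _ in enumerate(urls, start=1)]


single = build_filenames("Rick Ross - Sixteen (feat. Andre 3000)", ["u1"])
several = build_filenames("a/b", ["u1", "u2"])
print(single)
print(several)
```

Each returned name can then be passed as the second argument to urlretrieve, paired with its URL.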

Source

2017-06-01 05:14:12 tyrantqiao
