2017-08-15

Getting the URL from a hyperlink using BeautifulSoup in Python

I used BeautifulSoup to parse a web page and assigned the result to 'soup'. I can get the text 'Aberdeen' by appending .text to 'site_url'.

What I really want is the full URL as a string, e.g. "http://www.somewebsite.com/networks/site-info?site_id=ABD".

>>> site_link = soup.find_all('a', string='Aberdeen')[0]
>>> site_row = site_link.findParent('td').findParent('tr')
>>> site_column = site_row.findAll('td')
>>> site_url = site_column[0].contents[0]
>>> print(site_url)

<a href="../networks/site-info?site_id=ABD">Aberdeen</a> 

I haven't had any luck so far and don't know what else to try. How can I get the URL?


Take a look at [this](https://stackoverflow.com/a/1080472/7654934). Hope this helps!


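The page I'm trying to scrape is https://uk-air.defra.gov.uk/latest/currentlevels. I'm interested in the URLs that correspond to the site names in the first column of the table, e.g. https://uk-air.defra.gov.uk/networks/site-info?site_id=ACTH for the first name, Auchencorth Moss. – Paulos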


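@N.Ivanov I tried something similar, but the problem is that the page contains many different kinds of links, and I only want the ones I mentioned. – Paulos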

Answers


You can use a regular expression to select the links, and urljoin to turn each relative href into the correct absolute URL.

import re

import requests
from bs4 import BeautifulSoup

try:
    from urlparse import urljoin  # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3

url = 'https://uk-air.defra.gov.uk/latest/currentlevels'
# Send a non-empty User-Agent header with the request.
r = requests.get(url, headers={'User-Agent': 'Not blank'})
soup = BeautifulSoup(r.text, 'html.parser')

# Select every <a> whose href contains "site_id".
for elem in soup('a', href=re.compile(r'site_id')):
    print(elem.text)
    # Resolve the relative href against the page URL.
    print(urljoin(url, elem['href']))

Output:

Auchencorth Moss 
https://uk-air.defra.gov.uk/networks/site-info?site_id=ACTH 
Bush Estate 
https://uk-air.defra.gov.uk/networks/site-info?site_id=BUSH 
Dumbarton Roadside 
https://uk-air.defra.gov.uk/networks/site-info?site_id=DUMB 
Edinburgh St Leonards 
https://uk-air.defra.gov.uk/networks/site-info?site_id=ED3 
Glasgow Great Western Road 
https://uk-air.defra.gov.uk/networks/site-info?site_id=GGWR 
Glasgow High Street 
https://uk-air.defra.gov.uk/networks/site-info?site_id=GHSR 
... 

If you only want a single site, use:

for elem in soup('a', href=re.compile(r'site_id'), string='Aberdeen'):

instead of:

for elem in soup('a', href=re.compile(r'site_id')):

Output:

Aberdeen 
https://uk-air.defra.gov.uk/networks/site-info?site_id=ABD 
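
The same idea can be tried offline against the fragment from the question. The following is only a minimal sketch: the inline `HTML` string, the `PAGE_URL` constant and the `site_url_for` helper are hypothetical names introduced for illustration, and the page URL is assumed to be the one given in the comments.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: the page the fragment came from (taken from the comments).
PAGE_URL = "https://uk-air.defra.gov.uk/latest/currentlevels"

# Inline copy of the link from the question, so no network access is needed.
# The surrounding table markup is a guess at the page structure.
HTML = """
<table>
  <tr>
    <td><a href="../networks/site-info?site_id=ABD">Aberdeen</a></td>
  </tr>
</table>
"""


def site_url_for(soup, name, page_url):
    """Return the absolute URL of the first <a> whose text is `name`, or None."""
    link = soup.find("a", string=name)
    if link is None or not link.has_attr("href"):
        return None
    # link["href"] is the raw attribute value; urljoin resolves the "../".
    return urljoin(page_url, link["href"])


soup = BeautifulSoup(HTML, "html.parser")
aberdeen_url = site_url_for(soup, "Aberdeen", PAGE_URL)
print(aberdeen_url)
```

Indexing the tag with `['href']` is what returns the attribute value rather than the whole `<a>` element, which is the step the question was missing.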

Try this. I hope it covers everything you need:

import requests
from lxml import html  # .cssselect() also needs the cssselect package

base_link = "https://uk-air.defra.gov.uk"
page_url = base_link + "/latest/currentlevels"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
response = requests.get(page_url, headers=headers).text
tree = html.fromstring(response)
for title in tree.cssselect("table.current_levels_table td a:not(.smalltext)"):
    # Drop the leading ".." of the relative href before appending it to the base.
    print(base_link + title.attrib['href'][2:])

Partial output:

https://uk-air.defra.gov.uk/networks/site-info?site_id=ACTH 
https://uk-air.defra.gov.uk/networks/site-info?site_id=BUSH 
https://uk-air.defra.gov.uk/networks/site-info?site_id=DUMB 
https://uk-air.defra.gov.uk/networks/site-info?site_id=ED3 
https://uk-air.defra.gov.uk/networks/site-info?site_id=GGWR 
https://uk-air.defra.gov.uk/networks/site-info?site_id=GHSR 
https://uk-air.defra.gov.uk/networks/site-info?site_id=GLA4 
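
Slicing off the first two characters works as long as every href starts with "..". As a hedged comparison, the sketch below uses only the standard library to show how urljoin handles several href forms; only the first form appears in the question, and the other two are hypothetical variants added for illustration.

```python
from urllib.parse import urljoin

BASE_LINK = "https://uk-air.defra.gov.uk"
PAGE_URL = BASE_LINK + "/latest/currentlevels"

hrefs = [
    "../networks/site-info?site_id=ACTH",  # form shown in the question
    "/networks/site-info?site_id=ACTH",    # hypothetical root-relative form
    BASE_LINK + "/networks/site-info?site_id=ACTH",  # hypothetical absolute form
]

# urljoin resolves each form against the URL of the page it was found on.
resolved = [urljoin(PAGE_URL, href) for href in hrefs]

# String slicing assumes a leading ".." and breaks for the other forms.
sliced = [BASE_LINK + href[2:] for href in hrefs]

for href, joined, cut in zip(hrefs, resolved, sliced):
    print(href)
    print("  urljoin:", joined)
    print("  slicing:", cut)
```

All three inputs resolve to the same URL with urljoin, while slicing only gives the right result for the first one.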