You can use a regular expression to select the links you want, then use urljoin to turn each href into the correct absolute URL.
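import re

import requests
from bs4 import BeautifulSoup

try:
    from urlparse import urljoin  # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3

url = 'https://uk-air.defra.gov.uk/latest/currentlevels'
# Send a non-empty User-Agent header with the request.
r = requests.get(url, headers={'User-Agent': 'Not blank'})
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# soup(...) is shorthand for soup.find_all(...).
# Keep only <a> tags whose href contains 'site_id'.
for elem in soup('a', href=re.compile(r'site_id')):
    print(elem.text)
    # Resolve the href against the page URL to get an absolute link.
    print(urljoin(url, elem['href']))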
Output:
Auchencorth Moss
https://uk-air.defra.gov.uk/networks/site-info?site_id=ACTH
Bush Estate
https://uk-air.defra.gov.uk/networks/site-info?site_id=BUSH
Dumbarton Roadside
https://uk-air.defra.gov.uk/networks/site-info?site_id=DUMB
Edinburgh St Leonards
https://uk-air.defra.gov.uk/networks/site-info?site_id=ED3
Glasgow Great Western Road
https://uk-air.defra.gov.uk/networks/site-info?site_id=GGWR
Glasgow High Street
https://uk-air.defra.gov.uk/networks/site-info?site_id=GHSR
...
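To see what urljoin contributes here, the following is a minimal offline sketch using only the standard library (Python 3 import only). The href values are hypothetical examples of the forms a link might take; the forms actually used on the page may differ.

```python
from urllib.parse import urljoin  # Python 3 only

base = 'https://uk-air.defra.gov.uk/latest/currentlevels'

# Hypothetical href values, chosen to illustrate three common forms.
hrefs = [
    '../networks/site-info?site_id=ACTH',  # relative to the current page
    '/networks/site-info?site_id=BUSH',    # relative to the site root
    'https://uk-air.defra.gov.uk/networks/site-info?site_id=DUMB',  # already absolute
]

# urljoin resolves each href against the base URL; absolute URLs pass through unchanged.
resolved = [urljoin(base, href) for href in hrefs]
for link in resolved:
    print(link)
```

All three forms resolve to a URL under https://uk-air.defra.gov.uk/networks/site-info, which is why the loop can print usable links without caring how the page wrote them.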
If you only want a single site, use:
for elem in soup('a', href=re.compile(r'site_id'), string='Aberdeen'):
instead of:
for elem in soup('a', href=re.compile(r'site_id')):
Output:
Aberdeen
https://uk-air.defra.gov.uk/networks/site-info?site_id=ABD
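Since the page contains many kinds of links, it may help to see why the pattern picks out only the site links. Beautiful Soup tests a compiled pattern against the attribute value with its search() method, so any href that contains site_id anywhere will match. The sketch below reproduces that check with the re module alone; the sample hrefs are hypothetical and not taken from the page.

```python
import re

pattern = re.compile(r'site_id')

# Hypothetical hrefs standing in for the mix of links on a page.
sample_hrefs = [
    '../networks/site-info?site_id=ACTH',
    '/latest/currentlevels?view=region',
    '#content',
    '../networks/site-info?site_id=ABD',
]

# search() matches anywhere in the string, so only hrefs containing 'site_id' are kept.
site_links = [href for href in sample_hrefs if pattern.search(href)]
print(site_links)
```

If other links on the page also happened to carry a site_id parameter, a stricter pattern such as r'site-info\?site_id=' would be one way to narrow the match.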
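Take a look at [this](https://stackoverflow.com/a/1080472/7654934). Hope this helps! –
The page I am trying to scrape is https://uk-air.defra.gov.uk/latest/currentlevels, and I am interested in the URLs that correspond to the site names in the first column of the table, e.g. https://uk-air.defra.gov.uk/networks/site-info?site_id=ACTH for the first name, Auchencorth Moss – Paulos
@N.Ivanov I tried something similar, but the problem is that there are many different types of links on the page, and I only want the links I described – Paulos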