Python web crawler from thenewboston

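I recently watched thenewboston's video on writing a web crawler in Python. For some reason, I'm getting an SSLError. I tried to fix it with line 6 of the code, but no luck. Any idea why it is throwing errors? The code is verbatim from thenewboston.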

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    #requests.get('https://www.thenewboston.com/', verify = True)
    while page <= max_pages:

        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)

        for link in soup.findAll('a', {'class' : 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)

        page += 1

creepy_crawly(1)

Source

2014-11-24 Steven

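The SSL error is due to the site's web certificate. It is probably because the URL you are trying to crawl is 'https'. Try another site that uses plain http. – Craicerjack 2014-11-24 19:24:02

Possible duplicate of http://stackoverflow.com/q/10667960/783219 – Prusse 2014-11-24 19:46:30

Thanks, Craicerjack! I tried it on a site that uses plain "http" and it worked! But how would I go about running the web crawler on a domain that uses "https"? – Steven 2014-11-24 20:10:12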
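
One way to approach the https question raised in the comments above is to leave certificate verification switched on and catch the failure explicitly, so the crawler reports the cause instead of crashing. The sketch below is only an illustration under assumptions: `fetch_html`, its parameters, and the optional CA-bundle path are hypothetical names that do not appear in the original code, and it makes no claim about why this particular site failed.

```python
import requests


def fetch_html(url, ca_bundle=None, timeout=10):
    """Return the page text, or None if the certificate check fails.

    ca_bundle is a hypothetical optional path to a CA bundle file;
    when it is omitted, requests uses its default trust store.
    """
    verify = ca_bundle if ca_bundle else True  # never disable verification
    try:
        response = requests.get(url, verify=verify, timeout=timeout)
        response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
        return response.text
    except requests.exceptions.SSLError as exc:
        # Report the TLS problem and let the caller skip this page.
        print("SSL error for %s: %s" % (url, exc))
        return None
```

With a helper like this, the loop in the question could call `fetch_html(url)` in place of `requests.get(url).text` and skip any page for which it returns `None`.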

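I made a web crawler using urllib. It can be faster and has no problem accessing https pages. One caveat is that it does not validate the server certificate, which makes it faster but more dangerous (vulnerable to MITM attacks). Below is a usage example of this library: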

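import urllib  # Python 2 API; on Python 3 use urllib.request.urlopen instead

link = 'https://www.stackoverflow.com'
html = urllib.urlopen(link).read()  # download the raw HTML of the page
print(html)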

Three lines, plus the import, are all you need to grab the HTML from a page. Simple, isn't it?
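
Since skipping certificate validation leaves a crawler open to MITM attacks, here is a minimal sketch of the same fetch with validation turned on, written for Python 3's standard library rather than the Python 2 `urllib` shown above. The function names `make_verifying_context` and `fetch_html_verified` are hypothetical, chosen for illustration only.

```python
import ssl
import urllib.request


def make_verifying_context():
    """Build a TLS context that checks the certificate chain and host name."""
    return ssl.create_default_context()


def fetch_html_verified(url, timeout=10):
    """Download a page over https, failing if the certificate is invalid."""
    context = make_verifying_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A call such as `fetch_html_verified('https://www.stackoverflow.com')` would then raise an error on a bad certificate instead of silently accepting it.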

I also suggest using regular expressions on the HTML to grab the other links. An example (using the re library) would be:

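import re
import urlparse  # Python 2 module; on Python 3 use urllib.parse
# origLink is the URL of the page that html was downloaded from
for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # find the href of every <a> tag
    path = url.split("#", 1)[0]  # drop any #fragment
    # keep absolute URLs as they are; prefix the others with the page's scheme and host
    link = path if url.startswith("http") \
        else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + path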

Source

2016-11-29 06:19:42 ArthurG

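Isn't the general rule that you shouldn't use regular expressions to parse HTML? – Steven 2016-12-05 18:00:55

Regular expressions are considered slow in many languages, but that doesn't seem to be the case in Python. My web crawler is able to process 10 links per second, so unless you want something faster than that, regex will serve you well. Needless to say, regex is very precise. – ArthurG 2016-12-06 19:00:28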
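
For comparison with the regex approach discussed in these comments, the following is a rough sketch of link extraction with an actual HTML parser from the Python 3 standard library. It is not the answerer's method: `LinkCollector` and `extract_links` are hypothetical names, and `urljoin` is swapped in for the manual scheme-and-host concatenation so that relative paths resolve against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                self.links.append(urldefrag(absolute)[0])  # drop #fragment


def extract_links(html, base_url):
    """Return all link targets found in html, resolved against base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Calling `extract_links(html, page_url)` returns a list of absolute URLs with fragments removed, which a crawler could feed into its queue.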
