ThreadPoolExecutor vs threading.Thread

I have a performance question about ThreadPoolExecutor vs. the Thread class, and it seems to me that I am missing some fundamental understanding.

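I have a web scraper with two functions. The first parses the image links on a website's main page, and the second downloads the images from the parsed links: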

import os
import threading
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup as bs

path = r'C:\Users\MyDocuments\Pythom\Networking\bbc_images_scraper_test'
url = 'https://www.bbc.co.uk'

# Function to parse link anchors for images
def img_links_parser(url, links_list):
    res = urllib.request.urlopen(url)
    soup = bs(res, 'lxml')
    content = soup.find_all('div', {'class': 'top-story__image'})

    for i in content:
        try:
            link = i.attrs['style']
            # Pulling the anchor from parentheses
            link = link[link.find('(') + 1 : link.find(')')]
            # Putting the anchor in the list of links
            links_list.append(link)
        except KeyError:
            # Links might be under the 'data-lazy' attribute without parentheses
            links_list.append(i.attrs['data-lazy'])

# Function to load images from links
def img_loader(base_url, links_list, path_location):
    for link in links_list:
        try:
            # Pulling the last element off the link, which is name.jpg
            file_name = link.split('/')[-1]
            # Following the link and saving the content in the given directory
            urllib.request.urlretrieve(urllib.parse.urljoin(base_url, link),
                                       os.path.join(path_location, file_name))
        except Exception:
            print('Error on {}'.format(urllib.parse.urljoin(base_url, link)))

The code that follows is split into two cases:

Case 1: I use multiple threads:

# Collect the image links before splitting them across threads
links = []
img_links_parser(url, links)

threads = []
t1 = threading.Thread(target=img_loader, args=(url, links[:10], path))
t2 = threading.Thread(target=img_loader, args=(url, links[10:20], path))
t3 = threading.Thread(target=img_loader, args=(url, links[20:30], path))
t4 = threading.Thread(target=img_loader, args=(url, links[30:40], path))
t5 = threading.Thread(target=img_loader, args=(url, links[40:50], path))
t6 = threading.Thread(target=img_loader, args=(url, links[50:], path))

threads.extend([t1, t2, t3, t4, t5, t6])
for t in threads:
    t.start()
for t in threads:
    t.join()

The code above takes 10 seconds to run on my machine.

Case 2: I use ThreadPoolExecutor:

with ThreadPoolExecutor(50) as exec: 
    results = exec.submit(img_loader, url, links, path)

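The code above takes 18 seconds.

My understanding is that ThreadPoolExecutor creates one thread per worker. So, given that I set max_workers to 50, I expected 50 threads and therefore a faster job.

Can someone please explain what I am missing here? I admit I am probably making a silly mistake, but I just don't see it.

Many thanks!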

Source

2017-12-27 Vlad

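Just as @hansaplast noted, I was using only one worker. So I changed my 'img_loader' function to accept a single link, and then added a 'for' loop under the context manager to process each link in the list. That cut the time down to 3.8 seconds. – Vlad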
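The change described in this comment could look roughly like the sketch below. It is not Vlad's actual code: `img_loader_single`, `load_all` and the `fetch` parameter are hypothetical names introduced here, and `fetch` exists only so the download step can be swapped out when trying the code without network access.

```python
import os
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def img_loader_single(base_url, link, path_location,
                      fetch=urllib.request.urlretrieve):
    """Download a single image and return the path it was saved to.

    `fetch` is a hypothetical hook (not in the original post) so the
    transfer itself can be replaced by a stub.
    """
    full_url = urllib.parse.urljoin(base_url, link)
    # The last element of the link is the file name, e.g. name.jpg
    target = os.path.join(path_location, link.split('/')[-1])
    fetch(full_url, target)
    return target


def load_all(base_url, links, path_location, max_workers=50,
             fetch=urllib.request.urlretrieve):
    """Submit one task per link; return (saved_paths, failures)."""
    saved, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # One submit() call per link, so the pool can spread the work
        future_to_link = {
            executor.submit(img_loader_single, base_url, link,
                            path_location, fetch): link
            for link in links
        }
        for future in as_completed(future_to_link):
            link = future_to_link[future]
            try:
                saved.append(future.result())
            except Exception as err:
                # One failed download should not stop the others
                failed.append((link, err))
    return saved, failed
```

With the names from the question, a call would be `saved, failed = load_all(url, links, path)`. Collecting the futures and reading `future.result()` is a design choice of this sketch: exceptions raised inside a worker are otherwise stored on the future and never shown.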

In case 2 you are sending all the links to a single worker. Instead of

exec.submit(img_loader, url, links, path)

you need:

for link in links: 
    exec.submit(img_loader, url, [link], path)

I have not tried this myself; it is based on reading the documentation of ThreadPoolExecutor.
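The difference between the two calls can be checked without touching the network. The following is only a minimal sketch: `fake_download` is a hypothetical stand-in for `img_loader` that sleeps instead of downloading and returns the name of the thread it ran on, and the pool variable is called `executor` rather than `exec` to avoid shadowing the built-in.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(links):
    """Hypothetical stand-in for img_loader: sleeps once per link
    and reports which worker thread did the work."""
    for _ in links:
        time.sleep(0.05)
    return threading.current_thread().name


links = ['img{}.jpg'.format(i) for i in range(8)]

# One submit() call is one task, and one task runs on one worker thread,
# regardless of how large max_workers is.
with ThreadPoolExecutor(max_workers=4) as executor:
    future = executor.submit(fake_download, links)
one_task_threads = {future.result()}

# One submit() call per link gives the pool separate tasks to distribute.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fake_download, [link]) for link in links]
per_link_threads = {f.result() for f in futures}

print('threads used with a single submit:', len(one_task_threads))
print('threads used with one submit per link:', len(per_link_threads))
```

The first count is always 1, because the whole list is handled inside a single task. The second count can reach `max_workers`, since each link is a task of its own.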

Source

2017-12-27 17:00:53 hansaplast

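Yes, you are right. I don't know why I didn't try it myself, even though I had thought about it. Thanks a lot for getting back to me! The result is 3.8 seconds, which is cool! :) – Vlad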
