Python multiprocessing - using workers on demand


I don't know the number of pages. This is the original code:

next_button=soup.find_all('a',{'class':"btn-page_nav right"})
while next_button:
    link=next_button[0]['href']
    # 'webpage' is a placeholder for the site's base URL
    resp=requests.get('webpage'+link)
    soup=BeautifulSoup(resp.content)
    table=soup.find('table',{'class':'js-searchresults'})
    body=table.find('tbody')
    rows=body.find_all('tr')
    function(rows)
    # look for the "next page" button on the page just loaded
    next_button=soup.find_all('a',{'class':"btn-page_nav right"})

It works fine; function(rows) is a function that parses part of each page.

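What I want to do is parse these pages using multiprocessing. I thought about using a pool of 3 workers so that I can process 3 pages at a time, but I can't figure out how to implement it.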

One solution is this:

rows_list=[] 
next_button=soup.find_all('a',{'class':"btn-page_nav right"}) 
while next_button: 
    link=next_button[0]['href'] 
    resp=requests.get('webpage'+link)
    soup=BeautifulSoup(resp.content) 
    table=soup.find('table',{'class':'js-searchresults'}) 
    body=table.find('tbody') 
    rows=body.find_all('tr') 
    rows_list.append(rows) 
    next_button=soup.find_all('a',{'class':"btn-page_nav right"})

wait for the program to loop over all the pages, and then:

pool=multiprocessing.Pool(processes=4) 
pool.map(function,rows_list)

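But I don't think this would improve performance much. I want the main process to loop over the pages and, as soon as it opens a page, send it to a worker. How can this be done? A dummy example: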

pool=multiprocessing.Pool(processes=4) 

next_button=soup.find_all('a',{'class':"btn-page_nav right"}) 
while next_button: 
    link=next_button[0]['href'] 
    resp=requests.get('webpage'+link)
    soup=BeautifulSoup(resp.content) 
    table=soup.find('table',{'class':'js-searchresults'}) 
    body=table.find('tbody') 
    rows=body.find_all('tr') 

    pool.send_to_idle_worker(rows)  # pseudocode: not a real Pool method

    next_button=soup.find_all('a',{'class':"btn-page_nav right"})


2017-10-19 Mike

You can use the concurrent.futures package instead of multiprocessing. For example:

import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    while next_button:
        rows = ...
        executor.submit(function, rows)
        next_button = ...

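You can instantiate the executor with any number of workers, e.g. executor = ProcessPoolExecutor(max_workers=10), but if it is not given, max_workers defaults to the number of processors on your machine. Further details are in the Python docs.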
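To make the snippet above concrete, here is a minimal runnable sketch of the same idea. The names count_cells, iter_pages and parse_as_pages_arrive are hypothetical: count_cells stands in for function(rows), and iter_pages stands in for the scraping loop, yielding plain lists instead of fetching real pages. Keeping the futures returned by submit() is one way to get the results back and to see any exception raised in a worker.

```python
import concurrent.futures


def count_cells(rows):
    """Hypothetical stand-in for function(rows): count the cells on one page."""
    return sum(len(row) for row in rows)


def iter_pages():
    """Hypothetical stand-in for the scraping loop: yield one page's rows at a time."""
    yield [["a", "b"], ["c", "d"]]
    yield [["e", "f"]]
    yield [["g"], ["h"], ["i"]]


def parse_as_pages_arrive(pages, worker, max_workers=3):
    """Hand each page to the pool as soon as the main process produces it."""
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns immediately, so the main process keeps producing
        # pages while the workers parse the ones already handed over.
        futures = [executor.submit(worker, rows) for rows in pages]
        # result() blocks until that page is done and re-raises worker errors.
        return [future.result() for future in futures]


if __name__ == "__main__":
    print(parse_as_pages_arrive(iter_pages(), count_cells))  # prints [4, 2, 3]
```

Arguments are pickled on their way to the worker, so it may be simpler to submit plain data (for example the text of each row) rather than parsed objects.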


2017-10-19 10:14:17 hoefling

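Could you use Pool.apply_async() instead of Pool.map()? apply_async() does not block, so the main program can continue processing more rows. It also does not require your main program to have all the data ready for mapping. You just pass one chunk as the argument to apply_async().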
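A rough sketch of what that could look like, assuming a pool of 3 workers as in the question. The names count_rows and dispatch_pages are made up for illustration, and count_rows only stands in for function(rows).

```python
import multiprocessing


def count_rows(rows):
    """Hypothetical stand-in for function(rows)."""
    return len(rows)


def dispatch_pages(pages, worker, processes=3):
    """Queue one apply_async() call per page and collect the results in order."""
    with multiprocessing.Pool(processes=processes) as pool:
        pending = []
        for rows in pages:
            # apply_async() queues the call and returns an AsyncResult at once,
            # so this loop can move on to the next page straight away.
            pending.append(pool.apply_async(worker, (rows,)))
        # Collect inside the with block: leaving it terminates the pool.
        return [result.get() for result in pending]


if __name__ == "__main__":
    print(dispatch_pages([["r1", "r2"], ["r3"]], count_rows))  # prints [2, 1]
```

In the real loop, the apply_async() call would sit where the question has pool.send_to_idle_worker(rows).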


2017-10-19 10:14:38 Hannu
