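Threading to speed up scraping data from a given list of websites

I wrote a program that visits a given list of websites (100 links) and scrapes information from them. Currently, my program does this sequentially; that is, it checks one site at a time. The skeleton of my program is as follows.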

for j in range(len(num_of_links)):
    try:  # if an error occurs, this jumps to the next website in the list
        site_exist(j)  # a function to check if the site exists
        get_url_with_info(j)  # a function to get links inside the website
    except Exception as e:
        print(str(e))
filter_result_info(links_with_info)  # function that filters the result

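Needless to say, this process is very slow. Is it possible to use threading so that my program gets through the job faster, for example with 4 concurrent jobs that each scrape 25 of the links in the list? Could you point me to a reference on how to do this?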

2015-10-18 JPdL

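Might your list contain many links on the same website? If so, you should slow your scraping down rather than speed it up; otherwise your requests may pile onto a single target and look like a denial-of-service attempt. – halfer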

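If the sites are all different and you are using curl, take a look at the "cURL multi" feature, which lets you start multiple HTTP operations in parallel. One can do this in PHP, so I would wager that Python allows it too (it is the same underlying library). – halfer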

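@halfer You are right. I should put some kind of threshold on the scraping speed for any one website (I don't know how yet). Yes, the websites in the list all have different content. The idea is to get certain information from every website in the list. Currently, my program gets through all the websites in 4 minutes. I just feel I could speed it up. Thanks for pointing out cURL multi. I will take a look. – JPdL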
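On the question of how to put a threshold on the scraping speed per website, one possible approach is to remember when each host was last contacted and sleep before hitting it again. This is only a minimal sketch under assumptions: the `HostThrottle` class and its `min_interval` parameter are hypothetical names made up for illustration, not part of any library, and the right delay depends on the site.

```python
import threading
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host (hypothetical helper)."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests to one host
        self._lock = threading.Lock()
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Block until a request to url's host is allowed; return the seconds waited."""
        host = urlparse(url).netloc
        with self._lock:
            now = time.monotonic()
            start_at = max(now, self._next_allowed.get(host, now))
            # Reserve the slot while holding the lock so that concurrent
            # callers for the same host queue up behind each other.
            self._next_allowed[host] = start_at + self.min_interval
        delay = start_at - now
        if delay > 0:
            time.sleep(delay)  # sleep outside the lock so other hosts are not blocked
        return delay
```

Each worker would call `throttle.wait(url)` right before making its request. Requests to different hosts are not delayed, so a list of 100 distinct sites still runs at full speed.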

Threading will not speed this up. Multiprocessing may be what you want.

Multiprocessing vs Threading Python

2015-10-18 08:25:41 Chromadude

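That is not true; threading speeds up web requests by making the function calls non-blocking (asynchronous). You just need to know how to do it. – lingxiao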

@LingxiaoXia Oh, right, oops. My bad. – Chromadude
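The point made in the comments can be checked without touching the network. The sketch below assumes a blocking request can be simulated with `time.sleep`, which releases the GIL much as a blocking socket read does; `fake_request` and the example URLs are hypothetical stand-ins, not the asker's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(url):
    """Stand-in for a blocking HTTP request (hypothetical, no network used)."""
    time.sleep(0.1)  # simulates waiting on the server
    return url


def run_sequential(urls):
    """Fetch the URLs one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_request(u) for u in urls]
    return results, time.perf_counter() - start


def run_threaded(urls, workers=4):
    """Fetch the URLs with a pool of worker threads and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_request, urls))
    return results, time.perf_counter() - start


urls = ["http://example.com/page%d" % i for i in range(8)]
seq_results, seq_time = run_sequential(urls)
thr_results, thr_time = run_threaded(urls)
print("sequential: %.2fs, threaded: %.2fs" % (seq_time, thr_time))
```

Because the threads spend their time waiting rather than computing, the threaded run should finish in roughly the time of the slowest batch instead of the sum of all requests. For CPU-bound work the picture is different, and that is where multiprocessing tends to help.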

What you want is a pool of threads.

from concurrent.futures import ThreadPoolExecutor


def get_url(url):
    try:
        if site_exists(url):
            return get_url_with_info(url)
        return None
    except Exception as error:
        print(error)
        return None


with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns an iterator of results in input order, not a future;
    # list() drains it, which waits until all URLs have been retrieved
    list_of_results = list(pool.map(get_url, list_of_urls))

filter_result_info(list_of_results)  # note that some results might be None
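If you would rather know which URL failed instead of only printing the error, a variant of the pool approach uses `submit` and `as_completed`. This is a sketch, not a drop-in replacement: `site_exists` and `get_url_with_info` are stubbed here with hypothetical behaviour so the example runs on its own, and `scrape_one` and `scrape_all` are names invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def site_exists(url):
    # Hypothetical stub: the real function would issue an HTTP request.
    return not url.endswith("/missing")


def get_url_with_info(url):
    # Hypothetical stub: the real function would scrape links from the page.
    if url.endswith("/broken"):
        raise ValueError("could not parse " + url)
    return [url + "/info"]


def scrape_one(url):
    """Return the links found on url, or None if the site does not exist."""
    return get_url_with_info(url) if site_exists(url) else None


def scrape_all(urls, workers=4):
    """Scrape all urls concurrently; return (results, errors) keyed by URL."""
    results = {}  # url -> list of links
    errors = {}   # url -> exception raised while scraping
    with ThreadPoolExecutor(max_workers=workers) as pool:
        future_to_url = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                links = future.result()  # re-raises any exception from the worker
            except Exception as error:
                errors[url] = error
            else:
                if links is not None:
                    results[url] = links
    return results, errors
```

Results arrive in completion order rather than input order, which is why they are stored in dictionaries keyed by URL.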

2015-10-18 09:46:02 noxdafox
