2017-05-14 73 views

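Multiprocessing with Selenium WebDriver in Python

I crawl images with Selenium WebDriver (Chrome) in Python.

Can I use multiple drivers and have each driver crawl images?

I want to use multiprocessing to do the following.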

Source code

from time import sleep
from selenium import webdriver

def crawl(searchText):
    driver = webdriver.Chrome('C:\\Users\\HYOWON\\Desktop\\Desktop\\Graduation\\Code\\Crawling\\chromedriver.exe')

    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchText)

    driver.get(searchUrl)

    imgs_urls = []  # list of collected image URLs
    cnt = 0

    for j in range(20):
        # click the j-th thumbnail to open its preview
        element = driver.find_element_by_css_selector("div[data-ri = '" + str(cnt + j) + "'] img")
        element.click()
        sleep(1)

        # create_soup() is my own helper (not shown here)
        soup = create_soup()

        for img in soup.find_all('img'):
            try:
                if img['src'].startswith('http') and img['src'].endswith('jpg'):
                    imgs_urls.append(img['src'])
            except KeyError:  # <img> tag without a src attribute
                pass

    driver.close()
    return imgs_urls

Modified code

def crawl():
    # driver1, driver2 and driver3 are module-level drivers created elsewhere
    imgs_urls = []
    cnt = 0  # index of the next thumbnail to click
    for j in range(50):
        element1 = driver1.find_element_by_css_selector("div[data-ri = '" + str(cnt) + "'] img")
        element2 = driver2.find_element_by_css_selector("div[data-ri = '" + str(cnt) + "'] img")
        element3 = driver3.find_element_by_css_selector("div[data-ri = '" + str(cnt) + "'] img")

        element1.click()
        # creates a wait object; it does not block unless .until(...) is called
        WebDriverWait(driver1, 1)
        soup1 = create_soup(driver1)

        for img in soup1.find_all('img'):
            try:
                # keep only URLs that start with http and end with jpg
                if img['src'].startswith('http') and img['src'].endswith('jpg'):
                    imgs_urls.append(img['src'])
            except KeyError:  # <img> tag without a src attribute
                pass

        element2.click()
        WebDriverWait(driver2, 1)
        soup2 = create_soup(driver2)

        for img in soup2.find_all('img'):
            try:
                if img['src'].startswith('http') and img['src'].endswith('jpg'):
                    imgs_urls.append(img['src'])
            except KeyError:
                pass

        element3.click()
        WebDriverWait(driver3, 1)
        soup3 = create_soup(driver3)

        for img in soup3.find_all('img'):
            try:
                if img['src'].startswith('http') and img['src'].endswith('jpg'):
                    imgs_urls.append(img['src'])
            except KeyError:
                pass

        cnt += 3

    return imgs_urls

import urllib.request

def download_img(url, filename):
    full_name = str(filename) + ".jpg"
    urllib.request.urlretrieve(url, 'C:/Python/' + full_name)

# name the files 0.jpg, 1.jpg, ... in crawl order
for filename, url in enumerate(crawl()):
    download_img(url, filename)

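You need to implement an actual multiprocessing queue. Selenium is blocking, which means it stops your Python code from doing anything else. Driver 1 requests a page, and driver 2 cannot do anything until driver 1 has finished. The multiprocessing library solves this. – eusid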
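
To make the comment above concrete, here is a minimal sketch of one way to do it with `multiprocessing.Pool`, where each worker process starts and closes its own Chrome instance. The names `build_search_url`, `crawl_one` and `crawl_all` are made up for illustration, `webdriver.Chrome()` assumes chromedriver can be found on your PATH, and the URL quoting with `quote_plus` is my addition. Treat it as a starting point under those assumptions rather than a tested crawler.

```python
from multiprocessing import Pool
from urllib.parse import quote_plus


def build_search_url(search_text):
    """Build the image-search URL from the question for one search term."""
    return "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(
        quote_plus(search_text)
    )


def crawl_one(search_text):
    """Worker: runs in its own process and owns its own browser."""
    # Imported inside the worker so every process loads Selenium itself.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: chromedriver is on PATH
    try:
        driver.get(build_search_url(search_text))
        # The thumbnail clicking and URL collection from the question
        # would go here; this sketch only returns the page title.
        return search_text, driver.title
    finally:
        driver.quit()


def crawl_all(search_texts, processes=3):
    """Run crawl_one for every search term, `processes` browsers at a time."""
    with Pool(processes=processes) as pool:
        return pool.map(crawl_one, search_texts)
```

On Windows, call `crawl_all(["cat", "dog", "bird"])` from under an `if __name__ == "__main__":` guard, because every worker process imports the script again when it starts.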

Answer


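In fact, you can! I have been considering a multi-driver solution for a project I am currently working on.

In this example I simply declare the driver objects individually, although personally I would put them into some kind of array so they are easier to reference and can be iterated over. That would of course change the structure of your code a little, but you should not run into many problems.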
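
The array idea could look roughly like the sketch below. `open_all` and `close_all` are hypothetical helper names, and `make_driver` stands for any zero-argument callable that returns a driver (for example `webdriver.Chrome` when chromedriver is on PATH); nothing here is specific to Chrome.

```python
def close_all(drivers):
    """Quit every driver in the list."""
    for drv in drivers:
        drv.quit()


def open_all(urls, make_driver):
    """Start one driver per URL, load the URL, and return the drivers as a list."""
    drivers = []
    try:
        for url in urls:
            drv = make_driver()
            drivers.append(drv)
            drv.get(url)
    except Exception:
        # Do not leave browsers running if one of them fails to start or load.
        close_all(drivers)
        raise
    return drivers
```

With Selenium installed this would be called as `drivers = open_all([baseURL_1, baseURL_2], webdriver.Chrome)`, after which `drivers[0]` and `drivers[1]` take the place of separately declared globals. Note that the drivers still run one after another in a single process.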

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

baseURL_1 = "http://www.stackoverflow.com/"
baseURL_2 = "http://www.google.com/"

def main():
    init()
    initialPage()
    return

def init():
    # the two drivers are kept as module-level globals
    global drv1
    global drv2

    chromedrvPath = "C:\\path_to_chrome\\chromedriver.exe"
    opt = webdriver.ChromeOptions()
    # turn off Chrome's password-manager prompts
    opt.add_experimental_option('prefs', {
        'credentials_enable_service': False,
        'profile': {
            'password_manager_enabled': False
        }
    })
    # Selenium 3-style call; Selenium 4 takes the path via a Service object and uses options=
    drv1 = webdriver.Chrome(chromedrvPath, chrome_options=opt)
    drv2 = webdriver.Chrome(chromedrvPath, chrome_options=opt)

    return

def initialPage():
    navigate(baseURL_1, 1)
    navigate(baseURL_2, 2)
    return

def navigate(URL, d):
    # d selects which driver loads the URL
    if d == 1:
        drv1.get(URL)
    if d == 2:
        drv2.get(URL)
    return

if __name__ == "__main__":
    main()

Thank you very much. I tried to fix the code above in my own way, but I get a 403 Forbidden error. Can I fix it? Please see **Modified code**. –
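
The thread does not say which request returned the 403. If it came from `urllib.request.urlretrieve` in `download_img`, one possible cause is that the image host rejects urllib's default User-Agent. The sketch below, assuming that is the cause, sends a browser-like header instead. `USER_AGENT` and `build_request` are made-up names, the header value is only a placeholder, and `download_img` here takes a full output path rather than the bare filename used in the question.

```python
import shutil
import urllib.request

# Placeholder value; any browser-like User-Agent string could be used here.
USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=USER_AGENT):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_img(url, path):
    """Download one image to `path` using the request built above."""
    with urllib.request.urlopen(build_request(url)) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
```

If the 403 persists with a browser-like header, the server is probably refusing the request for another reason, and the header will not help.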