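In this Scrapy spider, I want to click "Go to store", which opens the store URL in a new tab, then close that tab and move back to the original tab. But something goes wrong in the script: Selenium cannot switch tabs and extract the URL.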
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from selenium import webdriver
from urlparse import urljoin
import time
from selenium.webdriver.common.keys import Keys


class CompItem(scrapy.Item):
    model_name = scrapy.Field()
    model_link = scrapy.Field()
    url = scrapy.Field()


class criticspider(CrawlSpider):
    name = "extract"
    allowed_domains = ["mysmartprice.com"]
    start_urls = ["http://www.mysmartprice.com/computer/lenovo-g50-70-laptop-msf201821"]

    def __init__(self, *args, **kwargs):
        super(criticspider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Firefox()
        self.browser.implicitly_wait(20)

    def parse_start_url(self, response):
        self.browser.get(response.url)
        item = CompItem()
        time.sleep(10)
        items = []
        # Save the window opener (the current window; not to be mistaken for a tab, they are not the same)
        button = self.browser.find_element_by_xpath("/html/body/div[3]/div/div[3]/div/div[2]/div[4]/div[4]/div[5]/div[1]")
        main_window = self.browser.current_window_handle
        # Open the link in a new tab by sending key strokes on the element
        # Use: Keys.CONTROL + Keys.SHIFT + Keys.RETURN to open tab on top of the stack
        button.send_keys(Keys.CONTROL + Keys.RETURN)
        # Switch tab to the new tab, which we will assume is the next one on the right
        self.browser.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.TAB)
        time.sleep(10)
        # Put focus on current window which will, in fact, put focus on the current visible tab
        self.browser.switch_to_window(main_window)
        item['url'] = self.browser.current_url
        # do whatever you have to do on this page, we will just go to sleep for now
        time.sleep(2)
        # Close current tab
        self.browser.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')
        yield item
The code does not raise any error, and I have tried it in multiple browsers, but I cannot work out what is going wrong.
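How can I get the URLs of all the store pages?
I only want the store URLs, not the URL I already parsed in start_urls. How can I ignore that one?
@JohnDene I have added a note: you may want to increase the page load timeout so the page can load before you read 'current_url'. – alecxe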
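For reference, here is a minimal sketch of the wait-then-read idea from the comment above, using window handles rather than Ctrl+Tab keystrokes. As far as I can tell, keystrokes only change which tab is visible, while `current_url` is read from whichever handle the driver is currently switched to, so switching back to `main_window` before reading it would return the original page's URL. The helper name `collect_new_tab_url` and its `open_tab`, `timeout` and `poll` parameters are made up for illustration. The `window_handles`, `current_window_handle`, `switch_to.window`, `current_url` and `close` members are the ones Selenium's Python WebDriver exposes (`switch_to.window` is the newer spelling of `switch_to_window`). This assumes the click really opens a new tab or window and has not been tested against the site in the question.

```python
import time


def collect_new_tab_url(driver, open_tab, timeout=20.0, poll=0.5):
    """Call open_tab(), switch to the tab it opens, return that tab's URL,
    then close the tab and switch back to the original window.

    `driver` is a Selenium WebDriver; `open_tab` is any callable that
    triggers the new tab, e.g. `lambda: button.click()` (hypothetical).
    """
    main_window = driver.current_window_handle
    known = set(driver.window_handles)

    open_tab()

    # Poll until a handle appears that was not there before the click.
    deadline = time.time() + timeout
    new_handles = []
    while time.time() < deadline:
        new_handles = [h for h in driver.window_handles if h not in known]
        if new_handles:
            break
        time.sleep(poll)
    if not new_handles:
        raise RuntimeError("no new tab or window appeared")

    # Point the driver at the new tab; commands now target that tab.
    driver.switch_to.window(new_handles[0])
    try:
        # Give the tab a chance to navigate away from the blank page
        # before reading its URL.
        while time.time() < deadline and driver.current_url in ("", "about:blank"):
            time.sleep(poll)
        return driver.current_url
    finally:
        # Close the new tab and return to the original one in any case.
        driver.close()
        driver.switch_to.window(main_window)
```

Inside `parse_start_url` this could be used as `item['url'] = collect_new_tab_url(self.browser, lambda: button.click())`. If the store page redirects several times, the URL read here may still be an intermediate one, so a longer wait or a check for the expected domain might be needed.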