Scrapy FormRequest does not request the redirect link

I followed the basic Scrapy login procedure. It has always worked, but in this case I ran into a problem: instead of requesting https://www.crowdfunder.com/user/validateLogin, FormRequest.from_response always sends the payload to https://www.crowdfunder.com/user/signup. I also tried POSTing the payload directly to validateLogin, but that returned a 404 error. Any ideas to help me solve this? Thanks in advance!

import scrapy
from scrapy.spiders.init import InitSpider


class CrowdfunderSpider(InitSpider):
    name = "crowdfunder"
    allowed_domains = ["crowdfunder.com"]
    start_urls = [
        'http://www.crowdfunder.com/',
    ]

    login_page = 'https://www.crowdfunder.com/user/login/'
    payload = {}

    def init_request(self):
        """This function is called before crawling starts."""
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        self.payload = {'email': 'my_email',
                        'password': 'my_password'}

        # scrapy login
        return scrapy.FormRequest.from_response(
            response, formdata=self.payload,
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.url == 'https://www.crowdfunder.com/user/settings':
            self.log("Successfully logged in. :) :) :)")
            # start the crawling
            return self.initialized()
        else:
            # login failed
            self.log("login failed :(:(:(")

Here are the payload and request URL that the browser sends when the login button is clicked:

[screenshot: payload and request url sent by clicking login button]

Here is the log output:

2016-10-21 21:56:21 [scrapy] INFO: Scrapy 1.1.0 started (bot: crowdfunder_crawl) 
2016-10-21 21:56:21 [scrapy] INFO: Overridden settings: {'AJAXCRAWL_ENABLED': True, 'NEWSPIDER_MODULE': 'crowdfunder_crawl.spiders', 'SPIDER_MODULES': ['crowdfunder_crawl.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'crowdfunder_crawl'} 
2016-10-21 21:56:21 [scrapy] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2016-10-21 21:56:21 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-10-21 21:56:21 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-21 21:56:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-21 21:56:21 [scrapy] INFO: Spider opened
2016-10-21 21:56:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-21 21:56:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-21 21:56:21 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/robots.txt> (referer: None)
2016-10-21 21:56:21 [scrapy] DEBUG: Redirecting (301) to <GET http://www.crowdfunder.com/user/login> from <GET https://www.crowdfunder.com/user/login/>
2016-10-21 21:56:22 [scrapy] DEBUG: Redirecting (301) to <GET https://www.crowdfunder.com/user/login> from <GET http://www.crowdfunder.com/user/login>
2016-10-21 21:56:22 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/user/login> (referer: None)
2016-10-21 21:56:23 [scrapy] DEBUG: Crawled (200) <POST https://www.crowdfunder.com/user/signup> (referer: https://www.crowdfunder.com/user/login)
2016-10-21 21:56:23 [crowdfunder] DEBUG: login failed :(:(:(
2016-10-21 21:56:23 [scrapy] INFO: Closing spider (finished) 
2016-10-21 21:56:23 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1569, 
'downloader/request_count': 5, 
'downloader/request_method_count/GET': 4, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 16313, 
'downloader/response_count': 5, 
'downloader/response_status_count/200': 3, 
'downloader/response_status_count/301': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 22, 4, 56, 23, 232493), 
'log_count/DEBUG': 7, 
'log_count/INFO': 7, 
'request_depth_max': 1, 
'response_received_count': 3, 
'scheduler/dequeued': 4, 
'scheduler/dequeued/memory': 4, 
'scheduler/enqueued': 4, 
'scheduler/enqueued/memory': 4, 
'start_time': datetime.datetime(2016, 10, 22, 4, 56, 21, 180030)} 
2016-10-21 21:56:23 [scrapy] INFO: Spider closed (finished) 

Answer

FormRequest.from_response(response) uses the first form it finds by default. If you inspect what forms the page has, you will see:

In : response.xpath("//form") 
Out: 
[<Selector xpath='//form' data='<form action="/user/signup" method="post'>, 
<Selector xpath='//form' data='<form action="/user/login" method="POST"'>, 
<Selector xpath='//form' data='<form action="/user/login" method="post"'>] 

So the form you are looking for is not the first one. The way to fix it is to use one of from_response's many arguments to specify which form to use.
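
For instance (a sketch beyond the original answer), formnumber can pick a form by position; assuming the form order shown above, index 1 is the login form, though this breaks if the page ever reorders its forms:

In : FormRequest.from_response(response, formnumber=1)  # second form on the page
Out: <POST https://www.crowdfunder.com/user/login>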

Using formxpath is the most flexible option and my personal favorite:

In : FormRequest.from_response(response, formxpath='//*[contains(@action,"login")]') 
Out: <POST https://www.crowdfunder.com/user/login> 
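
Applied to the spider from the question, the fix is a single extra argument. A minimal sketch of the adjusted login callback (same placeholder credentials as the question):

    def login(self, response):
        """Generate a login request aimed at the login form."""
        self.payload = {'email': 'my_email',
                        'password': 'my_password'}
        # formxpath selects the form whose action contains "login" instead
        # of the default first form (the signup form on this page).
        return scrapy.FormRequest.from_response(
            response,
            formxpath='//*[contains(@action, "login")]',
            formdata=self.payload,
            callback=self.check_login_response)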

Awesome! Thanks for your help! I inspected the /user/login page but did not find any form tags. It seems all the forms are on the home page. –


@Bowen Liu can you clarify? The 'user/login' page seems to redirect to itself twice, and then it contains the 3 forms I listed in my answer. The second form contains all of the input fields and is the one FormRequest should use. – Granitosaurus


Yes, it uses the second form. When using "https://www.crowdfunder.com/user/login" as self.login_page, from_response did not find any form items via response.xpath("//form"), but I found all three of your form items when using the home page "https://www.crowdfunder.com", and the login went through. –