
Scrapy spider login problem

I am a beginner with Scrapy, and I have run into this problem while trying to log in: I simply put all of the form data into a FormRequest.

My code:

from scrapy.http import Request, FormRequest
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider

class login_spider(CrawlSpider):
    name = 'login_spider'

    FORM = {
        "_xsrf": "776a978b48e9e828a939c096ae9b787e",
        "password": "...",
        "captcha_type": "cn",
        "email": "...",
    }

    COOKIES = {
        "q_c1": "201afdf74fab4f538d15fd8726c1fe14|1480730632000|1480730632000",
        "_xsrf": "776a978b48e9e828a939c096ae9b787e",
        "l_cap_id": "MDE2MzhmNGUwN2FjNDA1ZTk3NDc5ZDZkZmJhMzM3Y2M=|1480730632|83da14e1526864adfa6e0bec5a9f49bf46f8c460",
        "cap_id": "OGY2MWMzODIxY2VmNGQ4MGExOTk4N2UwNzU1OWFlYzM=|1480730632|77b6eaaca21f9c96ecfa5d5c9832e34dc2e401e0",
        "d_c0": "ADDCXsSu8AqPTuqHLcmhlUeOsUY-UBuyRL0=|1480730633",
        "r_cap_id": "Mjg0YTg2NTcxMjAxNDU2YTljZGNhMjQ1MzVlMjE4ZmI=|1480730633|cd2007eb5d1c6939ac1954b79b83f0d7b5d9e937",
        "_zap": "57aed33d-98b6-4e98-bad4-71581265abde",
        "__utmt": 1,
        "__utma": "51854390.1175567315.1480730634.1480730634.1480730634.1",
        "__utmb": "51854390.4.10.1480730634",
        "__utmc": "51854390",
        "__utmz": "51854390.1480730634.1.1.utmcsr=bing|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)",
        "__utmv": "51854390.000--|3=entry_date=20161203=1",
        "n_c": 1,
    }

    HEADERS = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.8",
        "Connection": "keep-alive",
        "Content-Length": "100",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Host": "www.zhihu.com",
        "Origin": "https://www.zhihu.com",
        "Referer": "https://www.zhihu.com/",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "X-Xsrftoken": "776a978b48e9e828a939c096ae9b787e",
    }

    def start_requests(self):
        return [Request(url="https://www.zhihu.com/#signin", callback=self.login)]

    def login(self, response):
        return [FormRequest(
            "https://www.zhihu.com/#signin",
            formdata=self.FORM,
            cookies=self.COOKIES,
            headers=self.HEADERS,
            callback=self.after_login,
            dont_filter=True,
        )]

    def after_login(self, response):
        print("================\n")
        print("=== LOG IN ===\n")
        print("================\n")

I took the form data from here; the email & password are randomly generated.

And this is the output I get:

2016-12-03 11:07:07 [scrapy] INFO: Spider opened 
2016-12-03 11:07:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-12-03 11:07:07 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024 
2016-12-03 11:07:07 [scrapy] DEBUG: Crawled (200) <GET https://www.zhihu.com/robots.txt> (referer: None) 
2016-12-03 11:07:07 [scrapy] DEBUG: Crawled (200) <GET https://www.zhihu.com/#signin> (referer: None) 
2016-12-03 11:07:07 [scrapy] DEBUG: Crawled (400) <POST https://www.zhihu.com/#signin> (referer: https://www.zhihu.com/) ['partial'] 
2016-12-03 11:07:08 [scrapy] DEBUG: Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed 
2016-12-03 11:07:08 [scrapy] INFO: Closing spider (finished) 

Then I thought about it and added these lines to settings.py:

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
RETRY_ENABLED = True
RETRY_HTTP_CODES = [400, 403, 500]
RETRY_TIMES = 2
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

But I still get the same error, and I don't know what else to try. Which part did I get wrong, and how should I fix it?


It looks like your problem is indicated by 'Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed'. Take a look at [this question](http://stackoverflow.com/questions/32779766/auth-failing-999-http-status-code-is-not-handled-or-not-allowed). You will also want to inspect your request, since the 400 response suggests the request itself is the problem. – danielunderwood
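As the comment suggests, the body of that 400 response usually says what the server objected to. One way to see it (a sketch using Scrapy's standard handle_httpstatus_list spider attribute) is to let 400 responses reach the callback instead of being dropped, by adding this to the login_spider class above:

    # Inside login_spider: deliver 400 responses to our callbacks
    # instead of ignoring them, so the server's error can be inspected.
    handle_httpstatus_list = [400]

    def after_login(self, response):
        # On a failed login the body typically contains the error message.
        print(response.status, response.body[:500])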

Answer


A status code of 400 is sometimes returned when an invalid CSRF token is supplied. The CSRF token changes every time the page is visited, and it looks like you have hard-coded a static one. Your script needs to make an initial request to the page containing the login form, save the fresh CSRF token in a variable, and then log in with it.
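A minimal sketch of that flow, assuming the token sits in a hidden input named _xsrf (the field name from your form data) and that the form is POSTed to https://www.zhihu.com/login/email — that endpoint is an assumption, so verify the actual POST URL in your browser's network tab; posting to the #signin fragment URL will not work:

import scrapy
from scrapy.http import Request, FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login_spider'

    def start_requests(self):
        # Fetch the page that contains the login form first,
        # so the fresh CSRF token can be read from the response.
        return [Request("https://www.zhihu.com/#signin", callback=self.login)]

    def login(self, response):
        # Extract the current token instead of hard-coding a stale one.
        xsrf = response.xpath('//input[@name="_xsrf"]/@value').extract_first()
        return FormRequest(
            "https://www.zhihu.com/login/email",  # assumed endpoint; check devtools
            formdata={
                "_xsrf": xsrf,
                "email": "...",
                "password": "...",
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info("login response status: %s", response.status)

Alternatively, FormRequest.from_response(response, formdata={...}, callback=...) builds the POST from the form on the page and carries hidden fields such as _xsrf across automatically, so you never handle the token yourself. Note that Scrapy's cookie middleware also tracks session cookies for you, so the hard-coded COOKIES dict is unnecessary.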