我刚刚接触网络抓取,刚开始尝试使用Python编写的抓取框架Scrapy。我的目标是刮掉旧的雅虎集团,因为他们没有提供API或任何其他手段来检索邮件归档。雅虎组的设置使得您必须先登录才能查看存档。使用scrapy刮掉雅虎组的问题
我需要完成,我认为,这些步骤是:
- 登录到雅虎
- 访问的URL第一条消息和刮它
- 重复步骤2的下一条消息,等
我开始粗加工一个scrapy蜘蛛来完成上述工作,这里是我迄今为止所做的。我想观察的是,登录工作,我能够检索第一条消息。
class Sg101Spider(BaseSpider):
name = "sg101"
msg_id = 1 # current message to retrieve
max_msg_id = 21399 # last message to retrieve
def start_requests(self):
return [FormRequest(LOGIN_URL,
formdata={'login': LOGIN, 'passwd': PASSWORD},
callback=self.logged_in)]
def logged_in(self, response):
if response.url == 'http://my.yahoo.com':
self.log("Successfully logged in. Now requesting 1st message.")
return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
errback=self.error)
else:
self.log("Login failed.")
def parse_msg(self, response):
self.log("Got message!")
print response.body
def error(self, failure):
self.log("I haz an error")
当我虽然运行蜘蛛,我把它登录,并发出第一条消息的要求:一旦我得到这么多的工作,我会完成其余部分。但是,我在scrapy的调试输出中看到的所有内容都是3个重定向,最终到达我首先要求的URL。但scrapy不会调用我的parse_msg()
回调,并且爬行停止。这里是scrapy输出的片段:
2011-02-03 19:50:10-0600 [sg101] INFO: Spider opened
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (302) to <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> from <POST https://login.yahoo.com/config/login>
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: None)
2011-02-03 19:50:12-0600 [sg101] DEBUG: Successfully logged in. Now requesting 1st message.
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] INFO: Closing spider (finished)
2011-02-03 19:50:13-0600 [sg101] INFO: Spider closed (finished)
我无法理解这一点。它看起来像雅虎重新定向蜘蛛(也许用于验证检查?),但它似乎回到了我想要访问的网址。但是scrapy并没有调用我的回调函数,我也没有机会抓取数据或继续爬行。
有没有人有什么想法和/或如何进一步调试?谢谢!