我是Scrapy的新手,目前我正在尝试编写一个CrawlSpider来抓取Tor darknet上的论坛。目前我CrawlSpider代码:如何使用我的scrapy CrawlSpider将相对路径转换为绝对路径?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class HiddenAnswersSpider(CrawlSpider):
name = 'ha'
start_urls = ['http://answerstedhctbek.onion/questions']
allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion']
rules = (
Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'),
Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath')
)
def makeAbsolutePath(links):
for i in range(links):
links[i] = links[i].replace("../","")
return links
由于论坛使用相对路径,我试图创建一个自定义process_links去掉“../”但是当我跑我的代码我仍然recieving:
2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes> (referer: http://answerstedhctbek.onion/questions)
正如您所见,由于路径不正确,我仍然收到400个错误。为什么我的代码不能从链接中删除“../”?
谢谢!
Aufziehvogel,它终于正常工作,谢谢你!直到我在makeAbsolutePath中添加'self'作为参数之前,我无法收到上面提到的任何错误。因此,添加“自己”,包括您提到的所有其他解决方案已解决了这个问题网址仍然是错误的,但我可以简单地包含链接[i] .url = links [i] .url.replace('../','') – ToriTompkins