
How can I convert relative paths to absolute paths with my Scrapy CrawlSpider? I'm new to Scrapy, and I'm currently trying to write a CrawlSpider to crawl a forum on the Tor darknet. Here is my current CrawlSpider code:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 

class HiddenAnswersSpider(CrawlSpider): 
    name = 'ha' 
    start_urls = ['http://answerstedhctbek.onion/questions'] 
    allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion'] 
    rules = (
      Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'), 
      Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath') 

      ) 

def makeAbsolutePath(links): 
    for i in range(links): 
      links[i] = links[i].replace("../","") 
    return links 

Since the forum uses relative paths, I tried to create a custom process_links callback to strip the "../", but when I run my code I'm still receiving:

2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed 
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions) 
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed 
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions) 
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed 
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions) 
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed 
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes> (referer: http://answerstedhctbek.onion/questions) 

As you can see, I'm still getting 400 errors because the paths are wrong. Why isn't my code removing the "../" from the links?

Thanks!

Answer


The problem is probably that makeAbsolutePath is not part of the spider class. The documentation states:

process_links is a callable, or a string (in which case a method from the spider object with that name will be used)

You did not use self with makeAbsolutePath, so I assume it is not just an indentation mistake. makeAbsolutePath also has some other errors. If we correct the code to this state:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 


class HiddenAnswersSpider(CrawlSpider): 
    name = 'ha' 
    start_urls = ['file:///home/user/testscrapy/test.html'] 
    allowed_domains = [] 
    rules = (
      Rule(LinkExtractor(allow=(r'.*')), follow=True, process_links='makeAbsolutePath'), 
      ) 

    def makeAbsolutePath(self, links):
        print(links)
        for i in range(links):
            links[i] = links[i].replace("../", "")
        return links

it produces this error:

TypeError: 'list' object cannot be interpreted as an integer 

This is because len() was not called on the list before it was passed to range, and range only operates on integers. It expects a number and will give you the range from 0 to that number minus 1.
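A minimal illustration of the difference (plain Python, no Scrapy needed; the URLs are just placeholders):

```python
# range() needs an integer, so pass len(links), not the list itself.
links = ["http://example.com/../a", "http://example.com/../b"]

# for i in range(links): ...
#   -> TypeError: 'list' object cannot be interpreted as an integer

for i in range(len(links)):  # range(2) yields 0, 1
    links[i] = links[i].replace("../", "")

print(links)
```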

After fixing this, it gives the error:

AttributeError: 'Link' object has no attribute 'replace' 

This is because, unlike what you assumed, links is not a list of strings containing the contents of the href="" attribute. Instead, it is a list of Link objects.
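To illustrate, here is a simplified stand-in for Scrapy's Link class (the real one lives in scrapy.link and also carries text, fragment and nofollow attributes); the point is that the URL string sits in the .url attribute, so replace() must be called on that, not on the Link object itself:

```python
# Simplified stand-in for scrapy.link.Link (only the attributes used here).
class Link:
    def __init__(self, url, text=""):
        self.url = url
        self.text = text

link = Link("http://answerstedhctbek.onion/../badges", "Badges")

# link.replace(...) -> AttributeError: 'Link' object has no attribute 'replace'
# The URL string is stored in the .url attribute instead:
fixed = link.url.replace("../", "")
print(fixed)
```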

I suggest you print the contents of links inside makeAbsolutePath and check whether you need to do anything at all. In my opinion, even though the site uses the .. operator without an actual folder level (since the URL is /questions and not /questions/), scrapy should already stop resolving the .. operator once it reaches the domain level, so your links should point to http://answerstedhctbek.onion/<number>/<title>.

Something like this:

    def makeAbsolutePath(self, links):
        for i in range(len(links)):
            print(links[i].url)

        return []

(Returning an empty list here has the advantage that the spider will stop, so you can inspect the console output.)

If you then find that the URLs really are wrong, you can manipulate them via the url attribute:

links[i].url = 'http://example.com' 
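Putting the pieces together, a corrected makeAbsolutePath might look like the sketch below. It mirrors the string replacement from the question, applied to each Link's url attribute; the Link class here is only a minimal stand-in for scrapy.link.Link, and inside the spider the method would of course take self as its first parameter:

```python
class Link:
    """Minimal stand-in for scrapy.link.Link (only the url attribute)."""
    def __init__(self, url):
        self.url = url

def makeAbsolutePath(links):
    # Link objects have no replace() method, but their .url strings do,
    # so rewrite each Link's url attribute in place.
    for link in links:
        link.url = link.url.replace("../", "")
    return links

links = makeAbsolutePath([Link("http://answerstedhctbek.onion/../questions?sort=hot")])
print(links[0].url)
```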

Aufziehvogel, it finally works, thank you! I couldn't reproduce any of the errors you mentioned above until I added 'self' as a parameter to makeAbsolutePath. So adding 'self', together with all the other fixes you mentioned, solved the problem. The URLs were still wrong, but I could simply include links[i].url = links[i].url.replace('../','') – ToriTompkins