Initial problem

I'm writing unit tests for a method of a CrawlSpider class (from the scrapy library) that relies on a lot of scrapy's asynchronous magic to work. Here it is, stripped down:

from bs4 import BeautifulSoup
from scrapy import Item
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    rules = [Rule(LinkExtractor(allow='myregex'), callback='parse_page')]
    # some other class attributes

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.response = None
        self.loader = None

    def parse_page_section(self):
        soup = BeautifulSoup(self.response.body, 'lxml')
        # Complicated scraping logic using BeautifulSoup
        self.loader.add_value(mykey, myvalue)  # mykey/myvalue are placeholders

    # more methods parsing other sections of the page
    # also using self.response and self.loader

    def parse_page(self, response):
        self.response = response
        self.loader = ItemLoader(item=Item(), response=response)
        self.parse_page_section()
        # call other methods to collect more stuff
        return self.loader.load_item()  # return the item so scrapy receives it

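The class attribute rules tells my spider to follow certain links and to jump to a callback function once a web page has been downloaded. My goal is to test the parsing method called parse_page_section without running the crawler, or even making real HTTP requests.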

What I've tried

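Instinctively, I turned to the mock library. I understand how you mock a function to test whether it has been called (with which arguments, and whether there are any side effects...), but that's not what I want. I want to instantiate a fake MySpider object and assign just enough attributes to be able to call the parse_page_section method on it.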

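In the example above, I need a response object to instantiate my ItemLoader and, specifically, a self.response.body attribute to instantiate my BeautifulSoup. In principle, I could build fake objects like this: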

from argparse import Namespace

my_spider = MySpider()  # assumes `name` is set among the class attributes
my_spider.response = Namespace(body='<html>...</html>')

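That works well for the BeautifulSoup class, but I would need to add more attributes to create an ItemLoader object. For more complex situations, it would become ugly and unmanageable.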
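One way the fake-object idea could be pushed a little further is to replace the loader with a `Mock`, so that no real ItemLoader (and none of its extra attributes) is needed. The sketch below is only an illustration under stated assumptions: `StandInSpider` is a hypothetical stand-in for MySpider that uses nothing but the standard library, and the regex lookup is a placeholder for the real BeautifulSoup logic.

```python
import re
from argparse import Namespace
from unittest.mock import Mock


class StandInSpider:
    """Hypothetical stand-in for MySpider: same attributes, no scrapy dependency."""

    def __init__(self):
        self.response = None
        self.loader = None

    def parse_page_section(self):
        # Placeholder for the real BeautifulSoup logic: extract the <title> text.
        match = re.search(r"<title>(.*?)</title>", self.response.body)
        self.loader.add_value("title", match.group(1) if match else None)


spider = StandInSpider()
# Fake response: only the `body` attribute that the parsing method reads.
spider.response = Namespace(body="<html><head><title>Hello</title></head></html>")
# Fake loader: a Mock records calls instead of building a real ItemLoader.
spider.loader = Mock()

spider.parse_page_section()

# The test asserts on what the method tried to store, not on a loaded item.
spider.loader.add_value.assert_called_once_with("title", "Hello")
```

The trade-off is that this checks the calls made on the loader rather than the final item, so it says nothing about how a real ItemLoader would process those values.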

My question

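Is this the right approach at all? I can't find similar examples on the web, so I think my approach may be wrong at a more fundamental level. Any insight would be greatly appreciated.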

Source

2016-04-28 cyberbikepunk

@ChrisP Thanks for your edit. I didn't put the scrapy tag first because I thought the question was about unit testing in general. – cyberbikepunk

It's definitely about unit testing, but people who do a lot of scraping may have some unique insight into unit testing scrapers. – ChrisP

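In this particular 'CrawlSpider' case, I could get away with faking the response object. That's hard to do by hand, but would this help? http://requests-mock.readthedocs.io/en/latest/overview.html. Would that be a good approach? – cyberbikepunk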

Have you seen Spiders Contracts?

This allows you to test each callback of your spider without requiring a lot of code. For example:

def parse(self, response): 
    """ This function parses a sample response. Some contracts are mingled 
    with this docstring. 

    @url http://www.amazon.com/s?field-keywords=selfish+gene 
    @returns items 1 16 
    @returns requests 0 0 
    @scrapes Title Author Year Price 
    """

Use the check command to run the contract checks.

Take a look at this answer if you want something bigger.

Source

2016-04-28 15:27:38

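I guess this makes sense: since the website itself can change, replace unit tests with *real life* (integration) tests. Essentially, your unit tests passing doesn't guarantee that your scraper works. Thanks for the suggestion. – cyberbikepunk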

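There's still value in unit tests though, at least as a sanity check while coding. The other answer you pointed to (http://stackoverflow.com/questions/6456304/scrapy-unit-testing/12741030#12741030) shows how to fake a response object better by actually using scrapy's 'Request' and 'Response' objects. Good tip. – cyberbikepunk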
