2015-07-20

Scrapy: recursively crawl to user-defined pages

This is probably easy for experienced users, but I am new to scrapy. What I want is a spider that crawls down to pages the user specifies. Right now I am trying to modify the allow pattern in the rule, but the change does not seem to take effect. Currently my code, abstracted, is:

class MySpider(CrawlSpider):

    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/alpha"]
    pattern = r"/[\d]+$"
    rules = [
        Rule(LinkExtractor(allow=[pattern], restrict_xpaths=('//*[@id="imgholder"]/a',)),
             callback='parse_items', follow=True),
    ]

    def __init__(self, argument='', *a, **kw):
        super(MySpider, self).__init__(*a, **kw)

        # some inputs and operations based on those inputs

        i = str(raw_input())  # another input

        # need to change the pattern here
        self.pattern = '/' + i + self.pattern

        # some other operations

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        img = hxs.select('//*[@id="imgholder"]/a')
        item = MyItem()
        item["field1"] = "something"
        item["field2"] = "something else"
        yield item

Now suppose the user enters i=2. I want the spider to crawl URLs ending with /2/*some number*, but what actually happens is that it crawls anything matching /*some number*. The update does not seem to propagate. I am using scrapy version 1.0.1.
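To make the target concrete: with i=2 the combined pattern should be /2/[\d]+$, which matches a URL ending in /2/<number> but not one ending in /<number> alone. A quick check with Python's re module (the URLs below are made up for illustration):

```python
import re

base = r"/[\d]+$"            # the class-level pattern
combined = '/' + '2' + base  # what __init__ builds when the user enters 2

# the combined pattern only matches the /2/<number> form
print(bool(re.search(combined, "http://www.example.com/alpha/2/47")))  # True
print(bool(re.search(combined, "http://www.example.com/alpha/47")))    # False
```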

Is there any way to do this? Thanks in advance.

Answer


By the time your __init__ method is called, the Rule has already been built with the pattern that was defined at class level.
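The root cause can be shown without scrapy at all: the class body, including rules, is evaluated once when the class is created, so reassigning self.pattern later does not touch the already-built rule list. A minimal sketch:

```python
class Demo(object):
    pattern = r"/[\d]+$"
    rules = [pattern]  # evaluated once, at class-creation time

d = Demo()
d.pattern = '/2' + d.pattern  # only rebinds the instance attribute
print(d.rules)                # the rule list is unchanged
```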

However, you can change it dynamically inside the __init__ method. To do that, set the Rule again inside the method body and then recompile the rules (as shown below):

def __init__(self, argument='', *a, **kw):
    super(MySpider, self).__init__(*a, **kw)
    # set self.pattern here to what you need it to be
    MySpider.rules = [
        Rule(LinkExtractor(allow=[self.pattern], restrict_xpaths=('//*[@id="imgholder"]/a',)),
             callback='parse_items', follow=True),
    ]
    # now it is time to compile the new rules:
    super(MySpider, self)._compile_rules()

Thanks, it works. – helix


Hi! Would you mind helping with this one? http://stackoverflow.com/questions/31630771/scrapy-linkextractor-duplicating – yukclam9