Scrapy recursive crawl to user-defined pages

This is probably easy for experienced users, but I am new to Scrapy, and what I want is a spider that crawls to user-defined pages. Right now I am trying to modify the allow pattern in __init__, but it does not seem to work. Currently, my code, abstracted, is:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector

class MySpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/alpha"]
    pattern = "/[\d]+$"
    rules = [
        Rule(LinkExtractor(allow=[pattern], restrict_xpaths=('//*[@id = "imgholder"]/a',)),
             callback='parse_items', follow=True),
    ]

    def __init__(self, argument='', *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        # some inputs and operations based on those inputs
        i = str(raw_input())  # another input
        # need to change the pattern here
        self.pattern = '/' + i + self.pattern
        # some other operations

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        img = hxs.select('//*[@id="imgholder"]/a')
        item = MyItem()
        item["field1"] = "something"
        item["field2"] = "something else"
        yield item
Now suppose the user enters i = 2, so I want to follow URLs ending with /2/*some number*, but what actually happens is that the spider still crawls the pattern /*some number*. The update does not seem to propagate. I am using Scrapy version 1.0.1.

Any ideas? Thanks in advance.
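The likely cause is a binding issue rather than a Scrapy bug: the rules list, and the LinkExtractor inside it, are built once when the class body is executed, and CrawlSpider.__init__ compiles them via _compile_rules. Reassigning self.pattern afterwards never reaches the already-constructed LinkExtractor, so the fix is to rebuild the Rule with the new pattern before calling super().__init__. Below is a minimal, Scrapy-free sketch of the same binding issue; LinkFilter is a hypothetical stand-in for LinkExtractor, not a real Scrapy class:

```python
import re

class LinkFilter:
    # stands in for scrapy's LinkExtractor: the regex is fixed at construction time
    def __init__(self, allow):
        self.allow = re.compile(allow)

    def matches(self, url):
        return bool(self.allow.search(url))

class Spider:
    pattern = r"/\d+$"
    # class-level rule, built with the class-level pattern when the class body runs
    rule = LinkFilter(allow=pattern)

    def __init__(self, category):
        # reassigning self.pattern does NOT touch the already-built class-level rule...
        self.pattern = '/' + category + Spider.pattern
        # ...so the rule must be rebuilt with the new pattern:
        self.rule = LinkFilter(allow=self.pattern)

s = Spider('2')
assert s.rule.matches('http://www.example.com/alpha/2/123')
assert not s.rule.matches('http://www.example.com/alpha/3/123')
# the stale class-level rule still matches any trailing number:
assert Spider.rule.matches('http://www.example.com/alpha/999')
```

Applied to the spider above, this means building self.rules (or reassigning the class-level rules) with the user-supplied pattern inside __init__ and only then calling super(MySpider, self).__init__(*a, **kw), so that _compile_rules sees the updated Rule.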
Thanks, it works. – helix
Hi! Would you mind helping with this one? http://stackoverflow.com/questions/31630771/scrapy-linkextractor-duplicating – yukclam9