I'm new to Scrapy and Python, and I'm trying to use Scrapy on a site that offers multiple search options.

I want to scrape a property registrar's website that uses a query-based search. Most of the examples I've seen deal with simple web pages, not searches driven by the FormRequest mechanism. The code I've written is below; everything is hard-coded. I'd like to be able to scrape the database by year or by county.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class SecondSpider(CrawlSpider):
    name = "second"

    '''
    def start_requests(self):
        # This is the search form; after submitting it, the link changes
        # to this form:
        # https://www.propertypriceregister.ie/website/npsra/PPR/npsra-ppr.nsf/PPR-By-Date?SearchView
        #     &Start=1
        #     &SearchMax=0
        #     &SearchOrder=4
        #     &Query=%5Bdt_execution_date%5D%3E=01/01/2010%20AND%20%5Bdt_execution_date%5D%3C01/01/2011
        #     &County=      # query field
        #     &Year=2010    # query field
        #     &StartMonth=  # query field
        #     &EndMonth=    # query field
        #     &Address=     # query field
        return [scrapy.FormRequest(
            "https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm",
            formdata={'user': 'john', 'pass': 'secret'},
            callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
    '''

    allowed_domains = ["www.propertypriceregister.ie"]
    start_urls = ('https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm',)

    rules = (
        Rule(SgmlLinkExtractor(allow='/website/npsra/PPR/npsra-ppr.nsf/PPR-By-Date?SearchView&Start=1&SearchMax=0&SearchOrder=4&Query=%5Bdt_execution_date%5D%3E=01/01/2010%20AND%20%5Bdt_execution_date%5D%3C01/01/2011&County=&Year=2010&StartMonth=&EndMonth=&Address='),
             callback='parse',
             follow=True),
    )

    def parse(self, response):
        print(response)

Welcome to Stack Overflow! I'd suggest spending more time formatting your question properly: it was submitted with terrible indentation and (still) contains a lot of redundant/commented-out code. If you can't be bothered to work on the question, nobody will bother to answer it! – Rejected


Thank you for your feedback, Rejected. I'll make sure to format the question properly and put in the effort needed to make it clear. Apologies for the poor quality of the question. – user3607004

Answer


Before you start, re-read how Rule objects work. At the moment your rule would match one specific URL that the site will never present as a link (because it's the target of a form post).

Next, don't override CrawlSpider's parse function (in fact, don't use it at all). It's used internally by CrawlSpider to implement its functionality (see the warning at the link I provided for more details).

You'll need a FormRequest to be issued for each combination of the elements, something like this (note: untested, but it should work):

import itertools

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SecondSpider(CrawlSpider):
    name = 'second'
    allowed_domains = ['propertypriceregister.ie', 'www.propertypriceregister.ie']

    rules = (
        Rule(LinkExtractor(allow=('/eStampUNID/UNID-',)), callback='parse_search'),
    )

    def start_requests(self):
        years = [2010, 2011, 2012, 2013, 2014]
        counties = ['County1', 'County2']
        # One FormRequest per (county, year) pair; dont_filter stops
        # Scrapy from deduplicating requests to the identical form URL.
        for county, year in itertools.product(counties, years):
            yield scrapy.FormRequest(
                "https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm",
                formdata={'County': county, 'Year': str(year)},
                dont_filter=True)

    def parse_search(self, response):
        # Parse the search results page here
        pass
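Note that formdata values must be strings, which is why the year is passed through str() above; Scrapy url-encodes them into the form body. The dont_filter=True flag keeps the duplicate filter from dropping the repeated posts to the same URL.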

From this point on, your rules will be applied to every page that comes back from the FormRequests, to pull URLs out of them. If you want to grab anything from the initial URLs themselves, override CrawlSpider's parse_start_url method.
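As a minimal sketch of that last point (untested, and the XPath is only a placeholder, since the markup of the PPR start page isn't shown here), a parse_start_url override added to the spider above could look like this:

    def parse_start_url(self, response):
        # CrawlSpider calls this for each response from start_urls before
        # the rules are applied; yield items or further Requests from here.
        for row in response.xpath('//table//tr'):  # placeholder selector
            yield {'raw_row': row.extract()}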


Thanks for your answer, I'll implement it and post the results. – user3607004