2017-08-04

XPath: nth TD within a TR

I'm scraping the following page: http://graphics.stltoday.com/apps/payrolls/salaries/teachers/detail/25074/ and trying to pull every value from the table (salary, position, years, district, and so on). When I test from the scrapy shell, all of the values come back when I use response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract(). But when I run the same expression inside the crawler, only the first element (district) shows up. Any suggestions?
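The XPath expression itself is sound. A minimal sketch with lxml (standing in for the scrapy shell; the HTML below is invented to mimic the salary table, not copied from the real page) shows it matching every sibling cell:

```python
from lxml import html

# Hypothetical markup mimicking the stltoday salary table.
page = html.fromstring("""
<table>
  <tr><th scope="row">District:</th><td>Affton 101</td></tr>
  <tr><th scope="row">Salary:</th><td>$152,000.00</td></tr>
</table>
""")

# Same expression as in the question: the first <td> after each row header.
cells = page.xpath('//th[@scope="row"]/following-sibling::td[1]/text()')
print(cells)  # ['Affton 101', '$152,000.00']
```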

Crawler code (ideally, each element would end up in its own variable for clean output):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider2(CrawlSpider):
    # name of the spider
    name = 'stlteacher'

    # list of allowed domains
    allowed_domains = ['graphics.stltoday.com']

    # starting url for scraping
    start_urls = ['http://graphics.stltoday.com/apps/payrolls/salaries/teachers/']

    rules = [
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/[0-9]+/$']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/[0-9]+/position/[0-9]+/$']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/detail/[0-9]+/$']),
            callback='parse_item',
            follow=True),
    ]

    # setting the location of the output csv file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'tmp/stlteachers3.csv',
    }

    def parse_item(self, response):
        # Remove XML namespaces
        response.selector.remove_namespaces()

        # Extract article information
        url = response.url
        name = response.xpath('//p[@class="table__title"]/text()').extract()
        district = response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract()

        for item in zip(name, district):
            scraped_info = {
                'url': url,
                'name': item[0],
                'district': item[1],
            }
            yield scraped_info

Could it be that some of the pages you're crawling only have a single value in that field? – Granitosaurus

Answer


zip is a bit confusing here. If you want to scrape the whole table, you need to iterate over the table rows and pick up the column name and value from each one.
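The truncation in the original spider is standard zip() behavior: it stops at the shortest input. Since the page-title XPath matches one node while the table XPath matches many, only one pair survives. A pure-Python sketch with made-up values:

```python
# zip() stops at the shortest input, so the single-element
# name list caps the output at one pair.
name = ['Bracht, Nathan']                                    # one page title
district = ['Affton 101', 'Central Office', '$152,000.00']   # many table cells

pairs = list(zip(name, district))
print(pairs)  # [('Bracht, Nathan', 'Affton 101')]
```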

I got pretty good results with this code:

def parse_item(self, response):
    name = response.xpath('//p[@class="table__title"]/text()').extract_first()
    item = {
        'name': name,
        'url': response.url,
    }
    for row in response.xpath('//th[@scope="row"]'):
        row_name = row.xpath('text()').extract_first('').lower().strip(':')
        row_value = row.xpath('following-sibling::td[1]/text()').extract_first()
        item[row_name] = row_value
    yield item

This returns:

{ 
    'name': 'Bracht, Nathan', 
    'url': 'http://graphics.stltoday.com/apps/payrolls/salaries/teachers/detail/25074/', 
    'district': 'Affton 101', 
    'school': 'Central Office', 
    'position': 'Central Office Admin.', 
    'degree earned': 'Doct', 
    'salary': '$152,000.00', 
    'extended contract pay': None, 
    'extra duty pay': None, 
    'total pay (all combined)': '$152,000.00', 
    'years in district': '5', 
    'years in mo schools': '19', 
    'multiple position detail': None 
}
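As an aside, the key-cleaning chain in the answer can be checked in isolation: extract_first('') defaults to an empty string, so the chained string calls never hit None. The labels below are invented for illustration:

```python
# How the <th> labels become dict keys: lowercase, then strip the colon.
labels = ['District:', 'Salary:', 'Years In District:']
keys = [label.lower().strip(':') for label in labels]
print(keys)  # ['district', 'salary', 'years in district']
```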