XPath: Nth TD within TR

I'm scraping the following page: http://graphics.stltoday.com/apps/payrolls/salaries/teachers/detail/25074/. I'm trying to get every value from the table (salary, position, year, district, etc.). When I test this in the scrapy shell, all the values show up when I use response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract().
However, when I do the same thing in my spider, only the first element (district) appears. Any suggestions?
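For reference, the XPath above can be checked offline against a minimal table. This is just a sketch with hypothetical markup (the real page's rows will differ), and it assumes lxml is available:

```python
from lxml import html

# Hypothetical table mimicking the salary detail page layout.
doc = html.fromstring("""
<table>
  <tr><th scope="row">District</th><td>Affton 101</td></tr>
  <tr><th scope="row">Position</th><td>Teacher</td></tr>
  <tr><th scope="row">Salary</th><td>$50,000</td></tr>
</table>
""")

# Same XPath as in the question: the first <td> following each row-header <th>.
values = doc.xpath('//th[@scope="row"]/following-sibling::td[1]/text()')
print(values)  # ['Affton 101', 'Teacher', '$50,000']
```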
Spider code (ideally, each element would end up in its own variable for clean output):
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider2(CrawlSpider):
    # name of the spider
    name = 'stlteacher'

    # list of allowed domains
    allowed_domains = ['graphics.stltoday.com']

    # starting url for scraping
    start_urls = ['http://graphics.stltoday.com/apps/payrolls/salaries/teachers/']

    rules = [
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/[0-9]+/$']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/[0-9]+/position/[0-9]+/$']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/detail/[0-9]+/$']),
            callback='parse_item',
            follow=True),
    ]

    # setting the location of the output csv file
    custom_settings = {
        'FEED_FORMAT': "csv",
        'FEED_URI': 'tmp/stlteachers3.csv'
    }

    def parse_item(self, response):
        # Remove XML namespaces
        response.selector.remove_namespaces()

        # Extract article information
        url = response.url
        name = response.xpath('//p[@class="table__title"]/text()').extract()
        district = response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract()

        for item in zip(name, district):
            scraped_info = {
                'url': url,
                'name': item[0],
                'district': item[1],
            }
            yield scraped_info
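A likely cause of the symptom: the page has only one table title, so `name` is a one-element list while the sibling-`td` XPath returns many cells, and `zip` stops at the shorter sequence. The hypothetical values below illustrate the effect, plus a sketch of a workaround that zips the row headers with the cells instead (field names here are assumptions, not taken from the live page):

```python
# Hypothetical extracted lists: one page title, several table cells.
name = ['Jane Doe']
cells = ['Affton 101', 'Teacher', '$50,000', '2016']

# zip() stops at the shorter list, so only the first cell survives --
# matching the symptom of seeing just the district in the output.
truncated = list(zip(name, cells))
print(truncated)  # [('Jane Doe', 'Affton 101')]

# Workaround sketch: also extract the row headers (e.g. via
# //th[@scope="row"]/text()) and zip headers with cells so every
# value lands in its own field of a single item.
headers = ['district', 'position', 'salary', 'year']
item = {'url': 'http://example.com/detail/25074/', 'name': name[0]}
item.update(zip(headers, cells))
print(item['salary'])  # $50,000
```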
Could it be that some of the pages you are crawling only have one value in that field? – Granitosaurus