2015-11-20

Scrapy: can't scrape a site

I am crawling the following site with Scrapy and scraping some information from it:

http://www.glassdoor.com/Job/jobs.htm?suggestCount=4&suggestChosen=true&clickSource=searchBtn&typedKeyword=data+scien&headSiteSrch=%2FJob%2Fjobs.htm&sc.keyword=data+scientist&locT=&locId=

Here are my goals:

  1. Go to each results page
  2. On each page, collect all the result links
  3. Follow each link from #2 and scrape its data

I can do all three, but I'm stuck on some of the data. As an example, here is a link to one of the pages I want to scrape:

http://www.glassdoor.com/job-listing/lead-data-scientist-director-of-data-science-marketing-cloud-platform-affinity-solutions-JV_IC1147436_KO0,69_KE70,88.htm?jl=1537438396

I am able to scrape the job title, company name, and location at the top of the page using the following XPaths:

item['Company'] = response.xpath('//span[@class = "ib"]/text()').extract() 
item['jobTitle'] = response.xpath('//div[@class = "header cell info"]/h2/text()').extract() 
item['Location'] = response.xpath('//span[@class = "subtle ib"]/text()').extract() 
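XPath expressions like these can be sanity-checked offline against a saved copy of the markup. A minimal sketch using only the standard library; the HTML fragment below is made up to mimic the job-header structure, not taken from Glassdoor:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the job-header markup targeted above.
html = """
<div class="header cell info">
  <h2>Lead Data Scientist</h2>
  <span class="ib">Affinity Solutions</span>
  <span class="subtle ib">New York, NY</span>
</div>
"""

root = ET.fromstring(html.strip())

# ElementTree equivalents of the Scrapy XPaths, relative to the fragment root.
company = root.find(".//span[@class='ib']").text
title = root.find("h2").text
location = root.find(".//span[@class='subtle ib']").text

print(company, "|", title, "|", location)
# → Affinity Solutions | Lead Data Scientist | New York, NY
```

Note that ElementTree's `@class` predicate matches the attribute value exactly, which is why `span[@class='ib']` does not also catch the `"subtle ib"` span.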

However, I can't get anything from the "Company Info" section. Here is my code to scrape the website, size, headquarters, and industry:

item['Website'] = response.xpath('//div[@id="InfoDetails"]/div[1]/span[@class = "empData website"]/a/@href').extract() 
item['HQ'] = response.xpath('//div[@id="InfoDetails"]/div[2]/span[@class = "empData"]/text()').extract() 
item['Size'] = response.xpath('//div[@id="InfoDetails"]/div[3]/span[@class = "empData"]/text()').extract() 
item['Industry'] = response.xpath('//div[@id="InfoDetails"]/div[6]/span/tt/text()').extract() 

I have no idea why these last four XPaths don't work.

Thanks for your help.


The page you're scraping is dynamic (it has to be rendered by a JavaScript engine). Scrapy only sees the plain page source. – kev


@kev is right: the page makes an XHR call to 'http://www.glassdoor.com/Overview/companyOverviewBasicInfoAjax.htm?&employerId=20496&title=Company+Info&linkCompetitors=true' to load the additional company information. The '20496' id can be found in the page's HTML source. –
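Following this comment, the employer id could be pulled out of the static page source with a regular expression and used to build the AJAX URL directly, without a JavaScript renderer. A sketch; the HTML excerpt and the `data-emp-id` attribute are assumptions for illustration, and the id's actual location in Glassdoor's markup may differ:

```python
import re

# Made-up excerpt standing in for the job page's raw HTML,
# which embeds the employer id somewhere in its source.
page_source = '<div class="companyInfo" data-emp-id="20496">Affinity Solutions</div>'

match = re.search(r'data-emp-id="(\d+)"', page_source)
if match:
    employer_id = match.group(1)
    # Endpoint taken from the comment above.
    ajax_url = (
        "http://www.glassdoor.com/Overview/companyOverviewBasicInfoAjax.htm"
        "?&employerId={}&title=Company+Info&linkCompetitors=true".format(employer_id)
    )
    print(ajax_url)
    # In a spider, this URL would be passed to scrapy.Request with a
    # callback that parses the returned company-info HTML fragment.
```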

Answer


I know I'm very late, but in case someone else needs it: Glassdoor generates these attributes dynamically, so I used Splash requests to handle them. Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

id = 1

class GlassdoorData(scrapy.Spider):
    name = 'glassdoordata'
    start_urls = ['https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11.htm']

    def start_requests(self):
        # Render each start URL through Splash so JavaScript content loads.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 10})

    def parse(self, response):
        # Collect every job link on the results page and follow it.
        urls = response.css('li.jl > div > div.flexbox > div > a::attr(href)').extract()
        for url in urls:
            url = "https://www.glassdoor.ca" + url
            yield SplashRequest(url=url, callback=self.parse_details, args={'wait': 10})

        # Build the next results page from the page counter.
        global id
        id = id + 1
        next_page_url = "https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11_IP{}.htm".format(id)
        yield SplashRequest(url=next_page_url, callback=self.parse, args={'wait': 10})

    def parse_details(self, response):
        yield {
            'Job_Title': response.css('div.header.cell.info > h2::text').extract_first(),
            'Company': response.css('div.header.cell.info > span.ib::text').extract_first(),
            'Location': response.css('div.header.cell.info > span.subtle.ib::text').extract_first(),
            'Website': response.xpath("//div[@class = 'infoEntity']/span/a/text()").extract(),
            'Size': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Size')]/following-sibling::span/text()").extract(),
            'Industry': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Industry')]/following-sibling::span/text()").extract_first().lstrip(),
            'Type': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Type')]/following-sibling::span/text()").extract_first().lstrip(),
            'Revenue': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Revenue')]/following-sibling::span/text()").extract_first().lstrip(),
            'Competitors': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Competitors')]/following-sibling::span/text()").extract_first().lstrip(),
        }
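One caveat with the snippet above: `extract_first()` returns `None` when a selector matches nothing, so the chained `.lstrip()` calls raise `AttributeError` on listings that lack a field (for example, no Competitors). A small guard helper avoids that; this is a sketch, not part of the original answer:

```python
def clean(value):
    """Strip leading whitespace, tolerating a missing (None) value."""
    return value.lstrip() if value is not None else None

# The dict above could then use, for example:
#   'Industry': clean(response.xpath(...).extract_first()),
print(clean("  Marketing"))   # a present field, prints "Marketing"
print(clean(None))            # an absent field stays None
```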

Edit settings.py like this:

BOT_NAME = 'glassdoordata'

SPIDER_MODULES = ['glassdoordata.spiders']
NEWSPIDER_MODULE = 'glassdoordata.spiders'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://192.168.99.100:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Ignore robots.txt rules
ROBOTSTXT_OBEY = False

You need to install and run Splash before running this spider.

Thanks.