2015-11-20

Scrapy: can't scrape a site

I am crawling the following site with Scrapy and scraping some information from it:

http://www.glassdoor.com/Job/jobs.htm?suggestCount=4&suggestChosen=true&clickSource=searchBtn&typedKeyword=data+scien&headSiteSrch=%2FJob%2Fjobs.htm&sc.keyword=data+scientist&locT=&locId=

Here are my goals:

  1. Go to each results page
  2. On each page, collect all the result links
  3. Follow each link from #2 and scrape its data

I can do all three, but I'm stuck on some of the data. As an example, here is a link to one of the pages I want to scrape:

http://www.glassdoor.com/job-listing/lead-data-scientist-director-of-data-science-marketing-cloud-platform-affinity-solutions-JV_IC1147436_KO0,69_KE70,88.htm?jl=1537438396

I am able to scrape the job title, company name, and location at the top of the page using the following XPaths:

item['Company'] = response.xpath('//span[@class = "ib"]/text()').extract() 
item['jobTitle'] = response.xpath('//div[@class = "header cell info"]/h2/text()').extract() 
item['Location'] = response.xpath('//span[@class = "subtle ib"]/text()').extract() 
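XPath expressions like these can be sanity-checked offline against a saved copy of the markup. A minimal sketch using only the standard library; the HTML fragment below is made up to mimic the job-header structure, not taken from Glassdoor:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the job-header markup targeted above.
html = """
<div class="header cell info">
  <h2>Lead Data Scientist</h2>
  <span class="ib">Affinity Solutions</span>
  <span class="subtle ib">New York, NY</span>
</div>
"""

root = ET.fromstring(html.strip())

# ElementTree equivalents of the Scrapy XPaths, relative to the fragment root.
company = root.find(".//span[@class='ib']").text
title = root.find("h2").text
location = root.find(".//span[@class='subtle ib']").text

print(company, "|", title, "|", location)
# → Affinity Solutions | Lead Data Scientist | New York, NY
```

Note that ElementTree's `@class` predicate matches the attribute value exactly, which is why `span[@class='ib']` does not also catch the `"subtle ib"` span.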

However, I can't get anything from the "Company Info" section. Here is my code to scrape the website, size, headquarters, and industry:

item['Website'] = response.xpath('//div[@id="InfoDetails"]/div[1]/span[@class = "empData website"]/a/@href').extract() 
item['HQ'] = response.xpath('//div[@id="InfoDetails"]/div[2]/span[@class = "empData"]/text()').extract() 
item['Size'] = response.xpath('//div[@id="InfoDetails"]/div[3]/span[@class = "empData"]/text()').extract() 
item['Industry'] = response.xpath('//div[@id="InfoDetails"]/div[6]/span/tt/text()').extract() 

I have no idea why these last four XPaths don't work.

Thanks for your help.


The page you're scraping is dynamic (it has to be rendered by a JavaScript engine). Scrapy only sees the plain page source. – kev


@kev is right: the page makes an XHR call to 'http://www.glassdoor.com/Overview/companyOverviewBasicInfoAjax.htm?&employerId=20496&title=Company+Info&linkCompetitors=true' to load the additional company information. The '20496' id can be found in the page's HTML source. –
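Following this comment, the employer id could be pulled out of the static page source with a regular expression and used to build the AJAX URL directly, without a JavaScript renderer. A sketch; the HTML excerpt and the `data-emp-id` attribute are assumptions for illustration, and the id's actual location in Glassdoor's markup may differ:

```python
import re

# Made-up excerpt standing in for the job page's raw HTML,
# which embeds the employer id somewhere in its source.
page_source = '<div class="companyInfo" data-emp-id="20496">Affinity Solutions</div>'

match = re.search(r'data-emp-id="(\d+)"', page_source)
if match:
    employer_id = match.group(1)
    # Endpoint taken from the comment above.
    ajax_url = (
        "http://www.glassdoor.com/Overview/companyOverviewBasicInfoAjax.htm"
        "?&employerId={}&title=Company+Info&linkCompetitors=true".format(employer_id)
    )
    print(ajax_url)
    # In a spider, this URL would be passed to scrapy.Request with a
    # callback that parses the returned company-info HTML fragment.
```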

Answer


I know I'm very late, but in case someone else needs it: Glassdoor generates these attributes dynamically, so I used Splash requests to handle them. Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

id = 1

class GlassdoorData(scrapy.Spider):
    name = 'glassdoordata'
    start_urls = ['https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11.htm']

    def start_requests(self):
        # Render each start URL through Splash so JavaScript content loads.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 10})

    def parse(self, response):
        # Collect every job link on the results page and follow it.
        urls = response.css('li.jl > div > div.flexbox > div > a::attr(href)').extract()
        for url in urls:
            url = "https://www.glassdoor.ca" + url
            yield SplashRequest(url=url, callback=self.parse_details, args={'wait': 10})

        # Build the next results page from the page counter.
        global id
        id = id + 1
        next_page_url = "https://www.glassdoor.ca/Job/canada-data-jobs-SRCH_IL.0,6_IN3_KE7,11_IP{}.htm".format(id)
        yield SplashRequest(url=next_page_url, callback=self.parse, args={'wait': 10})

    def parse_details(self, response):
        yield {
            'Job_Title': response.css('div.header.cell.info > h2::text').extract_first(),
            'Company': response.css('div.header.cell.info > span.ib::text').extract_first(),
            'Location': response.css('div.header.cell.info > span.subtle.ib::text').extract_first(),
            'Website': response.xpath("//div[@class = 'infoEntity']/span/a/text()").extract(),
            'Size': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Size')]/following-sibling::span/text()").extract(),
            'Industry': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Industry')]/following-sibling::span/text()").extract_first().lstrip(),
            'Type': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Type')]/following-sibling::span/text()").extract_first().lstrip(),
            'Revenue': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Revenue')]/following-sibling::span/text()").extract_first().lstrip(),
            'Competitors': response.xpath("//div[@class = 'infoEntity']/label[contains(text(),'Competitors')]/following-sibling::span/text()").extract_first().lstrip(),
        }
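One caveat with the snippet above: `extract_first()` returns `None` when a selector matches nothing, so the chained `.lstrip()` calls raise `AttributeError` on listings that lack a field (for example, no Competitors). A small guard helper avoids that; this is a sketch, not part of the original answer:

```python
def clean(value):
    """Strip leading whitespace, tolerating a missing (None) value."""
    return value.lstrip() if value is not None else None

# The dict above could then use, for example:
#   'Industry': clean(response.xpath(...).extract_first()),
print(clean("  Marketing"))   # a present field, prints "Marketing"
print(clean(None))            # an absent field stays None
```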

Edit settings.py like this:

BOT_NAME = 'glassdoordata'

SPIDER_MODULES = ['glassdoordata.spiders']
NEWSPIDER_MODULE = 'glassdoordata.spiders'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://192.168.99.100:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Ignore robots.txt rules
ROBOTSTXT_OBEY = False

You need to install and run Splash before running this spider.

Thanks.