
Working with POST request loads in Scrapy (Python)

I want to scrape articles from a website using Scrapy. My spider is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class mySpider(CrawlSpider):
    name = "mytest"
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com']

    # Follow article URLs of the form /YYYY/MM/slug and parse each post
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']),
             callback='parse_post', follow=True),
    ]

    def parse_post(self, response):
        item = PostItem()

        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        item['authors'] = response.xpath('//span[@class="author"]/text()').extract()

        return item

Everything works fine, but it only scrapes the articles linked from the page itself. The site loads more articles with a POST request, i.e. a 'load more articles' button. Is there any way I can simulate clicking that button so the spider loads the additional articles and continues scraping?

That depends on how the 'more articles' link actually works. Can you share the actual link to the site? – alecxe 2014-08-30 14:30:06

@alecxe It's ijreview.com – Anish 2014-08-30 14:30:33

Answers

The "Load More Articles" button is managed by JavaScript; clicking it fires an AJAX POST request.

In other words, this is something Scrapy cannot handle easily on its own.

But, if Scrapy is not a hard requirement, here is a solution using requests and BeautifulSoup (note the code is written for Python 2):

from bs4 import BeautifulSoup
import requests


# AJAX endpoint that the "Load More Articles" button posts to
url = "http://www.ijreview.com/wp-admin/admin-ajax.php"
session = requests.Session()
page_size = 24

# POST payload mimicking the button's AJAX request
params = {
    'action': 'load_more',
    'numPosts': page_size,
    'category': '',
    'orderby': 'date',
    'time': ''
}

offset = 0
limit = 100
while offset < limit:
    # Request the next "page" of articles
    params['offset'] = offset
    response = session.post(url, data=params)
    links = [a['href'] for a in BeautifulSoup(response.content).select('li > a')]
    for link in links:
        # Fetch each article and extract its title and author
        response = session.get(link)
        page = BeautifulSoup(response.content)
        title = page.find('title').text.strip()
        author = page.find('span', class_='author').text.strip()
        print {'link': link, 'title': title, 'author': author}

    offset += page_size

This prints:

{'author': u'Kevin Boyd', 'link': 'http://www.ijreview.com/2014/08/172770-president-obama-realizes-world-messy-place-thanks-social-media/', 'title': u'President Obama Calls The World A Messy Place & Blames Social Media for Making People Take Notice'} 
{'author': u'Reid Mene', 'link': 'http://www.ijreview.com/2014/08/172405-17-politicians-weird-jobs-time-office/', 'title': u'12 Most Unusual Professions of Politicians Before They Were Elected to Higher Office'} 
{'author': u'Michael Hausam', 'link': 'http://www.ijreview.com/2014/08/172653-video-duty-mp-fakes-surrender-shoots-hostage-taker/', 'title': u'Video: Off-Duty MP Fake Surrenders at Gas Station Before Revealing Deadly Surprise for Hostage Taker'} 
... 

You may need to tweak the code to support different categories, sorting, and so on. You can also speed up the HTML parsing by letting BeautifulSoup use lxml under the hood: instead of BeautifulSoup(response.content), use BeautifulSoup(response.content, "lxml"), but you would need to install lxml.
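For example, here is a minimal sketch of such a tweak, reusing url and session from the snippet above. The fetch_links helper is hypothetical, and the category/orderby values it forwards would need to match whatever the site's JavaScript actually sends:

def fetch_links(session, offset, category='', orderby='date', page_size=24):
    # Hypothetical helper: wraps one 'load more' POST so that the
    # category and sort order can be varied per call
    params = {
        'action': 'load_more',
        'numPosts': page_size,
        'offset': offset,
        'category': category,  # e.g. a category slug, if the endpoint accepts one
        'orderby': orderby,
        'time': ''
    }
    response = session.post(url, data=params)
    return [a['href'] for a in BeautifulSoup(response.content, "lxml").select('li > a')]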


And here is how you could adapt the same solution to Scrapy:

import urllib

from scrapy import Item, Field, Request, Spider


class PostItem(Item):
    # Item referenced by parse_post below
    url = Field()
    title = Field()
    authors = Field()


class mySpider(Spider):
    name = "mytest"
    allowed_domains = ['www.ijreview.com']

    def start_requests(self):
        page_size = 25
        # Headers mimicking the browser's AJAX request
        headers = {'User-Agent': 'Scrapy spider',
                   'X-Requested-With': 'XMLHttpRequest',
                   'Host': 'www.ijreview.com',
                   'Origin': 'http://www.ijreview.com',
                   'Accept': '*/*',
                   'Referer': 'http://www.ijreview.com/',
                   'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
        # Step through the first 200 posts, one POST request per page
        for offset in range(0, 200, page_size):
            yield Request('http://www.ijreview.com/wp-admin/admin-ajax.php',
                          method='POST',
                          headers=headers,
                          body=urllib.urlencode(
                              {'action': 'load_more',
                               'numPosts': page_size,
                               'offset': offset,
                               'category': '',
                               'orderby': 'date',
                               'time': ''}))

    def parse(self, response):
        # Each AJAX response is an HTML fragment listing article links
        for link in response.xpath('//ul/li/a/@href').extract():
            yield Request(link, callback=self.parse_post)

    def parse_post(self, response):
        item = PostItem()

        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0].strip()
        item['authors'] = response.xpath('//span[@class="author"]/text()').extract()[0].strip()

        return item

Output:

{'authors': u'Kyle Becker', 
'title': u'17 Reactions to the \u2018We Don\u2019t Have a Strategy\u2019 Gaffe That May Haunt the Rest of Obama\u2019s Presidency', 
'url': 'http://www.ijreview.com/2014/08/172569-25-reactions-obamas-dont-strategy-gaffe-may-haunt-rest-presidency/'} 

... 
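Assuming the spider and the PostItem definition are saved together in a single file, say ijreview_spider.py (the filename is arbitrary), you can run it without a full Scrapy project via the runspider command:

scrapy runspider ijreview_spider.py -o posts.json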
Looks good, but could you give me some idea of how to tie this in with Scrapy? – Anish 2014-08-30 14:50:31

@Ngeunpo Sure, I've added a sample Scrapy adaptation to the answer. Hope you can use it as a base and improve it. – alecxe 2014-08-30 15:10:29

@alecxe A subclass of Request worth mentioning is [FormRequest](http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests), which might also help in this case. – 2014-09-03 04:50:35
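For reference, a minimal sketch of how the start_requests method in the answer could be written with FormRequest instead of hand-encoding the body; FormRequest defaults to the POST method, url-encodes the formdata dict, and sets the Content-Type header itself (formdata values must be strings):

from scrapy.http import FormRequest

def start_requests(self):
    page_size = 25
    for offset in range(0, 200, page_size):
        # FormRequest handles the form encoding and Content-Type header,
        # so the manual urllib.urlencode call and header are unnecessary
        yield FormRequest('http://www.ijreview.com/wp-admin/admin-ajax.php',
                          formdata={'action': 'load_more',
                                    'numPosts': str(page_size),
                                    'offset': str(offset),
                                    'category': '',
                                    'orderby': 'date',
                                    'time': ''})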