Scrapy CrawlSpider + Splash: how to follow links through LinkExtractor?

I have the following code, which only partly works:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(
            LinkExtractor(
                allow=(),
                restrict_xpaths=("//a[contains(text(), 'Next Page')]"),
            ),
            callback='parse_item',
            process_request='start_requests',
            follow=True,
        ),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        # item parser
        pass

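The code runs only against start_urls and does not follow the links specified by restrict_xpaths. If I comment out the start_requests() method and the line process_request='start_requests', in the rules, it runs and follows the links as intended, but of course without JS rendering.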

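I have read the two related questions, "CrawlSpider with Splash getting stuck after first URL" and "CrawlSpider with Splash", and in particular changed scrapy.Request() to SplashRequest() in the start_requests() method, but that does not seem to work. What is wrong with my code? Thanks,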

2017-08-25 eN_Joy

Use the code below (just copy and paste):

restrict_xpaths=('//a[contains(text(), "Next Page")]')

instead of

restrict_xpaths=("//a[contains(text(), 'Next Page')]")

2017-08-26 16:55:20 Kapil

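This doesn't seem to help. Note that the line restrict_xpaths=("//a[contains(text(), 'Next Page')]") works fine if I comment out start_requests(). Anyway, I have realized this is an unresolved issue that many users have reported here: https://github.com/scrapy-plugins/scrapy-splash/issues/92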

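I have had a similar problem that seemed specific to integrating Splash with a Scrapy CrawlSpider. It would visit only the start URL and then close. The only way I managed to get it to work was to not use the scrapy-splash plugin and instead use the process_links method to prepend the Splash HTTP API URL to all of the links Scrapy collects. I then made other adjustments to compensate for the new problems this method creates. Here is what I did: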

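You will need these two tools to put the Splash URL together and then take it apart again if you intend to store the original URL somewhere.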

from urllib.parse import urlencode, parse_qs

With the Splash URL prepended to every link, Scrapy will filter them all out as offsite requests, so we make 'localhost' the allowed domain.

allowed_domains = ['localhost'] 
start_urls = ['https://www.example.com/']

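However, this presents a problem: we may now end up crawling the web endlessly when we only want to crawl one site. Let's fix this with the LinkExtractor rule. By extracting links only from our desired domain, we get around the offsite request problem.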

LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')), 
process_links='process_links',
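
As a quick sanity check on what that allow pattern lets through, here is a minimal sketch using only the standard library. It assumes the pattern is applied with regex search semantics (matching anywhere in the URL); the helper name is_allowed and the sample URLs are made up for illustration.

```python
import re

# Same pattern as in the LinkExtractor above, with the target domain filled in.
ALLOW = r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')


def is_allowed(url):
    # Assumption: the pattern is applied as a search, so it may match
    # anywhere inside the URL, not only at the start.
    return re.search(ALLOW, url) is not None


for url in (
    "https://www.example.com/page",        # target site
    "http://bbs.example.com/diy",          # subdomain of the target site
    "https://other.org/",                  # unrelated site
    "https://other.org/?ref=example.com",  # unrelated site mentioning the domain
):
    print(url, is_allowed(url))
```

Under that assumption the pattern is fairly loose: any URL that merely contains the domain string, such as the last sample, also passes, so it may be worth tightening if stray links show up in the crawl.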

Here is the process_links method. The dictionary passed to urlencode is where you put all of your Splash arguments.

def process_links(self, links):
    for link in links:
        if "http://localhost:8050/render.html?&" not in link.url:
            link.url = "http://localhost:8050/render.html?&" + urlencode(
                {'url': link.url, 'wait': 2.0})
    return links
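
To see what the rewritten links look like without running a crawl, the same logic can be tried as a plain function. This is only a sketch: FakeLink is a hypothetical stand-in for Scrapy's link objects (only the url attribute is used), and SPLASH_PREFIX is a name introduced here for the prefix string from the method above.

```python
from urllib.parse import urlencode

SPLASH_PREFIX = "http://localhost:8050/render.html?&"


class FakeLink:
    """Hypothetical stand-in for a Scrapy link; only .url is used here."""

    def __init__(self, url):
        self.url = url


def process_links(links):
    # Same logic as the spider method, written as a plain function for the demo.
    for link in links:
        if SPLASH_PREFIX not in link.url:
            link.url = SPLASH_PREFIX + urlencode({'url': link.url, 'wait': 2.0})
    return links


links = process_links([FakeLink("https://www.example.com/page?id=1")])
print(links[0].url)

# Running it a second time leaves the URL unchanged, because the prefix check
# skips links that were already rewritten.
once = links[0].url
process_links(links)
print(links[0].url == once)
```

The first print should show the original URL percent-encoded inside the url parameter, i.e. http://localhost:8050/render.html?&url=https%3A%2F%2Fwww.example.com%2Fpage%3Fid%3D1&wait=2.0.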

Finally, to get the original URL back out of the Splash URL, use the parse_qs method.

parse_qs(response.url)['url'][0]

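One last note about this approach. You will notice that I have an '&' at the beginning of the Splash URL's query string (...render.html?&). This makes parsing the Splash URL to recover the actual URL consistent, no matter what order the arguments are in when you use the urlencode method.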
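
The effect of that leading '&' can be checked with a small round trip using only the standard library. This is a sketch: the helper name original_url and the sample target URL are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

SPLASH_PREFIX = "http://localhost:8050/render.html?&"


def original_url(splash_url):
    # parse_qs splits the whole string on '&'. The leading
    # "http://...render.html?" chunk contains no '=' and is dropped.
    return parse_qs(splash_url)['url'][0]


target = "https://www.example.com/page?id=1"

# With the '&', the position of 'url' among the arguments does not matter.
url_first = SPLASH_PREFIX + urlencode({'url': target, 'wait': 2.0})
url_last = SPLASH_PREFIX + urlencode({'wait': 2.0, 'url': target})
print(original_url(url_first) == target)
print(original_url(url_last) == target)

# Without the '&', the first argument is glued to the endpoint, so the key
# becomes "http://localhost:8050/render.html?url" instead of "url".
no_amp = "http://localhost:8050/render.html?" + urlencode({'url': target, 'wait': 2.0})
print('url' in parse_qs(no_amp))
```

Note that this relies on passing the whole URL string to parse_qs rather than only its query part, which is what the one-liner above does as well.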

2017-12-11 16:17:15
