How to extract all the content of a web page with Scrapy


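I am using Scrapy 1.4 to scrape content from web pages by specifying a set of URLs. How do I extract the various pieces of information from a page, such as its URL, title, and body text?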

At the moment I am using the URL below:

https://healthlibrary.epnet.com/GetContent.aspx?token=3bb6e77f-7239-4082-81fb-4aeb0064ca19&chunkiid=32905

And my code is:

class gsapocSpider(BaseSpider):
    name = "gsapoc"
    start_urls = ["https://healthlibrary.epnet.com/GetContent.aspx?token=3bb6e77f-7239-4082-81fb-4aeb0064ca19&chunkiid=32905"]

    def parse(self, response):
        responseSelector = Selector(response)
        for sel in responseSelector.xpath('//ul/li'):
            item = GsapocItem()
            item['title'] = sel.xpath('//ul/li/a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body'] = sel.xpath('//body//p//text()').extract()
            #item['text'] = sel.xpath('//text()').extract()
            #body = response.xpath('//body//p//text()').extract()
            #print(body)
            yield item

Source

2017-09-26 Shankar Rao

I don't understand why you set up the XPath expressions this way. The page doesn't even contain a ul element.

Since your goal is just to get the URL, title, and body, here are some suggestions:

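URL. You can get the URL from the response with response.url.
Title. Depending on which title you are looking for, there are two options: the title tag, or a specific element on the page.
Body. Do you want the whole page or only the text? For the former, response.body is fine; for the latter, you need to specify how to extract all of the content.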

In any case, I think you need to learn some HTML and XPath.

Thanks.

Source

2017-09-28 21:45:16 rojeeer
