R: Using the rvest package instead of the XML package to get links from a URL

I use the XML package to get the links from this url.

library(XML) 

# Parse the HTML at the URL 
v1WebParse <- htmlParse(v1URL) 
# Read the links and get the quotes of the companies from the href 
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

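Although this method is very efficient, I have been using rvest, which seems to parse web pages faster than XML. I tried html_nodes and html_attrs, but I couldn't get it to work.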

2014-12-04 capm

'rvest' uses the 'XML' package to extract nodes. It really shouldn't be any faster. – hrbrmstr 2014-12-04 17:18:41

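Despite my comment, here is how to do it with rvest. Note that we need to read the page in with htmlParse first, because the site sets the content type of that file to text/plain, which throws rvest into a tizzy.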

library(rvest) 
library(XML) 

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat") 
pg %>% html_nodes("a") %>% html_attr("href") 

## [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html" 
## [3] "/inf_corporativa66100_ACESEGC1.html" "/inf_corporativa71300_ADCOMEC1.html" 
## ... 
## [273] "/inf_corporativa64801_VOLCAAC1.html" "/inf_corporativa58501_YURABC11.html" 
## [275] "/inf_corporativa98959_ZNC.html"

This further illustrates rvest's XML package underpinnings.

UPDATE

rvest::read_html() can now handle this directly:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
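
Regarding the `html_attrs` attempt mentioned in the question: `html_attr()` (singular) extracts one named attribute and returns a character vector, whereas `html_attrs()` (plural) returns every attribute of each node as a list. That difference is a plausible reason the original attempt did not yield a plain vector of links. The sketch below runs on a small inline HTML string (a made-up stand-in, not the real page) so it needs no network access; `base_url` is likewise only an assumed value, used to turn the relative hrefs into absolute ones with `xml2::url_absolute()`.

```r
library(rvest)  # also provides read_html() and %>%

# Hypothetical stand-in for the downloaded page
page_src <- '<html><body>
  <a href="/inf_corporativa71050_JAIME1CP1A.html" class="x">A</a>
  <a href="/inf_corporativa10400_INTEGRC1.html">B</a>
</body></html>'

pg <- read_html(page_src)

# One attribute per node: a character vector of hrefs
hrefs <- pg %>% html_nodes("a") %>% html_attr("href")

# All attributes per node: a list of named character vectors
all_attrs <- pg %>% html_nodes("a") %>% html_attrs()

# Resolve the relative links against an assumed base URL
base_url <- "http://www.bvl.com.pe"
abs_links <- xml2::url_absolute(hrefs, base_url)
```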

2014-12-04 17:25:30 hrbrmstr

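You're right, 'rvest' uses 'XML' for node extraction. I will discuss in chat the timing differences I see on the sites where I use the packages. Thanks for the reply. – capm 2014-12-30 06:02:15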
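
Since the thread turns on which package is faster, one hedged way to check is to time both approaches on the same document. The sketch below builds a synthetic page in memory (an assumption, chosen so that network time is excluded) and uses base R's `system.time()`. Timings will vary by machine and package version. Note also that rvest releases from 0.3.0 onward are built on xml2 rather than XML, so the two calls may not share a parser binding.

```r
library(XML)
library(rvest)

# Synthetic page with n links (hypothetical data, not the real site)
n <- 2000
page_src <- paste0(
  "<html><body>",
  paste0('<a href="/page', seq_len(n), '.html">link</a>', collapse = ""),
  "</body></html>"
)

# XML package: parse the string, then pull href via XPath
time_xml <- system.time({
  doc <- htmlParse(page_src, asText = TRUE)
  links_xml <- xpathSApply(doc, "//a", xmlGetAttr, "href")
})

# rvest: parse the string, select <a> nodes, extract href
time_rvest <- system.time({
  links_rvest <- read_html(page_src) %>% html_nodes("a") %>% html_attr("href")
})

# Elapsed seconds for each approach
print(c(XML = time_xml[["elapsed"]], rvest = time_rvest[["elapsed"]]))
```

A single run like this is noisy; repeating it several times gives a fairer picture.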

I know you're looking for an rvest answer, but here is another approach using the XML package that may be more efficient than what you're doing.

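Have you seen the getLinks() function in example(htmlParse)? I use this modified version of it to get the href links. It is a handler function, so the values are collected as the document is read, which saves memory and improves efficiency.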

library(XML) 

# Collect href values with a handler while the document is being parsed 
links <- function(URL) 
{ 
    getLinks <- function() { 
        links <- character() 
        list(a = function(node, ...) { 
                 links <<- c(links, xmlGetAttr(node, "href")) 
                 node 
             }, 
             links = function() links) 
    } 
    h1 <- getLinks() 
    htmlTreeParse(URL, handlers = h1) 
    h1$links() 
} 

links("http://www.bvl.com.pe/includes/empresas_todas.dat") 
# [1] "/inf_corporativa71050_JAIME1CP1A.html" 
# [2] "/inf_corporativa10400_INTEGRC1.html" 
# [3] "/inf_corporativa66100_ACESEGC1.html" 
# [4] "/inf_corporativa71300_ADCOMEC1.html" 
# [5] "/inf_corporativa10250_HABITAC1.html" 
# [6] "/inf_corporativa77900_PARAMOC1.html" 
# [7] "/inf_corporativa77935_PUCALAC1.html" 
# [8] "/inf_corporativa77600_LAREDOC1.html" 
# [9] "/inf_corporativa21000_AIBC1.html"  
# ... 
# ...
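
To see the handler mechanism in isolation, here is a minimal sketch that runs on an inline HTML string (invented for illustration) instead of the live URL. The helper name `make_link_collector` is hypothetical, and unlike the function above it keeps the accessor outside the handler list so that only `a` is registered as a handler.

```r
library(XML)

# Hypothetical helper: a closure holding the collected links
make_link_collector <- function() {
  links <- character()
  list(
    handlers = list(
      # Called for every <a> element as the tree is converted
      a = function(node, ...) {
        links <<- c(links, xmlGetAttr(node, "href"))  # NULL (no href) adds nothing
        node
      }
    ),
    get = function() links
  )
}

# Made-up page: two anchors with href, one without
page_src <- '<html><body><a href="/one.html">1</a><p><a href="/two.html">2</a></p><a name="no-href">3</a></body></html>'

collector <- make_link_collector()
invisible(htmlTreeParse(page_src, handlers = collector$handlers, asText = TRUE))
collector$get()
```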

2014-12-04 15:29:59

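Great help. I hadn't checked the examples in 'htmlParse', but I modified my code following your suggestion. In this case 'XML' works well, but getting historical prices from this [web](http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20100101&fec_fin=20141130&nemonico=SIDERC1) takes longer than it does with 'rvest'. – capm 2014-12-04 16:22:05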

Prices? Your question says you are trying to get links – 2014-12-22 05:58:25

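Yes. From [this web page](http://www.bvl.com.pe/includes/empresas_todas.dat) I am trying to get all the links, while on [this site](http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20100101&fec_fin=20141130&nemonico=SIDERC1) I am trying to parse the table containing the historical prices of the SIDERC1 quote. I used 'XML' on both sites, but I could only use 'rvest' on the latter. – capm 2014-12-30 05:23:55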
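
For the second task mentioned in this comment, extracting a table of historical prices, rvest offers `html_table()`. The following is only a sketch on an invented inline table: the column names and values are made up for illustration and are not taken from the real page, whose layout may differ.

```r
library(rvest)

# Hypothetical table standing in for the historical-price page
page_src <- '<html><body><table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>2014-11-27</td><td>0.35</td></tr>
  <tr><td>2014-11-28</td><td>0.36</td></tr>
</table></body></html>'

# Select the first <table> and convert it to a data frame
prices <- read_html(page_src) %>%
  html_node("table") %>%
  html_table(header = TRUE)

prices
```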

# Option 1 
library(RCurl) 
library(XML) # getHTMLLinks() is exported by the XML package 
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat') 

# Option 2 
library(rvest) 
library(pipeR) # %>>% will be faster than %>% 
# html() was later deprecated in rvest in favour of read_html() 
html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>% html_nodes("a") %>>% html_attr("href")

2015-01-29 19:26:23

Option 1 no longer seems to work with the current version of RCurl. – 2017-03-27 17:17:06

Richard's answer works for HTTP pages, but not for the HTTPS pages I need (Wikipedia). I substituted RCurl's getURL function as follows:

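library(RCurl) 
library(XML) 

links <- function(URL) 
{ 
    getLinks <- function() { 
        links <- character() 
        list(a = function(node, ...) { 
                 links <<- c(links, xmlGetAttr(node, "href")) 
                 node 
             }, 
             links = function() links) 
    } 
    h1 <- getLinks() 
    # Fetch the page body as a string (getURL handles HTTPS) 
    xData <- getURL(URL) 
    # asText = TRUE tells the parser that xData is content, not a file name or URL 
    htmlTreeParse(xData, handlers = h1, asText = TRUE) 
    h1$links() 
}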
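
The key point in this variant is that `getURL()` returns the page body as a character string rather than a file name or URL, so the parser has to treat its input as text. A minimal offline sketch of that step, where `page_src` is a hypothetical stand-in for whatever `getURL()` would return:

```r
library(XML)

# Stand-in for the string that RCurl::getURL() would return (made-up content)
page_src <- '<html><body><a href="/wiki/R">R</a> <a href="/wiki/XML">XML</a></body></html>'

# Parse the string as HTML content, then read the href of every <a>
doc <- htmlParse(page_src, asText = TRUE)
wiki_links <- xpathSApply(doc, "//a", xmlGetAttr, "href")
wiki_links
```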

2016-04-26 20:43:58 bshor
