2017-04-24

R - How do I extract items from an XML nodeset?

I have a list of 438 pitcher names (in an XML nodeset) that looks like this:

> pitcherlinks[[1]] 
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01"> 
    <a href="/players/a/abadfe01.shtml">Fernando Abad</a>* 
</td> 

> pitcherlinks[[2]] 
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01"> 
    <a href="/players/a/adlemti01.shtml">Tim Adleman</a> 
</td> 

How do I extract the name, such as Fernando Abad, along with the corresponding link, /players/a/abadfe01.shtml?

Answer

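Since you have a list, use one of the apply functions to iterate over it. For each element, the function parses the HTML fragment with read_html and then uses the CSS selector a to find the anchor (link). The name comes from html_text, and the link is in the href attribute.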

library(rvest) 
pitcherlinks <- list() 
pitcherlinks[[1]] <- 
'<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01"> 
    <a href="/players/a/abadfe01.shtml">Fernando Abad</a>* 
    </td>' 

pitcherlinks[[2]] <- 
    '<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01"> 
    <a href="/players/a/adlemti01.shtml">Tim Adleman</a> 
     </td>' 

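# Parse each fragment, select the anchor, and take its text (the player name)
names <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_text()})
# Same traversal, but read the anchor's href attribute (the player link)
links <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_attr("href")})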

names 
# [1] "Fernando Abad" "Tim Adleman" 
links 
# [1] "/players/a/abadfe01.shtml" "/players/a/adlemti01.shtml"
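
The example above rebuilds the cells as character strings, whereas the question describes pitcherlinks as an XML nodeset. If that is the case, re-parsing with read_html may not be needed, since html_nodes can be applied to a nodeset directly. The following is a minimal sketch under that assumption. The page object and the names pitcher_cells, anchors, and pitchers are hypothetical stand-ins, not taken from the original post.

```r
library(rvest)

# Hypothetical stand-in for the scraped page: a tiny table holding
# the two cells shown in the question.
page <- read_html('<table><tr>
  <td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
    <a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
  </td>
  <td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
    <a href="/players/a/adlemti01.shtml">Tim Adleman</a>
  </td>
</tr></table>')

# Assumed to play the role of pitcherlinks: a nodeset of <td> cells.
pitcher_cells <- html_nodes(page, "td")

# Select the anchors inside the cells in one call, without apply.
anchors <- html_nodes(pitcher_cells, "a")

# Collect the link text and href of each anchor side by side.
pitchers <- data.frame(
  name = html_text(anchors),
  link = html_attr(anchors, "href"),
  stringsAsFactors = FALSE
)

print(pitchers)
```

This keeps each name and link in the same row. Note that it assumes every cell contains exactly one anchor; if some cells have none, the rows would no longer line up with the original cells.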