How do I fill in NA values when parsing an HTML page to build a data frame?

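When parsing an HTML page, some values can be missing. As a result, building a data frame from the extracted lists fails, because the lists differ in length. How can I get NA for the missing values so that the data frame can still be built?

Is there a simple way to do this? See the example below: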

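library(rvest)
library(RCurl)  # getURL()
library(XML)    # htmlTreeParse(), xpathApply(), xmlValue()

# Download the page and parse it into an internal document tree
pg <- getURL("https://agences.axa.fr/ile-de-france/paris/paris-19e-75019")
page <- htmlTreeParse(pg, useInternalNodes = TRUE, encoding = "UTF-8")

# Names are found for every agent
unlist(xpathApply(page, '//b[@class="Name"]', xmlValue))

# Fails when some agents have no street address: the two vectors differ in length
data.frame(noms = unlist(xpathApply(page, '//b[@class="Name"]', xmlValue)),
           rue = unlist(xpathApply(page, '//span[@class="street-address"]', xmlValue)))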
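The failure can be reproduced without touching the live site. The sketch below is an illustration only: it uses rvest instead of the XML package from the question, and the HTML fragment, agent names, and addresses are made up. It shows that selecting names and streets separately over the whole page yields vectors of different lengths, which `data.frame()` refuses to combine.

```r
library(rvest)

# Hypothetical page fragment: three agents, only two of them have an address
html <- '<div class="ListConseiller"><ul>
  <li><b class="Name">Agent A</b><span class="street-address"> 1 Rue Exemple </span></li>
  <li><b class="Name">Agent B</b></li>
  <li><b class="Name">Agent C</b><span class="street-address"> 2 Rue Exemple </span></li>
</ul></div>'
page <- read_html(html)

# Selecting over the whole document loses the link between a name and its street
noms <- page %>% html_nodes(xpath = '//b[@class="Name"]') %>% html_text()
rues <- page %>% html_nodes(xpath = '//span[@class="street-address"]') %>% html_text(trim = TRUE)

length(noms)
length(rues)

# data.frame() raises an error because the columns have different lengths
res <- tryCatch(data.frame(noms = noms, rue = rues), error = function(e) e)
inherits(res, "error")
```

Because the missing address simply produces no node, there is no way to tell afterwards which agent it belonged to; the lookup has to happen per list item.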

Source

2017-03-17 XR SC

Because you are using 'html_node' instead of 'html_nodes'. Also, RCurl is unnecessary here; you can pass the URL directly to 'read_html'. – alistaire

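Thanks @alistaire, I edited my question because the original one was silly. I have already asked a similar question: http://stackoverflow.com/questions/42588717/how-to-return-na-when-nothing-is-found-in-an-xpath, and based on your answer to that other question, a solution can be found. –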

Better question. However, you should show the call that loads XML, for reproducibility. – alistaire

Using rvest and purrr (the tidyverse package for lists and functional programming, which pairs very nicely with rvest):

library(rvest) 
library(purrr) 

# be nice, only scrape once 
h <- 'https://agences.axa.fr/ile-de-france/paris/paris-19e-75019' %>% read_html() 

df <- h %>% 
    # select each list item 
    html_nodes('div.ListConseiller li') %>% 
    # for each item, make a list of parsed name and street; coerce results to data.frame 
    map_df(~list(nom = .x %>% html_node('b.Name') %>% html_text(), 
       rue = .x %>% html_node('span.street-address') %>% html_text(trim = TRUE))) 

df 
#> # A tibble: 14 × 2
#>                          nom                     rue
#>                        <chr>                   <chr>
#> 1          Marie France Tmim                    <NA>
#> 2               Rachel Tobie                    <NA>
#> 3              Bernard Licha                    <NA>
#> 4               David Giuili                    <NA>
#> 5        Myriam Yajid Khalfi                    <NA>
#> 6              Eytan Elmaleh                    <NA>
#> 7           Allister Charles                    <NA>
#> 8             Serge Savergne   321 Rue De Belleville
#> 9           Patrick Allouche            1 Rue Clavel
#> 10              Anne Fleiter   14 Avenue De Laumiere
#> 11             Eric Fitoussi                    <NA>
#> 12 Jean-Baptiste Crocombette 1 Bis Rue Emile Desvaux
#> 13               Eric Zunino    14 Rue De Thionville
#> 14               Eric Hayoun                    <NA>

The code uses CSS selectors, but if you prefer, you can use XPath instead through the 'xpath' argument of 'html_nodes' and 'html_node'.
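As a rough sketch of the XPath variant, the same pipeline can be written with the 'xpath' argument. The HTML fragment and the agent names below are invented so the example runs offline; only the class names come from the code above. Note the leading '.' in the inner expressions, which keeps each search inside the current list item.

```r
library(rvest)
library(purrr)

# Hypothetical page fragment: the second agent has no street address
html <- '<div class="ListConseiller"><ul>
  <li><b class="Name">Agent A</b><span class="street-address"> 1 Rue Exemple </span></li>
  <li><b class="Name">Agent B</b></li>
  <li><b class="Name">Agent C</b><span class="street-address"> 2 Rue Exemple </span></li>
</ul></div>'
h <- read_html(html)

df <- h %>%
  # select each list item with XPath instead of CSS
  html_nodes(xpath = '//div[@class="ListConseiller"]//li') %>%
  # html_node() returns a "missing" node when nothing matches,
  # and html_text() turns that into NA
  map_df(~list(
    nom = .x %>% html_node(xpath = './/b[@class="Name"]') %>% html_text(),
    rue = .x %>% html_node(xpath = './/span[@class="street-address"]') %>% html_text(trim = TRUE)
  ))

df
```

Each list item contributes exactly one row, so an agent without an address gets NA in 'rue' instead of shifting the rows that follow.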

Source

2017-03-17 23:21:06 alistaire

Thanks @alistaire, now I'm starting to understand. By the way, we can also use XPath: 'xpath = '//div[@class="ListConseiller"]//li'' :) –

Maybe I should delete this question; it is the same as this one: http://stackoverflow.com/questions/42588717/how-to-return-na-when-nothing-is-found-in-an-xpath –

Can you tell me why this doesn't work? 'df <- h %>% html_nodes('div.ListConseiller li') %>% map_df(~list(nom = .x %>% html_nodes(xpath = '//b[@class="Name"]') %>% html_text(), rue = .x %>% html_nodes(xpath = '//span[@class="street-address"]') %>% html_text(trim = TRUE)))' Or how can I write the equivalent with 'xpath'? –
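A likely explanation, sketched here on made-up HTML rather than the real page: an XPath expression that starts with '//' searches from the root of the whole document, even when it is applied to a single list item, whereas './/' restricts the search to descendants of the current node. In addition, 'html_nodes' returns an empty set when nothing matches, so only 'html_node' yields the NA that the data frame needs.

```r
library(rvest)

# Hypothetical page fragment: the second agent has no street address
html <- '<div class="ListConseiller"><ul>
  <li><b class="Name">Agent A</b><span class="street-address"> 1 Rue Exemple </span></li>
  <li><b class="Name">Agent B</b></li>
  <li><b class="Name">Agent C</b><span class="street-address"> 2 Rue Exemple </span></li>
</ul></div>'
h <- read_html(html)

items <- html_nodes(h, 'div.ListConseiller li')
second_li <- items[[2]]

# '//' ignores the current node and matches every name in the document
absolute <- html_nodes(second_li, xpath = '//b[@class="Name"]')
# './/' matches only names inside this list item
relative <- html_nodes(second_li, xpath = './/b[@class="Name"]')

length(absolute)
length(relative)

# html_nodes() gives an empty result for the missing address ...
empty <- html_nodes(second_li, xpath = './/span[@class="street-address"]') %>% html_text(trim = TRUE)
# ... while html_node() gives a single NA
missing <- html_node(second_li, xpath = './/span[@class="street-address"]') %>% html_text(trim = TRUE)

length(empty)
missing
```

So the XPath equivalent of the CSS version would use 'html_node' with './/b[@class="Name"]' and './/span[@class="street-address"]' inside the 'map_df' call.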
