2017-10-04 48 views

Rvest: scraping data when an element is not present

I am having a hard time extracting values because some pages are missing the result-cats tag.

I have already looked at this question here, but I still cannot scrape the data.

HTML

<div class="result "> 
    <span class="result-txt"> 

     <span class="result-name"> 
      <a href="/some/value/">COMPANY_NAME</a> 
      <a class="result-icons" href="/some/value/COMPANY_NAME_/"> 
       <span title="Info" class="sprite sprite-info">Info</span> 
       <span title="Phone" class="sprite sprite-phone">Phone</span> 
      </a> 
     </span> 

     <em> 
      <a href="/some/value/">LOCATION</a> 
      <span> ADDRESS </span> 
     </em> 

     <span class="result-cats"> 
      <a href="/some/value/" title="CAT1">CAT1</a> 
      <a href="/some/value/" title="CAT2">CAT2</a> 
     </span> 

    </span> 
</div> 

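I tried the code below, but it throws an error because some results have no result-cats tag. html_nodes() only returns the elements that exist, so the cats vector ends up shorter than the other columns and the data frame cannot be built from vectors of different lengths.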

Code

library(rvest) 
library(XML) 
library(stringi) 

df <- data.frame(CompanyName = NULL, CompanyLink = NULL, Address = NULL, cats = NULL) 

for(i in 1:100){ 

    print(paste("Page: ", i, sep = "")) 

    url <- "url.com" 
    page <- read_html(url) 

    CompanyNameNode <- html_nodes(page,'.result-name a:nth-child(1)') 
    CompanyName <- html_text(CompanyNameNode) 
    CompanyLink <- html_attr(CompanyNameNode, 'href') 

    Address <- html_text(html_nodes(page,'.result-txt em')) 
    Address <- gsub("[\r\n]", "", Address) 

    cats <- html_text(html_nodes(page,'.result-cats')) 
    cats <- stri_trim(cats) 
    cats <- gsub("[\r\n]", ", ", cats) 

    df <- rbind(df, data.frame(CompanyName = CompanyName, 
          CompanyLink = CompanyLink, 
          Address = Address, 
          cats = cats)) 

} 
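
To make the length mismatch concrete, here is a minimal sketch on a made-up page with three results. The inline HTML and the names Acme, Bolt and Core are assumptions for illustration only and are not taken from the real site. The second result has no result-cats element, so the two vectors come out with different lengths and data.frame() stops with an error.

```r
library(rvest)

# Hypothetical page: three results, the second one has no .result-cats.
page <- read_html('
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/a/">Acme</a></span>
  <span class="result-cats"><a href="/c/">CAT1</a></span>
</span></div>
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/b/">Bolt</a></span>
</span></div>
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/d/">Core</a></span>
  <span class="result-cats"><a href="/c/">CAT3</a></span>
</span></div>')

# Page-wide selectors return only the elements that exist.
CompanyName <- html_text(html_nodes(page, ".result-name a:nth-child(1)"))
cats        <- html_text(html_nodes(page, ".result-cats"))

length(CompanyName)  # one entry per result
length(cats)         # one entry fewer, because one result has no categories

# data.frame() refuses columns of length 3 and 2; try() captures the error.
res <- try(data.frame(CompanyName = CompanyName, cats = cats), silent = TRUE)
inherits(res, "try-error")
```

Note that a mismatch is not always reported: if the shorter vector had length 1, data.frame() would silently recycle it, which is arguably worse than an error.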

Answer

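Use the following code. It looks for result-cats inside each result-txt node separately and stores NA when the element is absent, so every result contributes exactly one value.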

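# One node per search result; each result has a .result-txt wrapper.
pg <- html_nodes(page, '.result-txt')

# Start with NA everywhere so results without categories keep their slot.
cats <- rep(NA_character_, length(pg))

for (j in seq_along(pg)) {
    # Search inside this result only, not the whole page.
    node <- html_nodes(pg[j], '.result-cats')
    if (length(node) > 0) cats[j] <- html_text(node)[1]
}

cats <- stri_trim(cats)
cats <- gsub("[\r\n]", ", ", cats)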

This solved the problem.
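
As an alternative sketch, and not necessarily how the asker solved it: html_node() (singular) applied to a node set returns one entry per input node, with a missing node where nothing matched, and html_text() turns a missing node into NA. That removes the explicit loop. The inline HTML below is made up for illustration; in recent rvest releases the same functions are also available as html_element() and html_elements().

```r
library(rvest)

# Hypothetical page: three results, the second one has no .result-cats.
page <- read_html('
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/a/">Acme</a></span>
  <span class="result-cats"><a href="/c/">CAT1</a></span>
</span></div>
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/b/">Bolt</a></span>
</span></div>
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/d/">Core</a></span>
  <span class="result-cats"><a href="/c/">CAT3</a></span>
</span></div>')

# One node per result; every column below is derived from this node set.
results <- html_nodes(page, ".result-txt")

# html_node() keeps one slot per result, even when nothing matches.
name_node <- html_node(results, ".result-name a")

df <- data.frame(
  CompanyName = html_text(name_node),
  CompanyLink = html_attr(name_node, "href"),
  cats        = trimws(html_text(html_node(results, ".result-cats"))),
  stringsAsFactors = FALSE
)

df
```

Because every column is computed from the same results node set, each column has one entry per result, so the rbind() call in the page loop keeps working on pages where some results have no categories.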