2016-11-21

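Unable to scrape news website

I am building a dataset from the following news RSS feed: http://indianexpress.com/section/india/feed/

From this XML I read the following data:

  • Title
  • Title URL
  • Publication date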

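I now use the title URL to fetch the description (the synopsis shown below the main headline) by visiting each URL and scraping the page.

However, the description vector ends up with length 197 while the other vectors have length 200, so I cannot create my data frame.

Can someone help me scrape this data reliably and efficiently?

The code below is reproducible: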

library("httr") 
library("RCurl") 
library("jsonlite") 
library("lubridate") 
library("rvest") 
library("XML") 
library("stringr") 

url = "http://indianexpress.com/section/india/feed/" 

newstopics = getURL(url) 

newsxml = xmlParse(newstopics) 

title <- xpathApply(newsxml, "//item/title", xmlValue) 
title <- unlist(title) 

titleurl <- xpathSApply(newsxml, '//item/link', xmlValue) 
pubdate <- xpathSApply(newsxml, '//item/pubDate', xmlValue) 

t1 = Sys.time() 
desc <- NULL 

for (i in 1:length(titleurl)){ 

    page = read_html(titleurl[i]) 
    temp = html_text(html_nodes(page,'.synopsis')) 
    desc = c(desc,temp) 

} 

print(difftime(Sys.time(), t1, units = 'sec')) 

desc = gsub("\n",' ',desc) 

newsdata = data.frame(title,titleurl,desc,pubdate) 

I get the following error:

Error in data.frame(title, titleurl, desc, pubdate) : 
arguments imply differing number of rows: 200, 197 

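I think the problem is that 'temp' does not return a value on every iteration of the 'for' loop. Try replacing the 'desc' line with 'desc = c(desc, paste0("", temp))', although more elegant error handling would be preferable. – JasonAizkalns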


I checked that titleurl is never empty. Since every URL is a newspaper article link, I assumed each page would certainly have a synopsis.
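
The length mismatch discussed in these comments can be reproduced without any network access. The sketch below is an illustration only: the list 'scraped' is made-up data standing in for what 'html_text(html_nodes(page, '.synopsis'))' might return for three pages, one of which has no '.synopsis' element. It shows that appending a zero-length result with 'c()' silently shortens the vector, and that filling a pre-allocated vector keeps one entry per URL.

```r
# Hypothetical per-page results: the second page has no matching node,
# so its result is a zero-length character vector.
scraped <- list("Synopsis A", character(0), "Synopsis C")

# Appending with c() drops the empty result, so the vector comes up short.
desc_bad <- NULL
for (temp in scraped) {
  desc_bad <- c(desc_bad, temp)
}
length(desc_bad)  # 2, not 3

# Pre-allocate one slot per page and leave NA where nothing was found.
desc_ok <- rep(NA_character_, length(scraped))
for (i in seq_along(scraped)) {
  temp <- scraped[[i]]
  if (length(temp) > 0) {
    desc_ok[i] <- temp[1]  # keep the first match if there are several
  }
}
length(desc_ok)   # 3, one entry per page
```

With this pattern, the positions that hold NA also identify which URLs lack a synopsis, which the 'paste0("", temp)' workaround would hide behind empty strings.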

Answers


You can do the following:

library(tidyverse) 
library(xml2) 
library(rvest) 

feed <- read_xml("http://indianexpress.com/section/india/feed/") 

# helper function to extract information from the item node 
item2vec <- function(item){ 
    tibble(title = xml_text(xml_find_first(item, "./title")), 
     link = xml_text(xml_find_first(item, "./link")), 
     pubDate = xml_text(xml_find_first(item, "./pubDate"))) 
} 

dat <- feed %>% 
    xml_find_all("//item") %>% 
    map_df(item2vec) 

# The following takes a while 
dat <- dat %>% 
    mutate(desc = map_chr(dat$link, ~read_html(.) %>% html_node('.synopsis') %>% html_text)) 

This gives you a data.frame/tibble with 4 columns:

> glimpse(dat) 
Observations: 200 
Variables: 4 
$ title <chr> "Common man has no problem with note ban, says Santosh Gangwar", "Bombay High Court comes... 
$ link <chr> "http://indianexpress.com/article/india/india-news-india/demonetisation-note-ban-cash-cru... 
$ pubDate <chr> "Mon, 21 Nov 2016 20:04:21 +0000", "Mon, 21 Nov 2016 20:01:43 +0000", "Mon, 21 Nov 2016 1... 
$ desc <chr> "MoS for Finance speaks to Indian Express in Bareilly, his Lok Sabha constituency.", "The... 
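
A likely reason this version returns all 200 rows is the switch from 'html_nodes' to 'html_node': the plural form returns zero or more matches, while the singular form returns exactly one result per input and a missing node when nothing matches. The sketch below illustrates that difference on two small inline HTML snippets. The snippets are invented for this example and are not taken from the Indian Express site.

```r
library(rvest)

# Hypothetical pages: one with a synopsis, one without.
with_synopsis <- read_html(
  '<html><body><h1>Title</h1><h2 class="synopsis">Short summary</h2></body></html>'
)
without_synopsis <- read_html(
  '<html><body><h1>Title only</h1></body></html>'
)

# html_nodes() returns an empty set when nothing matches,
# so html_text() gives a zero-length vector.
all_matches <- html_text(html_nodes(without_synopsis, ".synopsis"))
length(all_matches)

# html_node() always yields one result; for a missing node html_text() gives NA.
first_missing <- html_text(html_node(without_synopsis, ".synopsis"))
first_present <- html_text(html_node(with_synopsis, ".synopsis"))
```

Because every page contributes exactly one value (text or NA), the resulting column always lines up with the 'link' column.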

PS: To get all the information for each item, you can use:

dat <- feed %>% 
    xml_find_all("//item") %>% 
    map_df(~xml_children(.) %>% {set_names(xml_text(.), xml_name(.))} %>% t %>% as_tibble)
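
The idea behind this last snippet is to treat each item as a named character vector (child tag name mapped to its text) and to turn that vector into a one-row table. The base R sketch below shows the same idea on made-up data. The 'items' list and the helper 'item_to_row' are hypothetical names introduced for this illustration, and it is not the answer's tidyverse pipeline. It also fills fields that an item lacks with NA.

```r
# Hypothetical items as named vectors; the second one lacks a pubDate field.
items <- list(
  c(title = "First headline", link = "http://example.com/1",
    pubDate = "Mon, 21 Nov 2016 20:04:21 +0000"),
  c(title = "Second headline", link = "http://example.com/2")
)

# Every field name seen across all items.
all_fields <- unique(unlist(lapply(items, names)))

# Turn one named vector into a one-row data frame, filling gaps with NA.
item_to_row <- function(item) {
  row <- item[all_fields]      # indexing by an absent name yields NA
  names(row) <- all_fields
  as.data.frame(t(row), stringsAsFactors = FALSE)
}

# Stack the one-row data frames into a single table.
dat <- do.call(rbind, lapply(items, item_to_row))
```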