2015-10-17 49 views
1

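rvest | Web scraping data into long format

While web scraping I ran into the following problem, and I suspect there is a better solution than the one I came up with.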

I have data like this:

dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany")) 

               query
1    Washington, USA
2 Frankfurt, Germany

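I want to query, for example, the Google Maps API and return the formatted address(es). A single query may return several formatted addresses. The result should look like this: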

               query         formatted_address
1    Washington, USA       Washington, DC, USA
2    Washington, USA       Washington, UT, USA
3    Washington, USA Washington, VA 22747, USA
4    Washington, USA Washington, IA 52353, USA
5    Washington, USA Washington, GA 30673, USA
6    Washington, USA Washington, PA 15301, USA
7 Frankfurt, Germany        Frankfurt, Germany

This is what I do at the moment:

require(RCurl) 
require(rvest) 
require(magrittr) 

build_url <- function(x, base_url = "https://maps.googleapis.com/maps/api/geocode/xml?address="){ 
    paste0(base_url, RCurl::curlEscape(x)) 
} 

l <- lapply(dat$query, function(q){ 
    formatted_address <- q %>% build_url %>% read_xml %>% xml_nodes("formatted_address") %>% xml_text 
    data.frame(query = q, formatted_address) 
}) 

do.call(rbind, l) # This can be done via data.table::rbindlist as well 

Is there a better solution? Perhaps something more in the data.table or dplyr style?

+1

Please include the 'library' / 'require' calls to make your code reproducible – jangorecki

+0

Sure. I have just added the 'require' statements alongside the data.frame creation – Rentrop

+2

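Apart from 'stringsAsFactors = FALSE', you have already optimized this perfectly, IMO. I would suggest adding a 'sleep' inside the lapply and making sure you limit the number of calls to 2500 or fewer, IIRC ([usage limits](https://developers.google.com/maps/documentation/business/articles/usage_limits) info). – hrbrmstr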
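The suggestion in the last comment could look roughly like the sketch below. The names geocode_throttled and fake_fetch are hypothetical, the 0.2 second default delay is an arbitrary assumption, and the cap of 2500 is taken from the comment rather than checked against the current usage limits. The function that performs the request is passed in as an argument, so the example runs offline with a stand-in; in practice it would be the build_url / read_xml pipeline from the question.

```r
# Hypothetical helper: call `fetch` once per query, pausing between calls and
# refusing to start if the number of queries exceeds `max_calls`.
# `fetch` is any function(query) that returns a character vector of addresses.
geocode_throttled <- function(queries, fetch, delay = 0.2, max_calls = 2500) {
  if (length(queries) > max_calls) {
    stop("Refusing to send ", length(queries), " requests; limit is ", max_calls)
  }
  out <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    if (i > 1) Sys.sleep(delay)  # pause between consecutive requests
    out[[i]] <- data.frame(query = queries[i],
                           formatted_address = fetch(queries[i]),
                           stringsAsFactors = FALSE)
  }
  do.call(rbind, out)
}

# Offline stand-in for the real request (assumption, for illustration only)
fake_fetch <- function(q) {
  if (q == "Washington, USA") c("Washington, DC, USA", "Washington, UT, USA") else q
}

dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany"),
                  stringsAsFactors = FALSE)
res <- geocode_throttled(dat$query, fake_fetch, delay = 0)
print(res)
```

Note that this sketch assumes every query returns at least one address; a query with zero matches would make the data.frame() call fail and needs separate handling.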

Answer

0

I have written the package googleway to access the Google Maps API with a valid API key (so if your data has more than 2500 items, you can pay for an API key).

To get the address details, use google_geocode():

library(googleway) 

key <- "your_api_key" 

dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany")) 

## To get all the data: 
res <- lapply(as.character(dat$query), function(q){ 
    google_geocode(address = q, 
                   key = key) ## use simplify = FALSE to return raw JSON 
}) 

## to access the 'formatted address' part, see 
res[[1]]$results$formatted_address 
# [1] "Washington, DC, USA"       "Washington, UT, USA"       "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"

## so to get everything as a list 
lapply(res, function(x){ 
    x$results$formatted_address 
}) 

# [[1]] 
# [1] "Washington, DC, USA"       "Washington, UT, USA"       "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"
# 
# [[2]] 
# [1] "Frankfurt, Germany" 

## and to put back onto your original data.frame: 
lst <- lapply(seq_along(res), function(i){ 
    data.frame(query = dat[i, "query"], 
               formatted_address = res[[i]]$results$formatted_address, 
               stringsAsFactors = FALSE) 
}) 

data.table::rbindlist(lst) 
#                 query         formatted_address
# 1:    Washington, USA       Washington, DC, USA
# 2:    Washington, USA       Washington, UT, USA
# 3:    Washington, USA Washington, VA 22747, USA
# 4:    Washington, USA Washington, IA 52353, USA
# 5:    Washington, USA Washington, GA 30673, USA
# 6:    Washington, USA Washington, PA 15301, USA
# 7: Frankfurt, Germany        Frankfurt, Germany
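
As a possible alternative to building one data.frame per query, the long format can also be assembled in a single vectorised step with rep() and lengths(). The sketch below uses base R only and a hard-coded list in place of the real API results; the "Nowhere" query and its empty result are hypothetical, added only to show one way of keeping a row for a query without matches.

```r
# Stand-in for lapply(res, function(x) x$results$formatted_address)
addresses <- list(
  c("Washington, DC, USA", "Washington, UT, USA"),
  "Frankfurt, Germany",
  character(0)  # hypothetical query that returned no match
)
queries <- c("Washington, USA", "Frankfurt, Germany", "Nowhere")

# Replace empty results with NA so that every query keeps at least one row
addresses[lengths(addresses) == 0] <- list(NA_character_)

# Repeat each query once per returned address, then flatten the addresses
long <- data.frame(
  query = rep(queries, times = lengths(addresses)),
  formatted_address = unlist(addresses),
  stringsAsFactors = FALSE
)
print(long)
```

Whether an unmatched query should appear as an NA row or be dropped depends on what you do with the result afterwards; dropping it only requires skipping the NA replacement step.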