2015-03-03

Answers

0

You can combine all of the tables into one wide data frame with list operations:

library(rvest) 
library(magrittr) 
library(dplyr) 

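date <- 20130701 
rng <- 1:4   # one page per hourlyN.php, N = 1..4 

my_tabs <- lapply(rng, function(i) { 
    url <- sprintf("http://apims.doe.gov.my/apims/hourly%d.php?date=%s", i, date) 
    # read_html() replaces html(), which newer rvest releases deprecated 
    pg <- read_html(url) 
    # take the first <table> on the page and use its first row as the header 
    pg %>% html_nodes("table") %>% extract2(1) %>% html_table(header=TRUE) 
}) 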

glimpse(plyr::join_all(my_tabs, by=colnames(my_tabs[[1]][1:2]))) 

## Observations: 52 
## Variables: 
## $ NEGERI/STATE (chr) "Johor", "Johor", "Johor", "Johor", "Kedah... 
## $ KAWASAN/AREA  (chr) "Kota Tinggi", "Larkin Lama", "Muar", "Pas... 
## $ MASA/TIME12:00AM (chr) "63*", "53*", "51*", "55*", "37*", "48*", ... 
## $ MASA/TIME01:00AM (chr) "62*", "52*", "52*", "55*", "36*", "48*", ... 
## $ MASA/TIME02:00AM (chr) "61*", "51*", "53*", "55*", "35*", "48*", ... 
## $ MASA/TIME03:00AM (chr) "60*", "50*", "54*", "55*", "35*", "48*", ... 
## $ MASA/TIME04:00AM (chr) "59*", "49*", "54*", "54*", "34*", "47*", ... 
## $ MASA/TIME05:00AM (chr) "58*", "48*", "54*", "54*", "34*", "45*", ... 
## $ MASA/TIME06:00AM (chr) "57*", "47*", "53*", "53*", "33*", "45*", ... 
## $ MASA/TIME07:00AM (chr) "57*", "46*", "52*", "53*", "32*", "45*", ... 
## $ MASA/TIME08:00AM (chr) "56*", "45*", "52*", "52*", "32*", "44*", ... 
## ... 

I rarely load or use plyr anymore because of its naming conflicts with dplyr, but join_all is a perfect fit for this situation.
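If you would rather avoid plyr entirely, a similar result can probably be had by folding the list with `Reduce` and `dplyr::left_join` (`join_all` also defaults to a left join). This is only a sketch: the two small tables and their column names (`state`, `area`) are made-up stand-ins for the scraped tables, not real readings.

```r
library(dplyr)

# Toy stand-ins for two of the scraped tables (values are illustrative only)
tabs <- list(
  data.frame(state = c("Johor", "Kedah"), area = c("Muar", "Alor Setar"),
             `12:00AM` = c("51*", "37*"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(state = c("Johor", "Kedah"), area = c("Muar", "Alor Setar"),
             `06:00AM` = c("53*", "33*"),
             check.names = FALSE, stringsAsFactors = FALSE)
)

# The first two columns identify a station, so they act as the join keys
keys <- c("state", "area")

# Fold the list pairwise: ((tab1 join tab2) join tab3) ...
wide <- Reduce(function(x, y) left_join(x, y, by = keys), tabs)
```

With the real data, `tabs` would be `my_tabs` and `keys` would be `colnames(my_tabs[[1]])[1:2]`.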

You may also want this data in long format:

plyr::join_all(my_tabs, by=colnames(my_tabs[[1]][1:2])) %>% 
    tidyr::gather(masa, nilai, -1, -2) %>% 
# better column names 
    rename(nigeri=`NEGERI/STATE`, kawasan=`KAWASAN/AREA`) %>% 
# cleanup & convert time (using local timezone) 
# %I (12-hour clock) is what pairs with %p; with %H, "12:00AM" parses as noon 
# make readings numeric; NA will sub for # 
    mutate(masa=gsub("MASA/TIME", "", masa), 
     masa=as.POSIXct(sprintf("%s %s", date, masa), format="%Y%m%d %I:%M%p", tz="Asia/Kuala_Lumpur"), 
     nilai=as.numeric(gsub("[[:punct:]]+", "", nilai))) -> pollut 

head(pollut) 
# the first rows are all 12:00AM readings, so masa prints as the bare date 
##   nigeri                 kawasan       masa nilai 
## 1  Johor             Kota Tinggi 2013-07-01    63 
## 2  Johor             Larkin Lama 2013-07-01    53 
## 3  Johor                    Muar 2013-07-01    51 
## 4  Johor            Pasir Gudang 2013-07-01    55 
## 5  Kedah              Alor Setar 2013-07-01    37 
## 6  Kedah Bakar Arang, Sg. Petani 2013-07-01    48 
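
One detail worth checking when converting the time labels: in `strptime`-style formats, `%p` is documented for use with `%I` (12-hour clock), not `%H`. A minimal sketch, assuming labels shaped like the `MASA/TIME` column suffixes and an English-style AM/PM locale:

```r
# Sample 12-hour labels in the style of the column suffixes (chosen for illustration)
labels <- c("12:00AM", "01:00AM", "12:00PM", "01:00PM")

# %I is the 12-hour clock field that %p modifies
stamps <- as.POSIXct(paste("20130701", labels),
                     format = "%Y%m%d %I:%M%p", tz = "Asia/Kuala_Lumpur")

# Recover the 24-hour clock value to confirm AM and PM were told apart
hours <- as.integer(format(stamps, "%H"))
```

Inspecting `hours` is a quick way to confirm that midnight and noon, and the morning and afternoon readings, end up at distinct times.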

Thanks! This works great. – 2015-03-03 12:17:45

One more quick question, though: how would I create a loop to scrape the data from the first date (July 1, 2013) up to the current date? – 2015-03-03 12:46:50

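Wrap the core of this solution in another 'lapply', feed it a vector of dates built from a date sequence (see '?seq.Date' & '?format'), then use 'bind_rows' from 'dplyr' to combine the results into one big data frame. – hrbrmstr 2015-03-03 12:55:45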
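
The suggestion in that comment might look roughly like the sketch below. `scrape_range` and `fetch_day` are hypothetical names, not part of any package: `fetch_day` stands for whatever function takes a "YYYYMMDD" string and returns that day's data frame (for example, the `lapply` + `join_all` step from the answer). An offline stand-in is used here so the sketch runs without hitting the site.

```r
library(dplyr)

# Build one data frame from many daily scrapes.
# `fetch_day` (hypothetical) takes a "YYYYMMDD" string and returns a data frame.
scrape_range <- function(from, to, fetch_day) {
  days   <- seq(as.Date(from), as.Date(to), by = "day")  # see ?seq.Date
  stamps <- format(days, "%Y%m%d")                        # see ?format.Date
  daily  <- lapply(stamps, function(d) {
    tab <- fetch_day(d)
    tab$date <- d   # record which day each row came from
    tab
  })
  bind_rows(daily)
}

# Offline stand-in for the real scraper; the values are illustrative only
fake_fetch <- function(d) {
  data.frame(kawasan = c("Muar", "Alor Setar"), nilai = c(51, 37),
             stringsAsFactors = FALSE)
}

all_days <- scrape_range("2013-07-01", "2013-07-03", fake_fetch)
```

For the real run, `to` would be `Sys.Date()`. Since that means several hundred days at four requests each, it is probably worth adding a `Sys.sleep()` pause inside the fetch function.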

0

You can use R's readHTMLTable function to extract the HTML tables from the Malaysian DOE URLs given above. Taking the first URL as an example:

# Make sure you have the XML package installed 
library(XML) 
url <- "http://apims.doe.gov.my/apims/hourly1.php?date=20130701" 
all.tables <- readHTMLTable(url) 
# the URL you gave only has one <table> tag 
# (named aq.table rather than table, to avoid masking base::table) 
aq.table <- all.tables[[1]] 
# and now you have a data frame 'aq.table' which contains the contents 
# of the air pollutant table