2016-07-16 21 views
0

我得到了旅游交易数据集是这样的(约56万人次):数据帧1如何使用R提供每周简介?

ID  START TIME   DATE   ORIGIN DESTINATION  DAY 
1005   9.10   2012-01-02   A  B   Monday 
1005   18.15   2012-01-02   B  A   Monday 
1005   9.05   2012-01-08   A  B   Sunday 
1005   17.05   2012-01-08   B  A   Sunday 
1010   8.00   2012-01-09   A  C   Monday 
1010   12.00   2012-01-09   C  A   Monday 
1013   13.15   2012-01-10   D  E   Tuesday 
1013   15.30   2012-01-10   E  G   Tuesday 
1013   9.06   2012-01-12   D  E   Thursday 
...   ...   2012-..-..   .  .   ... 

和ID指数像这样(约1986年的ID):数据帧2

ID 
1005 
1010 
1013 
1015 
1030 
1034 
1036 
1031 
1040 
... 

我想创建一个基于这两个数据框的每周旅行概况。我不知道我是否是对的,但我想这些代码:

weekday = c("Sunday", "Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday") 
    br = seq(0,23,by=1) 
ranges = paste(head(br,-1), br[-1], sep="_") 

      for (i in dataframe2$ID) { 

        for (n in weekday){ 
        x= filter(dataframe1,dataframe1$ID %in% i & dataframe1$DAY %in% n) 
        freq = hist(as.numeric(x), br, include.lowest=TRUE, plot=FALSE) 
        df = as.data.frame(t(data.frame(frequency = freq$counts))) 
        df$i = i 
        df$n = n 
        colnames(df) = c(as.character(ranges),"ID","Day") 
        write.table(head(df),file="testdata1.csv", append=TRUE,sep=",",col.names=FALSE,row.names=FALSE) 
        } 
       } 

我想和包含其每周的行程频率的CSV表来结束。我也想问问是否有简单的方法来简化这项任务。

ID  0_1 1_2 2_3 3_4 4_5 5_6 6_7 7_8 8_9 9_10 10_11 11_12 12_13 13_14 14_15 15_16 16_17 17_18 18_19 19_20 20_21 21_22 22_23 Day 
1005 0 0 0 0 0 0 0 0 0 1  0  0  0  0  0  0  0  1  0  0  0  0  0 Sunday 
1005 0 0 0 0 0 0 0 0 0 1  0  0  0  0  0  0  0  1  0  0  0  0  0 Monday 
1005 0 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0 Tuesday 
1005                               Wednesday 
1005                               Thursday 
1005                               Friday 
1005                              Saturday 
1010                               Sunday 
1010 
1010 
1010 
1010 
1010 
1010 
到底

我想制作一个图表是这样的: enter image description here

+0

它的更好,如果你'dput'您的数据为您图中的数据总结 –

回答

1

这可以在基础R使用功能xtabs做,但它可能是一个有点更清楚,如果我们做到这一点使用dplyrtidyr包。通过这种方法,weekday被创建为R因子变量。然后使用dplyr函数mutateDAY转换为因子并将START_TIME转换为整数。我们接下来使用tidyr包中的complete来创建一个新的扩展数据帧,其中每个值为ID,DAYSTART_TIME,使用它们的完整值范围(例如每个ID的行,对于0:23中的每个开始时间和一周中的每一天,他们存在DATEORIGIN,和DESTINATION被使用的值;否则DATE, ORIGIN,DESTINATION列具有NA值每ID, DAY,START_TIME,行程的数量被计算为行的总和,其。没有NA的值为DATE并存储在Freqspread函数来自tidyr用于将Freq的每个不同值转换为单独的列。最后分配适当的列名称,按照请求的顺序排列列,并将写入文件的数据框以csv的形式写入。

library(dplyr) 
    library(tidyr) 
# 
# input data is in df 
# convert colunm name START TIME to syntactically correct version START_TIME 
# 
    colnames(df)[2] <- "START_TIME" 
# 
# define weekday as a factor with the days of week 
# 
    weekday <- c("Sunday", "Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday") 
    weekday <- factor(weekday, levels=weekday) 
# 
# sum number for trips by ID, DAY, and START_TIME 
# 
    trip_freq <- df %>% mutate(DAY = factor(DAY, levels=levels(weekday)), 
           START_TIME=floor(START_TIME)) %>% 
         complete(ID, DAY=weekday, START_TIME=0:23) %>% 
         group_by(ID, DAY, START_TIME) %>% 
         summarise(Freq = sum(!is.na(DATE))) 
    trip_freq_tbl <- trip_freq %>% spread(key = START_TIME, value=Freq) 
# 
# name and re-arrange columns 
# 
    colnames(trip_freq_tbl) <- c("ID", "Day", paste(0:23,1:24,sep="_")) 
    trip_freq_tbl <- cbind(trip_freq_tbl[,-2], Day=trip_freq_tbl[,"Day"])    
# 
# write trip_freq as csv fle 
# 
    write.table(trip_freq_tbl, file="testdata1.csv", sep=",", row.names=FALSE)  

可以进一步与

# 
# summarize the data for the plot 
# 
    trip_freq_plot <- trip_freq %>% group_by(DAY, START_TIME) %>% 
            summarize(Cnt = sum(Freq))