2016-11-20 90 views
-1

Processing data in R with data.table or dplyr, using group-by and date subsetting

I'm looking for some help with data.table and/or dplyr. I have a data frame like this:

Name  Date   X  Y 
Mike  2016-10-21 3.2 1.6 
Mike  2016-10-23 3.1 1.4 
Mike  2016-10-24 4.9 3.8 
Mike  2016-10-25 5.7 4.2 
Mike  2016-10-28 0.2 -1.1 
Bob  2016-10-21 2.2 -1.1 
Bob  2016-10-22 0.2 -3.6 
Bob  2016-10-24 -9.2 -14.1 
Bob  2016-10-25 -7.2 -12.1 
Alice 2016-10-20 7.2 6.1 
Alice 2016-10-21 2.2 0.1 
Alice 2016-10-23 13.2 8.1 
Alice 2016-10-25 12.6 8.8 
Alice 2016-10-27 7.7 4.7 
Alice 2016-10-28 8.2 5.0 

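I'd like to return the mean of X and Y for each person, but subset the data so that only the values from each person's 3 most recent dates are used, ignoring data from earlier dates. I'd also like to return the number of days spanned by those 3 most recent dates. Ideally, I'd end up with a data frame like this: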

Name  DaysBetween avgX avgY 
Mike    4 3.6  2.3 
Bob    3 -5.4 -9.9 
Alice    3 9.5  6.2 

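Edit note: this data will always be sorted by date, so we could probably just take the "last 3" data points for each person rather than using date logic to work out which three are the most recent.

Thanks very much for your help!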

+0

Hmm, have you tried anything that produced an error or a wrong result? – lukeA

+0

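No, I haven't. So far I've only been able to get the mean of X and Y over all of each person's data points, and I'm struggling with how to subset to the 3 most recent dates. I'm still a noob with data.table and dplyr, though. – user3808992

Answers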

0

You can use dplyr::top_n to filter the data:

library(dplyr) 

df %>% mutate(Date = as.Date(Date)) %>% # parse to Date class, if not already 
    group_by(Name) %>% 
    top_n(3, Date) %>% # filter to max 3 dates for each group 
    summarise(DaysBetween = max(Date) - min(Date), 
       avgX = mean(X), 
       avgY = mean(Y)) 

## # A tibble: 3 × 4 
##  Name DaysBetween avgX  avgY 
## <fctr>  <time> <dbl>  <dbl> 
## 1 Alice  3 days 9.5 6.166667 
## 2 Bob  3 days -5.4 -9.933333 
## 3 Mike  4 days 3.6 2.300000 
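
In dplyr 1.0.0 and later, `top_n()` is marked as superseded in favour of `slice_max()`. A minimal sketch of the same pipeline with `slice_max()` follows, assuming a recent dplyr is installed. The `read.table()` call only rebuilds the question's sample data so the snippet runs on its own, and `result` is an illustrative name.

```r
library(dplyr)

# Rebuild the sample data from the question (Date is read as character)
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name  Date       X    Y
Mike  2016-10-21 3.2  1.6
Mike  2016-10-23 3.1  1.4
Mike  2016-10-24 4.9  3.8
Mike  2016-10-25 5.7  4.2
Mike  2016-10-28 0.2  -1.1
Bob   2016-10-21 2.2  -1.1
Bob   2016-10-22 0.2  -3.6
Bob   2016-10-24 -9.2 -14.1
Bob   2016-10-25 -7.2 -12.1
Alice 2016-10-20 7.2  6.1
Alice 2016-10-21 2.2  0.1
Alice 2016-10-23 13.2 8.1
Alice 2016-10-25 12.6 8.8
Alice 2016-10-27 7.7  4.7
Alice 2016-10-28 8.2  5.0
")

result <- df %>%
  mutate(Date = as.Date(Date)) %>%   # parse to Date class
  group_by(Name) %>%
  slice_max(Date, n = 3) %>%         # keep the 3 latest dates per person
  summarise(DaysBetween = as.integer(max(Date) - min(Date)),
            avgX = mean(X),
            avgY = mean(Y))
```

Like `top_n()`, `slice_max()` keeps ties by default (`with_ties = TRUE`), so a person with duplicated dates could contribute more than 3 rows. Here `as.integer()` turns the time difference into a plain number of days, matching the desired output in the question.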
+0

Thanks! This works great! – user3808992

2

We can use data.table:

library(data.table) 
setDT(df1)[order(-Date), .(DaysBetween = as.integer(Date[1L] - Date[3L]), 
     avgX = mean(X[1:3]), avgY = round(mean(Y[1:3]),2)), by = Name] 
# Name DaysBetween avgX avgY 
#1: Mike   4 3.6 2.30 
#2: Alice   3 9.5 6.17 
#3: Bob   3 -5.4 -9.93 
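
The one-liner above relies on `Date` already being of class `Date` and on every person having at least 3 rows; with fewer rows, `X[1:3]` and `Date[3L]` would produce `NA`. A hedged variant that takes the last rows of each group with `tail(.SD, 3)` is sketched below. The sample data is rebuilt so the snippet is self-contained, and `dt`, `last3` and `result` are illustrative names.

```r
library(data.table)

# Rebuild the sample data from the question
dt <- as.data.table(read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name  Date       X    Y
Mike  2016-10-21 3.2  1.6
Mike  2016-10-23 3.1  1.4
Mike  2016-10-24 4.9  3.8
Mike  2016-10-25 5.7  4.2
Mike  2016-10-28 0.2  -1.1
Bob   2016-10-21 2.2  -1.1
Bob   2016-10-22 0.2  -3.6
Bob   2016-10-24 -9.2 -14.1
Bob   2016-10-25 -7.2 -12.1
Alice 2016-10-20 7.2  6.1
Alice 2016-10-21 2.2  0.1
Alice 2016-10-23 13.2 8.1
Alice 2016-10-25 12.6 8.8
Alice 2016-10-27 7.7  4.7
Alice 2016-10-28 8.2  5.0
"))

dt[, Date := as.Date(Date)]                      # parse to Date class

# Sort by date, then keep up to the last 3 rows of each person
last3 <- dt[order(Date), tail(.SD, 3), by = Name]

result <- last3[, .(DaysBetween = as.integer(max(Date) - min(Date)),
                    avgX = mean(X),
                    avgY = mean(Y)),
                by = Name]
```

If a person has only one or two rows, this version averages whatever rows exist instead of returning `NA`.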
+1

Thanks for your answer. This works well too! – user3808992

+0

@user3808992 Thanks for the feedback. You may also want to read [this](http://stackoverflow.com/help/someone-answers) – akrun

1

The answers above are all good; here is an iterative approach:

#initialize the output frame 
outputFrame = as.data.frame(matrix(nrow = length(unique(train$Name)), 
ncol = 4)) 

#renaming the data frame 
names(outputFrame) = c("Names", "daysBetween", "avgX", "avgY") 

#turn the date to a date 
train$Date = as.Date(train$Date, "%Y-%m-%d") 

#initialize the outputCounter 
outputCounter = 1 

#iterates over every unique Name in the data frame 
for(name in as.character(unique(train$Name))) 
{ 
    #subsets the dataframe into the values of each given level of Name 
    dfSubset = train[which(train$Name == name),] 

    #Orders the dataframe by date 
    dfSubset = dfSubset[order(dfSubset$Date),] 

    #get the 3 most recent dates 
    dfSubset = tail(dfSubset, 3) 

    #fill the names 
    outputFrame$Names[outputCounter] = name 

    #fill the days between 
    outputFrame$daysBetween[outputCounter] = as.numeric(max(dfSubset$Date) - min(dfSubset$Date)) 

    #get the average X 
    outputFrame$avgX[outputCounter] = mean(dfSubset$X) 

    #get the average Y 
    outputFrame$avgY[outputCounter] = mean(dfSubset$Y) 

    #increment outputCounter 
    outputCounter = outputCounter +1 
} 

This assumes `train` is your data frame.
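
The same per-person logic can also be written in base R without an explicit counter, by splitting the data frame by `Name` and binding the per-group summaries together. This is only a sketch: `summarise_last3` and `result` are hypothetical names, and the sample data is rebuilt so the snippet runs on its own.

```r
# Rebuild the sample data from the question
train <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name  Date       X    Y
Mike  2016-10-21 3.2  1.6
Mike  2016-10-23 3.1  1.4
Mike  2016-10-24 4.9  3.8
Mike  2016-10-25 5.7  4.2
Mike  2016-10-28 0.2  -1.1
Bob   2016-10-21 2.2  -1.1
Bob   2016-10-22 0.2  -3.6
Bob   2016-10-24 -9.2 -14.1
Bob   2016-10-25 -7.2 -12.1
Alice 2016-10-20 7.2  6.1
Alice 2016-10-21 2.2  0.1
Alice 2016-10-23 13.2 8.1
Alice 2016-10-25 12.6 8.8
Alice 2016-10-27 7.7  4.7
Alice 2016-10-28 8.2  5.0
")
train$Date <- as.Date(train$Date)

# Summarise one person's rows: keep the 3 latest dates, then aggregate
summarise_last3 <- function(d) {
  d <- tail(d[order(d$Date), ], 3)
  data.frame(Name = d$Name[1],
             DaysBetween = as.integer(max(d$Date) - min(d$Date)),
             avgX = mean(d$X),
             avgY = mean(d$Y))
}

# Apply the summary to each person's subset and stack the results
result <- do.call(rbind, lapply(split(train, train$Name), summarise_last3))
rownames(result) <- NULL
```

Because `split()` groups by the sorted unique names, the rows come out in alphabetical order (Alice, Bob, Mike) rather than in order of first appearance.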