2017-03-01 47 views
0

Grouping observations by a subset of features

I have a dataset like the one below:

date, time, product, shop_id

20140104 900 Banana 18 
20140104 900 Banana 19 
20140104 924 Banana 18 
20140104 929 Banana 18 
20140104 932 Banana 20 
20140104 948 Banana 18 

I need to extract the observations that have a distinct product and a distinct shop_id.

So I need to group the observations by product + shop_id.

Here is my code:

library(plyr)
p <- d_ply(shop, .(product, shop_id), table)
print(p)

Unfortunately, it prints NULL.

The dataset:

date=c(20140104,20140104,20140104,20140104,20140104)
time=c(924,900,854,700,1450)
product=c("Banana","Banana","Banana","Banana","Banana")
shop_id=c(18,18,18,19,20)
shop<-data.frame(date=date,time=time,product=product,shop_id=shop_id)

The output should be:

date, time, product, shop_id

20140104 900 Banana 19
20140104 932 Banana 20
20140104 948 Banana 18
+2

What are the 'time' values 948 and 932? –

+0

The given rows are selected by the logic that they have a different 'shop_id'. Each selected observation should have a unique product or shop_id, or both. – user5363938

+1

But why did you choose time 948 rather than 900 for Banana from shop 18? – ira

Answers

0

We can do:

library(tidyverse) 
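shop %>%
    # count the rows in each product + shop_id group
    group_by(product, shop_id) %>%
    mutate(n = n()) %>%
    # for each time, keep the row that belongs to the smallest group
    group_by(time) %>%
    arrange(n) %>%
    slice(1) %>%
    # for each product + shop_id group, keep the row with the latest time
    group_by(product, shop_id) %>%
    arrange(-time) %>%
    slice(1) %>%
    # drop the helper column and sort by time
    select(-n) %>%
    arrange(time)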
#  date time product shop_id 
#  <int> <int> <chr> <int> 
#1 20140104 900 Banana  19 
#2 20140104 932 Banana  20 
#3 20140104 948 Banana  18 
+0

I wish people would stop encouraging this anti-pattern of importing unused libraries. Be explicit. –

0

To take only the first row of each unique combination, just use aggregate from the stats package:

> aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){x[1]}) 

Group.1 Group.2  date time product shop_id 
1 Banana  18 20140104 924 Banana  18 
2 Banana  19 20140104 700 Banana  19 
3 Banana  20 20140104 1450 Banana  20 

Explanation: FUN=function(x){x[1]} takes only the first element in case of a collision.
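As a rough alternative sketch in base R, the same "first row per combination" idea can be expressed with duplicated(), which avoids the extra Group.1/Group.2 columns. The data frame below is rebuilt from the question with product quoted; the name first_rows is only illustrative.

```r
# Sample data from the question (product written as quoted strings).
shop <- data.frame(
  date    = c(20140104, 20140104, 20140104, 20140104, 20140104),
  time    = c(924, 900, 854, 700, 1450),
  product = c("Banana", "Banana", "Banana", "Banana", "Banana"),
  shop_id = c(18, 18, 18, 19, 20),
  stringsAsFactors = FALSE
)

# duplicated() flags every row whose (product, shop_id) pair has already
# appeared, so negating it keeps the first row of each combination.
first_rows <- shop[!duplicated(shop[, c("product", "shop_id")]), ]
print(first_rows)
```

This keeps the rows in their original order and leaves the column types untouched.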

To remove "Group.1", "Group.2", or other columns:

> res <- aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){x[1]}) 
> res[ , !(names(res) %in% c("Group.1", "Group.2"))] 
     date time product shop_id 
1 20140104 924 Banana  18 
2 20140104 700 Banana  19 
3 20140104 1450 Banana  20 

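PS: The dataset you provided is inconsistent with your desired example output, which is why the numbers differ.

PS2: If you want all of the data in case of a collision: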

> aggregate(shop, by=list(shop$product, shop$shop_id), FUN="identity") 
    Group.1 Group.2       date   time product shop_id 
1 Banana  18 20140104, 20140104, 20140104 924, 900, 854 1, 1, 1 18, 18, 18 
2 Banana  19      20140104   700  1   19 
3 Banana  20      20140104   1450  1   20 

If you want to flag the collisions:

> aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){if (length(x) > 1) NA else x}) 
    Group.1 Group.2  date time product shop_id 
1 Banana  18  NA NA  NA  NA 
2 Banana  19 20140104 700  1  19 
3 Banana  20 20140104 1450  1  20 

If you want to exclude the non-unique rows:

> res <- aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){if (length(x) > 1) NULL else x}) 

> res[res$product != "NULL", !(names(res) %in% c("Group.1", "Group.2"))] 
     date time product shop_id 
2 20140104 700  1  19 
3 20140104 1450  1  20 

To avoid the conversion from string to int (for product), use ""/"NULL"/"NA" instead of NULL/NA.
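Another way to drop the non-unique rows without any type conversion, sketched here in base R as one possible approach, is to compute each row's group size with ave() and filter on it. The names group_size and unique_rows are illustrative, and stringsAsFactors = FALSE is set explicitly so that product stays a character column.

```r
# Sample data from the question, keeping product as character.
shop <- data.frame(
  date    = c(20140104, 20140104, 20140104, 20140104, 20140104),
  time    = c(924, 900, 854, 700, 1450),
  product = c("Banana", "Banana", "Banana", "Banana", "Banana"),
  shop_id = c(18, 18, 18, 19, 20),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the size of its (product, shop_id) group.
group_size <- ave(seq_len(nrow(shop)), shop$product, shop$shop_id, FUN = length)

# Keep only the rows whose combination occurs exactly once.
unique_rows <- shop[group_size == 1, ]
print(unique_rows)
```

Because the rows are subset directly instead of being passed through aggregate, the original column types are preserved.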

0

It can be done with dplyr as follows:

# create the sample dataset 
date=c(20140104,20140104,20140104,20140104,20140104) 
time=c(924 ,900,854,700,1450) 
product=c("Banana","Banana","Banana","Banana","Banana") 
shop_id=c(18,18,18,19,20) 
shop<-data.frame(date=date,time=time,product=product,shop_id=shop_id) 

# load the dplyr library
library(dplyr) 

# take shop data 
shop %>% 
     # group by product, shop id, date 
     group_by(product, shop_id, date) %>% 
     # for each such combination, find the earliest time 
     summarise(time = min(time)) %>% 
     # group by product, shop id 
     group_by(product, shop_id) %>% 
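     # for each combination of product & shop id
     # return the earliest date and the time recorded on that date;
     # time is computed first because, once date is reassigned, later
     # expressions in the same summarise() see the new single value
     summarise(time = time[date == min(date)], date = min(date))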
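For comparison, here is a minimal base R sketch of the same "earliest observation per product + shop_id" idea, assuming that sorting by date and then time defines "earliest". The names ordered and earliest are illustrative only.

```r
# Sample data from the question.
shop <- data.frame(
  date    = c(20140104, 20140104, 20140104, 20140104, 20140104),
  time    = c(924, 900, 854, 700, 1450),
  product = c("Banana", "Banana", "Banana", "Banana", "Banana"),
  shop_id = c(18, 18, 18, 19, 20),
  stringsAsFactors = FALSE
)

# Sort by date, then time, so the earliest observation of each group comes first.
ordered <- shop[order(shop$date, shop$time), ]

# Keep the first (earliest) row for each product + shop_id combination.
earliest <- ordered[!duplicated(ordered[, c("product", "shop_id")]), ]
print(earliest)
```

Unlike the summarise-based version, this returns whole rows, so any additional columns in the data frame would be carried along unchanged.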