2015-02-08 64 views
5

Plot by group in data.table

I have individual-level data, and I'm trying to dynamically summarize the outcome by group.

Example:

library(data.table)

set.seed(12039)
DT <- data.table(id = rep(1:100, each = 50),
                 grp = rep(letters[1:4], each = 1250),
                 time = rep(1:50, 100),
                 outcome = rnorm(5000))

I'd like to know the simplest way to plot the group-level summary, the data for which is contained in:

DT[ , mean(outcome), by = .(grp, time)] 

I was imagining something like this:

[code snippet missing from the source]

But this doesn't work at all.

The only feasible option I've come up with (which could quite easily be looped) is:

plot(DT[grp == "a", mean(outcome), by = time]) 
lines(DT[grp == "b", mean(outcome), by = time]) 
lines(DT[grp == "c", mean(outcome), by = time]) 
lines(DT[grp == "d", mean(outcome), by = time]) 

(with added parameters for color, etc., excluded for conciseness)
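For reference, the loop alluded to above could be sketched roughly as follows. This is only an illustration: the names DT_m, avg, grps and k are made up for the sketch, and the means are computed once up front so that the axis limits cover every group.

```r
library(data.table)

set.seed(12039)
DT <- data.table(id = rep(1:100, each = 50),
                 grp = rep(letters[1:4], each = 1250),
                 time = rep(1:50, 100),
                 outcome = rnorm(5000))

# Aggregate once so the axis limits can cover every group
# (DT_m, avg, grps and k are illustrative names)
DT_m <- DT[ , .(avg = mean(outcome)), by = .(grp, time)]
grps <- unique(DT_m$grp)

# Empty canvas sized to the full range of the group means
plot(NULL, xlim = range(DT_m$time), ylim = range(DT_m$avg),
     xlab = "time", ylab = "mean(outcome)")

# One line per group; the loop index doubles as the colour
for (k in seq_along(grps)) {
  DT_m[grp == grps[k], lines(time, avg, col = k)]
}
legend("topright", legend = grps, col = seq_along(grps), lty = 1)
```

It works, but it still spells out by hand the grouping that data.table normally handles itself.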

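This doesn't strike me as the best way to do it. Given data.table's skill at handling groups, isn't there a more elegant solution?

Other sources have pointed me to matplot, but I can't see a straightforward way to use it. Do I need to reshape DT, and is there a simple reshape that would get the job done?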

Answers

4

Base R solution using matplot and dcast

# Aggregate to one mean per group and time, then cast to wide (one column per group)
dt_agg <- DT[ , .(mean = mean(outcome)), by = .(grp, time)]
dt_cast <- dcast(dt_agg, time ~ grp, value.var = "mean")
dt_cast[ , matplot(time, .SD[ , !"time", with = FALSE],
                   type = "l", ylab = "mean", xlab = "")]
# or, if you've got data.table version 1.9.7+:
# (see https://github.com/Rdatatable/data.table/wiki/Installation)
dt_cast[ , matplot(time, .SD, type = "l", ylab = "mean", xlab = ""), .SDcols = !"time"]

Result: [image]

+2

This works, but 'dt_cast[, setdiff(names(dt_cast), "time"), with = F]' or 'dt_cast[, levels(dt$grp), with = F]' is needed when there are many groups. Thanks! – MichaelChirico 2015-02-09 12:41:17

+0

Actually, recent updates to 'data.table' make this even easier! – MichaelChirico 2016-10-05 03:21:08

0

Using reshape2 you can transform your dataset like this:

new_dt <- dcast(DT, time ~ grp, value.var = 'outcome', fun.aggregate = mean)

new_dt_molten <- melt(new_dt, id.vars = 'time')

and then plot it with ggplot2 like this:

ggplot(new_dt_molten,aes(x=time,y=value,colour=variable)) + geom_line() 

Or (the simpler solution, actually) you can use the dataset directly and do something like:

ggplot(DT, aes(x = time, y = outcome, colour = grp)) + geom_jitter() + geom_smooth(method = 'loess')

ggplot(DT, aes(x = time, y = outcome, colour = grp)) + geom_smooth(method = 'loess')

4

You were very much on the right track. Use ggplot to do this as follows:

(dt_agg <- DT[ , .(mean = mean(outcome)), by = list(grp, time)]) # Aggregated data.table
#      grp time       mean
#   1:   a    1 0.75865672
#   2:   a    2 0.07244879
#  ---

Now ggplot this aggregated data.table:

require(ggplot2) 
ggplot(dt_agg, aes(x = time, y = mean, col = grp)) + geom_line() 

Result: [image]

4

There is a way to do this with data.table's by argument, as follows:

DT[ , mean(outcome), by = .(grp, time) 
    ][ , {plot(NULL, xlim = range(time), 
      ylim = range(V1)); .SD} 
     ][ , lines(time, V1, col = .GRP), by = grp] 

Note that the intermediate part with {...; .SD} is needed to continue chaining. If DT[ , mean(outcome), by = .(grp, time)] were already stored as another data.table, DT_m, then we could just do:

DT_m[ , plot(NULL, xlim = range(time), ylim = range(V1))] 
DT_m[ , lines(time, V1, col = .GRP), by = grp] 
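To see why the {...; .SD} step matters, here is a small sketch assuming the DT from the question. The j expression returns whatever its last expression evaluates to, and plot() returns NULL, so a chained call would have no table left to work on. The names res_plain and res_sd are illustrative only.

```r
library(data.table)

set.seed(12039)
DT <- data.table(id = rep(1:100, each = 50),
                 grp = rep(letters[1:4], each = 1250),
                 time = rep(1:50, 100),
                 outcome = rnorm(5000))
DT_m <- DT[ , mean(outcome), by = .(grp, time)]

# j returns the value of its last expression; plot() returns NULL,
# so nothing is left for a following [ ... ] to operate on
res_plain <- DT_m[ , plot(NULL, xlim = range(time), ylim = range(V1))]
is.null(res_plain)

# Ending the braces with .SD hands the table on to the next [ ... ]
res_sd <- DT_m[ , {plot(NULL, xlim = range(time), ylim = range(V1)); .SD}]
is.data.table(res_sd)
```

The two-statement version with DT_m sidesteps the issue because each call starts again from the stored table.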

With output:

[image: data.table group by]

Much fancier results are possible; for example, if we wanted to specify particular colors for each group:

grp_col <- c(a = "blue", b = "black", 
      c = "darkgreen", d = "red") 
DT[ , mean(outcome), by = .(grp, time) 
    ][ , {plot(NULL, xlim = range(time), 
      ylim = range(V1)); .SD} 
     ][ , lines(time, V1, col = grp_col[.BY$grp]), by = grp] 

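NOTE:

There is a bug in RStudio which will cause this code to fail if the output is sent to the RStudio graphics device. As such, this approach only works when using R from the command line or when sending the output to an external device (I sent it to png to produce the above).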

See data.table issue #1524, this RStudio support ticket, and these SO Q&As (1, 2).
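Sending the output to an external png device, as mentioned in the note, might look roughly like this. It is a sketch under assumptions: out_file is a hypothetical path (a temporary file here), and the width and height are arbitrary.

```r
library(data.table)

set.seed(12039)
DT <- data.table(id = rep(1:100, each = 50),
                 grp = rep(letters[1:4], each = 1250),
                 time = rep(1:50, 100),
                 outcome = rnorm(5000))
grp_col <- c(a = "blue", b = "black",
             c = "darkgreen", d = "red")

# Hypothetical output path; tempfile() keeps the sketch self-contained
out_file <- tempfile(fileext = ".png")

png(out_file, width = 800, height = 600)  # open the external device
DT[ , mean(outcome), by = .(grp, time)
    ][ , {plot(NULL, xlim = range(time), ylim = range(V1),
               xlab = "time", ylab = "mean(outcome)"); .SD}
       ][ , lines(time, V1, col = grp_col[.BY$grp]), by = grp]
dev.off()                                 # close it so the file is written
```

Because the plot never touches the RStudio graphics device, this route should avoid the failure described above.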