Problem with levels in xtabs in R
For a sample data frame:
df <- structure(list(area = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L), .Label = c("a1", "a2", "a3", "a4"), class = "factor"),
result = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L),
weight = c(0.5, 0.8, 1, 3, 3.4, 1.6, 4, 1.6, 2.3, 2.1, 2,
1, 0.1, 6, 2.3, 1.6, 1.4, 1.2, 1.5, 2, 0.6, 0.4, 0.3, 0.6,
1.6, 1.8)), .Names = c("area", "result", "weight"), class = "data.frame", row.names = c(NA,
-26L))
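I am trying to find the areas with the highest and lowest weighted percentage, then produce a weighted cross-tabulation, which I will then use to calculate a risk difference.
library(data.table)
df.summary <- setDT(df)[,.(.N, freq.1 = sum(result==1), result = weighted.mean((result==1),
w = weight)*100), by = area]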
#Include only regions with highest or lowest percentage
df.summary <- data.table(df.summary)
incl <- df.summary[c(which.min(result), which.max(result)),area]
df.new <- df[df$area %in% incl,]
incl
"incl" contains the two areas I want, but it still has four levels:
[1] a2 a3
Levels: a1 a2 a3 a4
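How do I get rid of the unused levels? For the subsequent analysis I want only the two levels, both in "incl" and in the area column of df.new. Any ideas?
But it is a data.table, so 'df.new[, area := factor(area)]', which saves repeating the variable name 'df.new', would be more idiomatic.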
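For reference, here is a minimal sketch of two ways to drop the unused levels: base R's droplevels() on the factor, and the data.table idiom from the comment. The small table below is a made-up stand-in rather than the question's data, and the names area.base and xt are only illustrative.

```r
library(data.table)

# Small illustrative stand-in for the question's data (values are made up)
df <- data.table(
  area   = factor(c("a1", "a2", "a2", "a3", "a3", "a4")),
  result = c(0L, 1L, 0L, 1L, 1L, 0L),
  weight = c(0.5, 1.6, 2.3, 1.4, 1.2, 0.4)
)

# Keep only two areas; subsetting does not drop unused factor levels
incl   <- factor(c("a2", "a3"), levels = levels(df$area))
df.new <- df[area %in% incl]
levels(df.new$area)   # still all four levels

# Option 1: base R, returns a new factor with the unused levels removed
area.base <- droplevels(df.new$area)

# Option 2: data.table idiom, rebuilds the factor and updates df.new by reference
df.new[, area := factor(area)]
levels(df.new$area)

# The weighted cross-tabulation now only has rows for the two kept areas
xt <- xtabs(weight ~ area + result, data = df.new)
xt
```

The same droplevels() call applies to incl itself if the vector of selected areas also needs only two levels.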