2016-11-29 111 views

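Repeated calculation based on a condition

What I want to do is simple. However, I am new to R, have not learned much about loops and functions, and am not sure what the most efficient way to get the result is. Basically, I want to count the rows that meet my conditions and then divide one count by the other. Here is an example: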

df1 <- data.frame(
    Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076), 
    B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537), 
    C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662), 
    D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382) 
) 
df1 
# Main  B   C   D 
# 1 0.008900 NA   0.018100 -0.012700 
# 2 -0.050667 0.034500 0.000000 -0.025995 
# 3 -0.030379 -0.068300 -0.056197 -0.042930 
# 4 0.066484 -0.052774 0.040794 0.057816 
# 5 0.006439 0.014661 0.035160 0.033458 
# 6 -0.026076 -0.040537 -0.022662 -0.058382 

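For the numerator, my criterion is the number of rows where B/C/D > 0 and Main > 0; for the denominator, it is the number of rows where B/C/D != 0 and Main != 0. I can use length(which(df1$Main >0 & df1$B>0))/length(which(df1$Main !=0 & df1$B !=0)) to get the ratio for each column individually. But my dataset has many more columns, and I would like to know whether there is a way to get all of those ratios at once, so that my result looks like this: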

# B   C   D 
# 1 0.2  0.6  0.3 

Answers

2

Use apply:

apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0))/length(which(df1$Main !=0 & x !=0))) 
1
criteria1 <- df1[which(df1$Main > 0), -1] > 0 
criteria2 <- df1[which(df1$Main != 0), -1] != 0 
colSums(criteria1, na.rm = T)/colSums(criteria2, na.rm = T) 
##   B   C   D 
## 0.2000000 0.6000000 0.3333333 

Edit: it seems Niek's approach is the fastest for this particular data.

# Unit: microseconds 
#   expr  min  lq  mean median  uq  max neval 
#  Jim(df1) 216.468 230.0585 255.3755 239.8920 263.6870 802.341 300 
# emilliman5(df1) 120.109 135.5510 155.9018 142.4615 156.0135 1961.931 300 
#  Niek(df1) 97.118 107.6045 123.5204 111.1720 119.6155 1966.830 300 
#  nine89(df1) 211.683 222.6660 257.6510 232.2545 252.6570 2246.225 300 
#[[1]] 
#   [,1] [,2]  [,3] [,4] 
#median 239.892 142.462 111.172 232.255 
#ratio 1.000 0.594 0.463 0.968 
#diff  0.000 -97.430 -128.720 -7.637 

However, when there are many columns, the vectorized approaches are faster.

Nrow <- 1000 
Ncol <- 1000 
mat <- matrix(runif(Nrow*Ncol),Nrow) 
df1 <- data.frame(Main = sample(-2:2,Nrow,T), mat) #1001 columns 

#Unit: milliseconds 
#   expr  min  lq  mean median  uq  max 
#  Jim(df1) 46.75627 53.88500 66.93513 56.58143 62.04375 185.0460 
#emilliman5(df1) 73.35257 91.87283 151.38991 178.53188 185.06860 292.5571 
#  Niek(df1) 68.17073 76.68351 89.51625 80.14190 86.45726 200.7119 
# nine89(df1) 51.36117 56.79047 74.53088 60.07220 66.34270 191.8294 

#[[1]] 
#   [,1] [,2] [,3] [,4] 
#median 56.581 178.532 80.142 60.072 
#ratio 1.000 3.155 1.416 1.062 
#diff 0.000 121.950 23.560 3.491 
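
The call that produced these timings is not shown. The table layout suggests the microbenchmark package, but that is an assumption. Below is a rough base-R sketch of a similar comparison using system.time() and two of the approaches. The names big, loop_version and vectorized_version, and the set.seed() call, are added here for illustration only, and timings will differ from machine to machine.

```r
set.seed(1)  # added for reproducibility; not part of the original benchmark
Nrow <- 1000
Ncol <- 1000
mat <- matrix(runif(Nrow * Ncol), Nrow)
big <- data.frame(Main = sample(-2:2, Nrow, TRUE), mat)  # 1001 columns

# Loop approach: one ratio per column, result vector preallocated
loop_version <- function(d) {
    out <- numeric(ncol(d) - 1)
    for (i in 2:ncol(d)) {
        out[i - 1] <- length(which(d$Main > 0 & d[, i] > 0)) /
            length(which(d$Main != 0 & d[, i] != 0))
    }
    out
}

# Vectorized approach: filter rows once, then count TRUEs per column
vectorized_version <- function(d) {
    num <- colSums(d[which(d$Main > 0), -1] > 0, na.rm = TRUE)
    den <- colSums(d[which(d$Main != 0), -1] != 0, na.rm = TRUE)
    num / den
}

# The timings are only comparable if both approaches return the same ratios
same <- isTRUE(all.equal(loop_version(big), unname(vectorized_version(big))))
same

# Crude timing: total elapsed time over 5 repetitions of each approach
system.time(for (k in 1:5) loop_version(big))
system.time(for (k in 1:5) vectorized_version(big))
```

system.time() is much coarser than a dedicated benchmarking package, so treat the numbers as a rough indication only.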

Functions

Jim <- function(df1){ 
    criteria1 <- df1[which(df1$Main > 0), -1] > 0 
    criteria2 <- df1[which(df1$Main != 0), -1] != 0 
    colSums(criteria1, na.rm = T)/colSums(criteria2, na.rm = T) 
} 


emilliman5 <- function(df1){ 
    apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0))/length(which(df1$Main !=0 & x !=0))) 
} 

Niek <- function(df1){ 
    ratio1<-vector() 
    for(i in 2:ncol(df1)){ 
     ratio1[i-1] <- length(which(df1$Main >0 & df1[,i]>0))/length(which(df1$Main !=0 & df1[,i] !=0)) 
    } 
    ratio1 
} 

nine89 <- function(df){ 
    tail(colSums(df[df$Main>0,]>0, na.rm = T)/colSums(df[df$Main!=0,]!=0, na.rm = T), -1) 
} 
1

One way to do this would be a for loop that iterates over the columns and applies the expression you wrote. Something like this:

ratio1<-vector() 
for(i in 2:ncol(df1)){ 
ratio1[i-1] <- length(which(df1$Main >0 & df1[,i]>0))/length(which(df1$Main !=0 & df1[,i] !=0)) 
} 

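There may be a better way to do this with apply or data.table, but this is a simple solution I could come up with. It works for any number of columns. If you want the answers to one decimal place, use round().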
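
The same calculation can also be written with vapply(), a member of the apply family, and rounded afterwards. This is only a sketch: the helper name ratio_for is made up for illustration, and df1 is the data frame from the question.

```r
df1 <- data.frame(
    Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076),
    B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537),
    C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662),
    D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382)
)

# Hypothetical helper: ratio for one column x, given the Main column.
# sum(..., na.rm = TRUE) counts TRUEs and skips NA, like length(which(...)).
ratio_for <- function(x, main) {
    sum(main > 0 & x > 0, na.rm = TRUE) / sum(main != 0 & x != 0, na.rm = TRUE)
}

# vapply() walks over every column except the first and returns
# a named numeric vector (one value per column)
ratio1 <- vapply(df1[-1], ratio_for, numeric(1), main = df1$Main)

# Round to one decimal place, as in the desired output of the question
round(ratio1, 1)
```

Unlike the loop, vapply() keeps the column names on the result and checks that each column yields exactly one number.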

2

You can do this in a vectorized way (no apply or for needed):

tail(colSums(df1[df1$Main>0,]>0, na.rm = TRUE)/colSums(df1[df1$Main!=0,]!=0, na.rm = TRUE), -1) 

#  B   C   D 
#0.2000000 0.6000000 0.3333333
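
To see why this works, the one-liner can be unpacked into steps. This is a minimal sketch using the df1 from the question. The intermediate names (pos_rows, nonzero_rows, numer, denom, ratios) are illustrative only and do not come from the original answer.

```r
df1 <- data.frame(
    Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076),
    B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537),
    C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662),
    D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382)
)

# Rows that qualify for the numerator (Main > 0) and the denominator (Main != 0)
pos_rows <- df1[df1$Main > 0, ]
nonzero_rows <- df1[df1$Main != 0, ]

# Comparing a data frame with a scalar gives a logical matrix;
# colSums() counts the TRUEs per column, and na.rm = TRUE skips the NA in B
numer <- colSums(pos_rows > 0, na.rm = TRUE)
denom <- colSums(nonzero_rows != 0, na.rm = TRUE)

# Drop the first element, which is the Main column compared with itself
ratios <- tail(numer / denom, -1)
ratios
```

For this data numer is 3, 1, 3, 2 and denom is 6, 5, 5, 6 for Main, B, C, D, so the ratios for B, C, D are 1/5, 3/5 and 2/6.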