2012-07-16

Loop that works with individual values in R

Here is my small dataset:

Indvidual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J") 
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA) 
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA) 
mydf <- data.frame (Indvidual, Parent1, Parent2) 

   Indvidual Parent1 Parent2
1          A    <NA>    <NA>
2          B    <NA>    <NA>
3          C       A       B
4          D       A       C
5          E       C       D
6          F       C       D
7          G       C       D
8          H       E    <NA>
9          I       A       D
10         J    <NA>    <NA>

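Consider that an individual may have two, one, or no known parents. I need to compare individuals and derive a score for each one from its parents' scores.

The rule: if either parent (the name in the Parent1 or Parent2 column) is known (not NA), the individual gets 1 extra point plus its parents' score. If both parents are known, only the higher-scoring parent counts.

Here is an example: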

Individual "A" has both parents unknown, so it gets score 0.
Individual "C" has both parents known (i.e. A, B),
so it gets 0 (the maximum of its parents' scores)

plus 1 (because at least one of its parents is known), giving a score of 1.

So the expected output is:

   Indvidual Parent1 Parent2 Scores Explanation
1          A    <NA>    <NA>      0 0 (max of parent scores, NA) + 0 (neither parent known)
2          B    <NA>    <NA>      0 0 (max of parent scores, NA) + 0 (neither parent known)
3          C       A       B      1 0 (max of parent scores) + 1 (either parent known)
4          D       A       C      2 1 (max of parent scores) + 1 (either parent known)
5          E       C       D      3 2 (max of parent scores) + 1 (either parent known)
6          F       C       D      3 2 (max of parent scores) + 1 (either parent known)
7          G       C       D      3 2 (max of parent scores) + 1 (either parent known)
8          H       E    <NA>      4 3 (max of parent scores) + 1 (either parent known)
9          I       A       D      3 2 (max of parent scores) + 1 (either parent known)
10         J    <NA>    <NA>      0 0 (max of parent scores, NA) + 0 (neither parent known)

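Explanation: as the loop proceeds, it takes the scores that have already been calculated into account when finding the maximum of the parents' scores.

Edit: in response to Chase's question

For example: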

Individual C has two parents, A and B, whose scores were calculated as 0 and 0
(rows 1 and 2 of the Scores column), so max(c(0, 0)) is 0.

Individual E has parents C and D, whose scores in the Scores column (rows 3 and 4)
are 1 and 2, respectively, so max(c(1, 2)) is 2.
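The row-by-row rule described above can be sketched as a plain loop. This is only a minimal sketch, not code from the thread: it assumes every parent appears in an earlier row than its children (true for the example data), and the `scores` vector is a name introduced here for illustration.

```r
# Same data as in the question (column name spelled as in the original).
Indvidual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)
mydf <- data.frame(Indvidual, Parent1, Parent2, stringsAsFactors = FALSE)

# Named vector of scores, filled in one row at a time.
scores <- setNames(rep(NA_real_, nrow(mydf)), mydf$Indvidual)

for (i in seq_len(nrow(mydf))) {
  parents <- c(mydf$Parent1[i], mydf$Parent2[i])
  parents <- parents[!is.na(parents)]      # keep only the known parents
  if (length(parents) == 0) {
    scores[i] <- 0                         # neither parent known
  } else {
    # Assumption: the parents' scores were already filled in by earlier rows.
    scores[i] <- max(scores[parents]) + 1
  }
}

mydf$Scores <- unname(scores)
print(mydf)
```

If a parent is listed after its child, the lookup returns NA and the child's score becomes NA, so this sketch only fits data that is already sorted by generation.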

Can you explain what you mean by "max of parent scores"? At first I thought this was what you needed, but I don't think that's the case: 'rowSums(!is.na(mydf[, -1]))' – Chase 2012-07-16 12:34:55


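Thanks Chase, see my recent edit and whether it makes sense... The idea is that as we go down the rows we calculate each individual's score, and if that individual happens to be a parent, its score is then used to calculate the score of its son/daughter. – SHRram 2012-07-16 12:54:11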


Ah, now I see: the entries under "Individual" can also be parents... OK, I'll think about this. It's much clearer now, thanks. – Chase 2012-07-16 13:10:41

Answers

1
Individual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J") 
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA) 
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA) 
mydf <- data.frame (Individual, Parent1, Parent2, stringsAsFactors = FALSE) 

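mydf$Scores <- NA 
# Individuals with both parents unknown start with a score of 0
mydf$Scores[rowSums(is.na(mydf[, c("Parent1", "Parent2")])) == 2] <- 0 
# Repeat until every individual has a score
while (any(is.na(mydf$Scores))) { 
    # Individuals whose score is already known
    KnownScores <- mydf[!is.na(mydf$Scores), c(1, 4)] 
    # Unscored individuals whose parents are each either already scored or NA
    ToCalculate <- mydf[ 
        mydf$Parent1 %in% c(KnownScores$Individual, NA) & 
        mydf$Parent2 %in% c(KnownScores$Individual, NA) & 
        is.na(mydf$Scores), 
        -4] 
    # Look up both parents' scores, take the row-wise maximum, and add 1
    ToCalculate$Score <- apply(
        merge(
            merge(
                ToCalculate, 
                KnownScores, 
                by.x = "Parent1", 
                by.y = "Individual", 
                all.x = TRUE 
            ), 
            KnownScores, 
            by.x = "Parent2", 
            by.y = "Individual", 
            all.x = TRUE 
        )[, 4:5], 
        1, 
        max, 
        na.rm = TRUE) + 1 
    # Copy the newly calculated scores back into mydf
    mydf <- merge(mydf, ToCalculate[, c(1, 4)], all.x = TRUE) 
    mydf$Scores[!is.na(mydf$Score)] <- mydf$Score[!is.na(mydf$Score)] 
    mydf$Score <- NULL 
} 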

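How long is this loop expected to take to finish? I waited about 15 minutes and then stopped the run... but the loop doesn't stop... I'm worried about what will happen with my large dataset... I'm using RGui 64-bit... is this expected? – SHRram 2012-07-16 13:57:14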


In fact, the loop doesn't seem to end... I waited 30 minutes. – SHRram 2012-07-16 14:11:10


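Did you copy-paste my code with the mydf data.frame, or did you use another data.frame? I get an immediate result. If you are using your own data, there may be a problem with the data, e.g. parents that are not included among the individuals. Run the loop manually and check whether any ToCalculate$Score becomes NA. – Thierry 2012-07-16 14:18:51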
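The data problem mentioned in the comment above (a parent that never appears among the individuals) can be checked for before the loop is run. The following is a rough sketch on a made-up data frame; `baddf`, `all_parents`, and `missing_parents` are hypothetical names used only for this illustration.

```r
# Hypothetical data: parent "Z" is never listed as an individual,
# so the while loop above could never assign a score to "C".
baddf <- data.frame(
  Individual = c("A", "B", "C"),
  Parent1    = c(NA, "A", "Z"),
  Parent2    = c(NA, NA, "B"),
  stringsAsFactors = FALSE
)

# Collect every known parent and find those missing from Individual.
all_parents <- c(baddf$Parent1, baddf$Parent2)
all_parents <- unique(all_parents[!is.na(all_parents)])
missing_parents <- setdiff(all_parents, baddf$Individual)

if (length(missing_parents) > 0) {
  message("Parents not listed as individuals: ",
          paste(missing_parents, collapse = ", "))
}
```

A similar safeguard inside the loop, such as stopping when `ToCalculate` has zero rows, would likely turn an endless loop into an immediate error, although that has not been tested against the asker's real data.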

2

An example using plyr and a recursive function:

library(plyr) 
Indvidual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J") 
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA) 
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA) 
mydf <- data.frame (Indvidual, Parent1, Parent2) 
scor.fun<-function(x,mydf){ 
    Explanation<-0 
    P1<-as.character(x$Parent1) 
    P2<-as.character(x$Parent2) 
    score<-as.numeric(!(is.na(P1)&&is.na(P2))) # 1 if at least one parent is known
    if(!(is.na(P1)||is.na(P2))){ 
     Explanation<-max(scor.fun(subset(mydf,Indvidual==P1),mydf)[1],scor.fun(subset(mydf,Indvidual==P2),mydf)[1]) 
     score<-score+Explanation 
    }else{ 
     Explanation<-ifelse(is.na(P1),0,scor.fun(subset(mydf,Indvidual==P1),mydf)[1]) 
     Explanation<-max(Explanation,ifelse(is.na(P2),0,scor.fun(subset(mydf,Indvidual==P2),mydf)[1])) 
     score<-score+Explanation 
    } 
    c(score,Explanation) 
} 

adply(mydf,1,scor.fun,mydf) 

Recursion is probably not the best idea on a large data frame.
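For larger data, one possible alternative to recursion is to update all scores at once and repeat until nothing changes. The sketch below is not from either answer: `score_pedigree` is a hypothetical helper, it uses only base R (`match`, `pmax`, `ifelse`), and it assumes that a parent not listed among the individuals should count as having score 0.

```r
# Hypothetical helper: iterate vectorised updates until the scores stop changing.
score_pedigree <- function(id, p1, p2) {
  i1 <- match(p1, id)                      # row of parent 1, NA if unknown/unlisted
  i2 <- match(p2, id)                      # row of parent 2, NA if unknown/unlisted
  has_parent <- !is.na(p1) | !is.na(p2)
  scores <- numeric(length(id))            # everyone starts at 0
  for (pass in seq_along(id)) {
    s1 <- scores[i1]; s1[is.na(s1)] <- 0   # assumption: unknown parent scores 0
    s2 <- scores[i2]; s2[is.na(s2)] <- 0
    new_scores <- ifelse(has_parent, pmax(s1, s2) + 1, 0)
    if (all(new_scores == scores)) return(scores)   # converged
    scores <- new_scores
  }
  stop("scores did not converge; the pedigree may contain a cycle")
}

Indvidual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)
result <- score_pedigree(Indvidual, Parent1, Parent2)
print(data.frame(Indvidual, Parent1, Parent2, Scores = result))
```

Each pass costs one vectorised lookup over all rows, and the number of passes is bounded by the number of generations plus one, so the row order of the data does not matter here.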