2016-11-06 51 views

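Efficiently populating a matrix

I have data in which the treatment outcomes of a set of drugs on a set of subjects were measured across a set of hospitals (#drugs > #subjects > #hospitals).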

subjects <- paste("S",1:100,sep="_") 
drugs <- paste("D",1:1000,sep="_") 

A data.frame in which each row is a drug, subject, hospital, outcome combination:

df <- expand.grid(subject=subjects,drug=drugs,stringsAsFactors=F) 
hospitals <- paste("H",1:10,sep="_") 
df$hospital <- rep(sapply(hospitals,function(h) rep(h,10)),200) 
set.seed(1) 
df$outcome <- runif(nrow(df),0,100) 

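Now I want to build a matrix in which each row is a unique hospital-subject combination and each column is a unique hospital-drug combination. Here is one possible, though not very efficient, way to build this matrix: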

df$hospital.subject <- paste(df$hospital,df$subject,sep=":") 
df$hospital.drug <- paste(df$hospital,df$drug,sep=":") 

hospital.subject <- unique(paste(df$hospital,df$subject,sep=":")) 
hospital.drug <- unique(paste(df$hospital,df$drug,sep=":")) 

mat <- do.call(rbind,lapply(hospital.subject, function(x){ 
    hospital.subject.df <- dplyr::filter(df,hospital.subject==x) 
    res <- rep(NA,length(hospital.drug)) 
    match.idx <- match(hospital.drug,hospital.subject.df$hospital.drug) 
    res[which(!is.na(match.idx))] <- hospital.subject.df$outcome[match.idx[which(!is.na(match.idx))]] 
    return(res) 
})) 
rownames(mat) <- hospital.subject 
colnames(mat) <- hospital.drug 

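So question #1 is whether this matrix can be built more efficiently.

Now, since the matrix is sparse, I want to impute the missing values: for each hospital.subject combination, fill in the hospital.drug combinations in which those subjects were not observed, based on the hospital.drug combinations in which they were observed, by drawing from a normal distribution with mean = median and sd = mad of the observed values.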
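To make the rule concrete, here is a minimal sketch of the draw for a single observed column. The values in `observed` are made up for illustration, and the names `center`, `spread` and `draws` are hypothetical; only `median`, `mad` and `rnorm` come from base R.

```r
set.seed(1)

# Hypothetical outcomes of one hospital's subjects for one drug
observed <- c(26.6, 37.2, 57.3, 90.8, 20.2)

center <- median(observed)  # robust location, 37.2 here
spread <- mad(observed)     # by default 1.4826 * median(|x - median(x)|)

# Nine imputed values, e.g. one per other hospital for a single subject
draws <- rnorm(9, mean = center, sd = spread)
```

Using the median and the MAD instead of the mean and the standard deviation makes the imputed distribution less sensitive to a few extreme outcomes.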

In other words, subjects[1:10], which were observed only in hospitals[1], get their values for hospitals[2:10] filled in from hospitals[1], separately for each drug. This means:

mat[1:10,2:10] <- rnorm(90,median(mat[1:10,1]),mad(mat[1:10,1]))

mat[1:10,12:20] <- rnorm(90,median(mat[1:10,11]),mad(mat[1:10,11]))

And so on for the other hospitals (in the rows of mat), e.g.:

mat[31:40,c(1:3,5:10)] <- rnorm(90,median(mat[31:40,4]),mad(mat[31:40,4]))

mat[31:40,c(11:13,15:20)] <- rnorm(90,median(mat[31:40,14]),mad(mat[31:40,14]))

With a for loop I would do it like this:

for(h in seq_along(hospitals)){
    # rows (subjects) and columns (drugs) that belong to hospital h
    row.idx <- which(grepl(paste0(hospitals[h],":"),hospital.subject))
    col.idx <- which(grepl(paste0(hospitals[h],":"),hospital.drug))
    for(i in seq_along(col.idx)){
        drug <- strsplit(hospital.drug[col.idx[i]],split=":")[[1]][2]
        # columns of the same drug in the other hospitals; setdiff drops the observed column by value
        impute.idx <- setdiff(which(grepl(paste0(":",drug,"$"),hospital.drug,perl=TRUE)),col.idx[i])
        mat[row.idx,impute.idx] <- rnorm(length(row.idx)*length(impute.idx),mean=median(mat[row.idx,col.idx[i]]),sd=mad(mat[row.idx,col.idx[i]]))
    }
}

Is there a more efficient and more elegant way to achieve this?

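One more point: my real data are not as neatly organized as this example, because the number of subjects per hospital is not the same, and in addition there are subjects who were treated with the same drug in more than one hospital.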

Answers


Is this what you want?

df$hos.sub=paste(df$hospital,df$subject) 
df$hos.dru=paste(df$hospital,df$drug) 

ind1 <- list(factor(df$hos.sub),factor(df$hos.dru)) 
res<-tapply(df[,"outcome"],ind1,mean) 
head(res[,1:9])

            H_1 D_1   H_1 D_10  H_1 D_100  H_1 D_1000  H_1 D_101   H_1 D_102  H_1 D_103  H_1 D_104  H_1 D_105
H_1 S_1   26.550866  83.189899   6.516364    45.77171   6.471249  26.6257392   81.14044   9.088058   67.64499
H_1 S_10   6.178627   4.288589  45.675309    77.90078   3.338293  95.5751769   92.02642  49.810641   14.31814
H_1 S_2   37.212390  76.684275  27.743618    21.32599  67.661240  66.0476814   82.46891  97.271288   88.86986
H_1 S_3   57.285336  27.278032  60.041069    55.22206  73.537169  21.2416518   91.60083  85.267414   95.01507
H_1 S_4   90.820779  18.816330  27.314448    13.21052  11.129967   0.5266102   72.34151  49.899330   91.69972
H_1 S_5   20.168193  22.576183  94.148905    44.60504   4.665462  10.2902506   91.02545  27.440370   90.51900
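If every hospital-subject / hospital-drug pair occurs at most once, the same matrix could also be filled with a single matrix-indexing assignment instead of `tapply` or a per-row `lapply`. The following is only a sketch under that assumption, on a small toy data set with made-up sizes; `row.key`, `col.key` and `idx` are hypothetical helper names. With duplicated pairs the last value would win, whereas `tapply(..., mean)` averages them.

```r
set.seed(1)

# Toy data (hypothetical sizes): 3 hospitals, 2 subjects each, 4 drugs
hospitals <- paste("H", 1:3, sep = "_")
subjects  <- paste("S", 1:6, sep = "_")
drugs     <- paste("D", 1:4, sep = "_")
df <- expand.grid(subject = subjects, drug = drugs, stringsAsFactors = FALSE)
df$hospital <- rep(rep(hospitals, each = 2), times = length(drugs))
df$outcome  <- runif(nrow(df), 0, 100)

# Row and column key of every observation
row.key <- paste(df$hospital, df$subject, sep = ":")
col.key <- paste(df$hospital, df$drug, sep = ":")
hospital.subject <- unique(row.key)
hospital.drug    <- unique(col.key)

# Start from an all-NA matrix and fill the observed cells in one step
mat <- matrix(NA_real_,
              nrow = length(hospital.subject), ncol = length(hospital.drug),
              dimnames = list(hospital.subject, hospital.drug))

# Row i of idx is the (row, column) position of df$outcome[i]
idx <- cbind(match(row.key, hospital.subject), match(col.key, hospital.drug))
mat[idx] <- df$outcome
```

Because `match` is vectorized, the cost grows with the number of observations rather than with the number of matrix rows times a data frame filter.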

I don't think this imputes in the way described in my question. – dan
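For the imputation step that the comment refers to, one possible vectorized sketch is shown below. It is not the asker's or the answerer's code. It assumes the rule from the question (missing cells of a hospital's subjects are drawn from a normal distribution whose mean and sd are the median and MAD of the outcomes observed in that hospital for the same drug), it uses a toy data set with made-up sizes, and the names `center`, `spread`, `miss`, `key` and `imputed` are hypothetical.

```r
set.seed(1)

# Toy data (hypothetical sizes): 3 hospitals, 4 subjects each, 5 drugs
hospitals <- paste("H", 1:3, sep = "_")
subjects  <- paste("S", 1:12, sep = "_")
drugs     <- paste("D", 1:5, sep = "_")
df <- expand.grid(subject = subjects, drug = drugs, stringsAsFactors = FALSE)
df$hospital <- rep(rep(hospitals, each = 4), times = length(drugs))
df$outcome  <- runif(nrow(df), 0, 100)

# Wide matrix built with tapply; unobserved combinations are NA
mat <- tapply(df$outcome,
              list(paste(df$hospital, df$subject, sep = ":"),
                   paste(df$hospital, df$drug, sep = ":")),
              mean)

# Median and MAD of the observed outcomes per hospital and drug
center <- tapply(df$outcome, list(df$hospital, df$drug), median)
spread <- tapply(df$outcome, list(df$hospital, df$drug), mad)

# Hospital of every row and drug of every column, parsed from the dimnames
row.hosp <- sub(":.*$", "", rownames(mat))
col.drug <- sub("^.*:", "", colnames(mat))

# Positions of all missing cells and the (hospital, drug) key of each one
miss <- which(is.na(mat), arr.ind = TRUE)
key  <- cbind(row.hosp[miss[, "row"]], col.drug[miss[, "col"]])

# One rnorm call fills every missing cell; observed cells stay untouched
imputed <- mat
imputed[miss] <- rnorm(nrow(miss), mean = center[key], sd = spread[key])
```

Because the statistics are keyed by hospital and drug rather than by fixed index ranges, the sketch does not require equal numbers of subjects per hospital. A hospital-drug pair with no observations at all would produce NA parameters, so such cells would stay NA and need separate handling.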