Performing operations between two data frames in R

My question is simple: how do you run ks.test between two data frames?

For example, we have two data frames:

D1 <- data.frame(D$Ag, D$Al, D$As, D$Ba, D$Be, D$Ca, D$Cd, D$Co, D$Cu, D$Cr)
D2 <- data.frame(S$Ag, S$Al, S$As, S$Ba, S$Be, S$Ca, S$Cd, S$Co, S$Cu, S$Cr)

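Note: this is only an example. The real case will include more columns, each containing the concentration of a particular element at a particular location.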

Now I want to run ks.test between the two data frames:

ks.test(D$Ag,S$Ag) 
ks.test(D$Al,S$Al) 
ks.test(D$As,S$As)

and so on. How can I do this without the drudgery of typing every call by hand?

When I ran shapiro.test on a single data frame, I simply used:

lshap1 <- lapply(D1, shapiro.test) 
lres1 <- sapply(lshap1, `[`, c("statistic","p.value"))

I have read a bit about loops, aggregate and mapply, and tried different things such as:

apply(D1, 2, function(D2) ks.test(D2,D1[,1])$p.value)

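But then I get a lot of p-values = 0, which is not the case when I run the tests manually. I import the data as two data frames and then extract some of the data into "smaller" data frames for analysis; in this case, for example, I look at the toxic elements and exclude the others.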
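One likely reason for the zeros: in the apply() call above, the anonymous function's argument is named D2, but it receives each column of D1 in turn and compares it with D1[, 1], so the real D2 is never used. Columns on very different scales (Zn against As, say) then give p-values close to 0. A minimal sketch with made-up stand-in data (the values and the names p.within and p.between are assumptions for illustration only):

```r
set.seed(42)
# Stand-in data, not the real measurements: As and Zn live on very different scales
D1 <- data.frame(As = rnorm(25, 0, 0.2), Zn = rnorm(25, 50, 20))
D2 <- data.frame(As = rnorm(25, 0, 0.2), Zn = rnorm(25, 50, 20))

# Pattern from the question: every column of D1 is tested against D1[, 1]
p.within <- apply(D1, 2, function(x) ks.test(x, D1[, 1])$p.value)

# Column-by-column comparison across the two data frames (paired by position)
p.between <- mapply(function(x, y) ks.test(x, y)$p.value, D1, D2)

p.within
p.between
```

Here p.within compares As with itself and Zn with As, while p.between is the comparison that was actually intended.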

Here is the output of dput(head(D1)) and dput(head(D2)) for a smaller data frame.

D1 <- data.frame(DF$As,DF$Cd,DF$Cu,DF$Cr,DF$Ni,DF$Pb,DF$Zn) 
D2 <- data.frame(DO$As,DO$Cd,DO$Cu,DO$Cr,DO$Ni,DO$Pb,DO$Zn) 
##Output dput(head(D1)): 
structure(list(DF.As = c(-0.154868225169351, -0.291459578010276, 
0.0355227595866723, 0.0892191549433623, 0.189115121672669, 
-0.365222418641706 
), DF.Cd = c(1.28810277421719, 1.45844987179892, 0.642331353138319, 
0.673164023466527, 0.131548822144598, 0.146964746525726), DF.Cu = c(8.01131080231879,
6.52606822875086, 2.93449454196807, 4.08720148249298, 1.55494291704341, 
1.73663851851503), DF.Cr = c(0.164849379809527, 0.196759436988158, 
0.307645386162046, 0.302917612808149, 0.187202322026229, 0.25358922601195 
), DF.Ni = c(0.362592459542858, 0.527078409257359, 0.477116357433909, 
0.469287608844157, 0.225865184678244, 0.355321456594576), DF.Pb = c(0.414448963979605,
0.616598678960665, -0.0531899082482045, 0.47477978516042, 
0.422106471495816, 
0.0326241032568164), DF.Zn = c(74.7657982668, 74.2978919524635, 
36.6575117549406, 47.8440365300156, 21.4962811912273, 23.3823413091772 
)), .Names = c("DF.As", "DF.Cd", "DF.Cu", "DF.Cr", "DF.Ni", "DF.Pb", 
"DF.Zn"), row.names = c(NA, 6L), class = "data.frame") 
##Output dput(head(D2)): 
structure(list(DO.As = c(0.0150158517208966, -0.0477743050574027, 
-0.121541780066373, -0.0376195600535572, 0.115393920133327, 
0.265450918075612), DO.Cd = c(0.367936811743133, 0.445545318262818, 
0.350071986298948, 
0.331513644782201, 0.603874629105229, 0.598527030667747), DO.Cu = c(1.65127139067621,
1.90306634226191, 1.08280240161368, 1.12130376047927, 1.23137174481965, 
1.16618813144813), DO.Cr = c(0.162996340978278, 0.493799568371693, 
0.18441814919492, 0.179883906525139, 0.128058190333676, 0.030406737049484 
), DO.Ni = c(0.290717040452464, 0.331891307317008, 0.387987078391917, 
0.36147470695146, 0.774910299821917, 0.323259411199816), DO.Pb = c(-0.0584055598838365,
0.377799120780818, -0.0741768575020139, 0.511278669452117, 
0.320822577941608, 0.250377389869303), DO.Zn = c(16.5625482436821, 
14.5084409384572, 16.571001044493, 18.4509635406253, 15.6876446591721, 
12.7649440587945)), .Names = c("DO.As", "DO.Cd", "DO.Cu", "DO.Cr", "DO.Ni", 
"DO.Pb", "DO.Zn"), row.names = c(NA, 6L), class = "data.frame")

I am posting this because I still get an error:

## This is code for execution: 
col.names =colnames(D1) 
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D1,D2) 
## Output: 
Error in `[.data.frame`(d2, , t) : undefined columns selected

(As shown by the traceback button):

6.stop("undefined columns selected") 
5.`[.data.frame`(d2, , t) 
4.d2[, t] 
3.ks.test(d1[, t], d2[, t]) 
2.FUN(X[[i]], ...) 
1.lapply(col.names, function(t, d1, d2) {ks.test(d1[, t], d2[, t])}, D1, D2)

2017-10-06 Ib Nemer

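Note: my main goal is to compare the distributions of the two data sets using ks.test: column 1 with column 1, column 2 with column 2, column 3 with column 3, and so on.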

I created two data.frames, D1 and D2, with some random numbers and identical column names.

set.seed(12) 
D1 = data.frame(A=rnorm(n = 30,mean = 5,sd = 2.5),B=rnorm(n = 30,mean = 4.5,sd = 2.2),C=rnorm(n = 30,mean = 2.5,sd = 12)) 
D2 = data.frame(A=rnorm(n = 30,mean = 5,sd = 2.49),B=rnorm(n = 30,mean = 4.4,sd = 2.2),C=rnorm(n = 30,mean = 2,sd = 12))

Now we can loop over the column names and use each name to index both D1 and D2, running ks.test on the corresponding columns of the two data.frames.

col.names = colnames(D1) 
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D1,D2) 

#[[1]] 

#Two-sample Kolmogorov-Smirnov test 

#data: d1[, t] and d2[, t] 
#D = 0.167, p-value = 0.81 
#alternative hypothesis: two-sided 


#[[2]] 

#Two-sample Kolmogorov-Smirnov test 

#data: d1[, t] and d2[, t] 
#D = 0.233, p-value = 0.39 
#alternative hypothesis: two-sided 


#[[3]] 

#Two-sample Kolmogorov-Smirnov test 

#data: d1[, t] and d2[, t] 
#D = 0.2, p-value = 0.59 
#alternative hypothesis: two-sided
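If only the statistic and the p-value are needed, in the same spirit as the shapiro.test summary in the question, the list of tests can be condensed into a small table. This is just a sketch on the same simulated data; the names ks.list and ks.table are made up for illustration:

```r
set.seed(12)
D1 <- data.frame(A = rnorm(30, 5, 2.5),  B = rnorm(30, 4.5, 2.2), C = rnorm(30, 2.5, 12))
D2 <- data.frame(A = rnorm(30, 5, 2.49), B = rnorm(30, 4.4, 2.2), C = rnorm(30, 2, 12))

col.names <- colnames(D1)

# One two-sample KS test per shared column name
ks.list <- lapply(col.names, function(t) ks.test(D1[, t], D2[, t]))
names(ks.list) <- col.names

# Keep only the statistic and p-value of each test, one row per column
ks.table <- data.frame(
  column    = col.names,
  statistic = sapply(ks.list, function(x) unname(x$statistic)),
  p.value   = sapply(ks.list, function(x) x$p.value),
  row.names = NULL
)
print(ks.table)
```

A data frame like this is easier to sort or filter by p.value than the printed list of htest objects.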

Using the notation from your question, the following code should ideally work:

col.names =colnames(S) 
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D,S)
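One caveat, judging from the dput output in the question: building the data frames with data.frame(DF$As, ...) and data.frame(DO$As, ...) gives column names such as DF.As and DO.As, so a name taken from D1 does not exist in D2, which would explain "undefined columns selected". Below is a rough sketch of two possible workarounds, using made-up stand-in data (the values and the names res.pos and res.name are assumptions): pair the columns by position with Map(), or strip the prefixes so that both data frames share names.

```r
set.seed(1)
# Stand-in data with prefixed names like those in the dput output
D1 <- data.frame(DF.As = rnorm(20), DF.Cd = rnorm(20, 1), DF.Zn = rnorm(20, 50, 10))
D2 <- data.frame(DO.As = rnorm(20), DO.Cd = rnorm(20, 1), DO.Zn = rnorm(20, 15, 2))

# Option 1: pair columns by position; Map() walks both data frames in parallel
stopifnot(ncol(D1) == ncol(D2))
res.pos <- Map(function(x, y) ks.test(x, y), D1, D2)

# Option 2: strip the prefixes so both data frames share names, then index by name
names(D1) <- sub("^DF\\.", "", names(D1))
names(D2) <- sub("^DO\\.", "", names(D2))
res.name <- lapply(setNames(names(D1), names(D1)),
                   function(t) ks.test(D1[, t], D2[, t]))

sapply(res.name, function(x) x$p.value)
```

Pairing by position only makes sense when both data frames list the elements in the same order; matching by name is safer if the order might differ.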

2017-10-06 11:32:39 TUSHAr

Could you please explain what this line does: lapply(col.names, function(t, d1, d2){ks.test(d1[, t], d2[, t])}, D1, D2)

I get an error: "undefined columns selected". But when I ask for D1[, T] I get the list... mystical

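@lb Nemer: First, 't' is lowercase (D1[, t]). 'col.names' holds the names of all columns (which are the same in D1 and D2). So you loop over col.names, just as in a for loop, subset D1 and D2 by column name as 'D1[, t]' and 'D2[, t]', and pass them to the 'ks.test' function. I did not get any error, because I reproduced the example with the complete code. Perhaps you should check whether the column names of the two data.frames you are using are identical. – TUSHAr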
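To make the comment above concrete, here is a rough for-loop equivalent of the lapply() call, with the column-name check done up front. It is only a sketch on simulated data, and the name results is made up. As a side note, uppercase T is R's shorthand for TRUE, so D1[, T] selects every column, which is probably why it returned the whole data frame rather than an error.

```r
set.seed(12)
D1 <- data.frame(A = rnorm(30, 5, 2.5),  B = rnorm(30, 4.5, 2.2))
D2 <- data.frame(A = rnorm(30, 5, 2.49), B = rnorm(30, 4.4, 2.2))

# Check first that both data frames use exactly the same column names
if (!identical(colnames(D1), colnames(D2))) {
  stop("column names differ: ",
       paste(setdiff(colnames(D1), colnames(D2)), collapse = ", "))
}

# Explicit loop that does what the lapply() call does
col.names <- colnames(D1)
results <- vector("list", length(col.names))
names(results) <- col.names
for (t in col.names) {
  results[[t]] <- ks.test(D1[, t], D2[, t])
}

results$A$p.value
```

With the asker's real data the check would stop early and list the names (DF.As and so on) that have no counterpart in D2.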
