How do I perform a PCA on each group in a dataset with multiple groups?

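I have a dataset of individuals from four populations, four treatments, and three replicates. Each individual belongs to exactly one population, treatment, and replicate combination. I took four measurements from each individual. I would like to run a PCA on these measurements for each population, substrate, and replicate combination.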

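I know how to run a PCA on all individuals, and I could split the dataset into separate datasets, one for each combination of population, substrate, and replicate, and then run a PCA on each new dataset.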

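What is the most efficient way to run the PCA on the full dataset so that I get separate PC1, PC2, ... results for each combination of population, substrate, and replicate? I had the idea of converting the dataset into a list, but I am not sure how to apply the princomp function to a list. Am I on the right track?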
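The list idea mentioned above can be sketched with split() and lapply(). This is only a minimal illustration under assumptions: the data frame `toy` and its columns `m1` to `m3` are made up, and prcomp() is used in place of princomp() because princomp() refuses data with fewer rows than variables, which can easily happen in small groups.

```r
# Hypothetical toy data: two grouping columns and three numeric measurements
set.seed(1)
toy <- data.frame(
  Location  = rep(c("A", "B"), each = 10),
  Substrate = rep(c("X", "Y"), times = 10),
  m1 = rnorm(20), m2 = rnorm(20), m3 = rnorm(20)
)

# split() turns the data frame into a named list, one element per group combination
groups <- split(toy, list(toy$Location, toy$Substrate), drop = TRUE)

# lapply() runs the same PCA on every list element (measurement columns only)
pcas <- lapply(groups, function(g) prcomp(g[, c("m1", "m2", "m3")]))

names(pcas)            # one fitted PCA per Location.Substrate combination
head(pcas[["A.X"]]$x)  # scores (PC1, PC2, ...) for one group
```

Each element of `pcas` is an ordinary prcomp object, so `$x`, `$rotation`, and `$sdev` can be pulled out per group with further lapply() calls.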

Sample data:

TestData<- structure(list(Location = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
            "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
            "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", 
            "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D"), 
       Substrate = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D", 
          "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D", 
          "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D", 
          "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"), 
       Replicate = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
          1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
          1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
          1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), 
       Adult_Weight = c(0.0092, 0.0083, 0.0088, 0.0077, 0.0088, 0.01, 
           0.0099, 0.011, 0.0078, 0.0086, 0.0071, 0.0093, 
           0.0111, 0.01, 0.0097, 0.0091, 0.0083, 0.0098, 
           0.0093, 0.009, 0.0114, 0.0087, 0.0094, 0.0096, 
           0.0099, 0.0105, 0.0091, 0.0115, 0.0106, 0.0104, 
           0.0113, 0.0115, 0.0107, 0.0126, 0.0106, 0.0101, 
           0.0095, 0.0113, 0.0111, 0.0118, 0.0114, 0.0123, 
           0.0119, 0.0103, 0.0119, 0.0116, 0.0112, 0.0114), 
       Adult_Thorax_Width = c(1.31, 1.31, 1.43, 1.45, 1.52, 1.43, 1.57, 1.45, 1.43, 1.54, 1.32, 1.49, 
            1.58, 1.36, 1.42, 1.45, 1.48, 1.38, 1.55, 1.46, 1.52, 1.42, 1.6, 1.49, 
            1.48, 1.58, 1.51, 1.53, 1.54, 1.76, 1.63, 1.62, 1.44, 1.51, 1.53, 1.58, 
            1.46, 1.94, 1.54, 2.09, 1.5, 1.65, 1.86, 1.54, 1.8, 1.98, 1.82, 1.63), 
       Adult_Wing_Length = c(1359L, 1377L, 1555L, 1559L, 1562L, 1578L, 1580L, 1588L, 1597L, 1598L, 1603L, 1605L, 
            1612L, 1614L, 1616L, 1617L, 1623L, 1628L, 1639L, 1642L, 1643L, 1649L, 1651L, 1652L, 
            1653L, 1653L, 1654L, 1656L, 1656L, 1656L, 1662L, 1664L, 1665L, 1668L, 1670L, 1670L, 
            1671L, 1672L, 1674L, 1682L, 1685L, 1687L, 1688L, 1694L, 1698L, 1698L, 1707L, 1708L), 
       Adult_Leg_Length = c(414L, 390L, 627L, 541L, 430L, 450L, 451L, 462L, 443L, 582L, 435L, 579L, 
            499L, 418L, 444L, 646L, 589L, 466L, 435L, 477L, 450L, 606L, 660L, 450L, 
            446L, 480L, 462L, 438L, 483L, 454L, 492L, 457L, 463L, 499L, 470L, 474L, 
            627L, 478L, 473L, 496L, 666L, 499L, 480L, 461L, 450L, 483L, 460L, 584L)), 
       .Names = c("Location", "Substrate", "Replicate", "Weight", "Thorax_Width", "Wing_Length", "Leg_Length"), 
       row.names = c(NA, 48L), 
       class = "data.frame")

Source: Keith W. Larson, 2014-10-10

If you provide a dummy dataset, I will show you how. – 2014-10-10 11:00:53

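If I understand the composition of your data correctly, you need to enter your population and treatment as factor variables and have the three replicates as separate rows. The column types would look like this: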

Column 1, population: factor
Column 2, treatment: factor
Columns 3-6, measurements: numeric (4 columns in total)

The overall data class should preferably be 'data.frame', because in a 'data.frame' the columns can have different class types (unlike in a 'matrix').
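As a rough sketch of that layout, the snippet below builds a small data frame, converts the grouping columns to factors, and checks the column classes. The data frame `dat` and its column names are hypothetical and only serve as an illustration.

```r
# Hypothetical example data; the column names are made up for illustration
dat <- data.frame(
  Population = c("P1", "P1", "P2", "P2"),
  Treatment  = c("T1", "T2", "T1", "T2"),
  Weight     = c(0.0092, 0.0083, 0.0111, 0.0100),
  Wing       = c(1359, 1377, 1612, 1614),
  stringsAsFactors = FALSE
)

# Convert the grouping columns to factors
group_cols <- c("Population", "Treatment")
dat[group_cols] <- lapply(dat[group_cols], factor)

# Inspect the class of every column
sapply(dat, class)

# Logical index of the numeric measurement columns, usable for the PCA input
is_num <- sapply(dat, is.numeric)
```

Selecting columns with `is_num` keeps the factors out of the PCA input while leaving them available for grouping.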

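Below is an example that stratifies the example Iris dataset by a factor variable, here 'iris$Species'. If you want to stratify by multiple factors, you can pass two (or more) factors as the INDICES argument. Are you sure you do not actually mean a single PCA with annotations? That is easily done by converting your factor variables to numeric and using them to annotate the scatterplot, for example through the 'col' (= color) and 'pch' (= symbol) parameters.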

data(iris) # Load the example Iris-dataset 
class(iris) 
lapply(iris, FUN=class) 
#> class(iris) 
#[1] "data.frame" 
#> 
#> lapply(iris, FUN=class) 
#$Sepal.Length 
#[1] "numeric" 
# 
#$Sepal.Width 
#[1] "numeric" 
# 
#$Petal.Length 
#[1] "numeric" 
# 
#$Petal.Width 
#[1] "numeric" 
# 
#$Species 
#[1] "factor" 

par(mfrow=c(2,2), mar=c(4,4,2,1)) 
# Separate PCA plot for each Species 
# Apply our defined PCA-function where each unique INDICES are handled as a separate function call 
by(iris, INDICES=iris$Species, FUN=function(z){ 
    # Use numeric fields for the PCA 
    pca <- prcomp(z[,unlist(lapply(z, FUN=class))=="numeric"]) 
    plot(pca$x[,1:2], pch=16, main=z[1,"Species"]) # 2 first principal components 
    z 
}) 

# Color annotation 
# Use numeric fields for the PCA 
pca <- prcomp(iris[,unlist(lapply(iris, FUN=class))=="numeric"]) 
plot(pca$x[,1:2], pch=16, col=as.numeric(iris[,"Species"]), main="Color annotation") # 2 first principal components 
legend("bottom", pch=16, col=unique(as.numeric(iris[,"Species"])), legend=unique(iris[,"Species"]))

[Figure: PCA example]

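Note that the PCA axes are not the same in the first three panels, counting from the top left. This is because the covariance matrix used in the PCA computation differs when the PCA is computed separately for each group.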
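To make this concrete, here is a small sketch that compares a pooled PCA with per-species PCAs on the built-in iris data. The object names `pooled` and `per_group` are only illustrative. Keep in mind that the sign of a principal component is arbitrary, so a flipped sign alone does not indicate a different axis.

```r
data(iris)
num <- sapply(iris, is.numeric)

# PCA fitted on all observations at once
pooled <- prcomp(iris[, num])

# PCA fitted separately within each species
per_group <- lapply(split(iris[, num], iris$Species), prcomp)

# PC1 loadings side by side: the directions differ between the fits
round(cbind(pooled = pooled$rotation[, 1],
            sapply(per_group, function(p) p$rotation[, 1])), 2)

# The underlying covariance matrices differ as well
cov(iris[iris$Species == "setosa", num])
cov(iris[, num])
```

Because each group has its own axes, PC scores from different groups are not directly comparable; a single PCA with annotations keeps all observations in one coordinate system.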

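Alternatively, if you want a single PCA and only want to plot the observations belonging to each category in their own panel, you could try something along these lines: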

par(mfrow=c(1,3)) 
# Compute the PCA 
pca <- prcomp(iris[,unlist(lapply(iris, FUN=class))=="numeric"]) 
# Apply a plotting function over unique values of iris$Species, notice we always plot the same 'pca' object in all categories 
lapply(unique(iris$Species), FUN=function(z) { 
    plot(pca$x[which(z==iris$Species),1:2], xlim=extendrange(pca$x[,1]), ylim=extendrange(pca$x[,2]),pch=16, main=z) 
})

[Figure: pca2]

Edit:

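From the help file of the 'by' function: 'INDICES: a factor or a list of factors, each of length nrow(data)'. So if we supply the indices to 'by' as a list, we can stratify over multiple factor variables. Here is an artificial example in which 'first' and 'second' are two factors by which the data are analysed simultaneously. Extending this to three (or more) variables should be trivial: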

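# Use data.frame() rather than cbind() so the numeric columns stay numeric
# (the two factor vectors are recycled to 100 rows)
ex <- data.frame(matrix(rnorm(400), ncol=4), first = c("A", "B"), second = c("foo", "bar", "asd", "fgh", "jkl"))
by(ex, INDICES=list(ex[,"first"], ex[,"second"]), FUN=function(z) z) # Modify the function provided in FUN to suit your needs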
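As a hedged sketch of how FUN might be filled in, the snippet below runs prcomp() within every combination of two factors and returns the fitted object so the scores can be retrieved afterwards. The seed, the object name `fits`, and the choice to return the prcomp object are assumptions made for illustration.

```r
set.seed(42)
# Artificial data: four numeric columns and two grouping variables
ex <- data.frame(matrix(rnorm(400), ncol = 4),
                 first  = rep(c("A", "B"), length.out = 100),
                 second = rep(c("foo", "bar", "asd", "fgh", "jkl"), length.out = 100))

# One PCA per first x second combination; FUN returns the fitted object
fits <- by(ex, INDICES = list(first = ex$first, second = ex$second),
           FUN = function(z) prcomp(z[, sapply(z, is.numeric)]))

dim(fits)                   # 2 x 5 array of prcomp objects
fits[["A", "foo"]]$sdev     # standard deviations of the PCs in one cell
head(fits[["A", "foo"]]$x)  # scores (PC1, PC2, ...) for that cell
```

A third factor can be added as another element of the INDICES list. Note that a group needs at least two rows for its PCA to carry any information; in the sample data from the question each Location, Substrate, and Replicate combination occurs in only one row, so grouping by all three at once would leave nothing to decompose.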

Source: 2014-10-10 13:16:40

I have now included some sample data. The third column, the replicate number, is also a factor. Columns 4:7 are the measurements. – 2014-10-10 13:34:50

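I see how to use the 'by' command to build a function that performs the PCA over the single variable 'Species'. How can I do this for three variables: Location, Substrate, and Replicate? Of course I could create a new variable that merges the three fields, but is there a better way? – 2014-10-10 15:47:39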

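Hello Keith, with 'by' you can stratify the data by one or more factor variables. I have now edited my post with an example in which artificial data are split by two factor variables. Your list would contain three variables, each member of the list being one of the 'Location', 'Substrate', or 'Replicate' vectors. – 2014-10-12 01:15:50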
