2017-08-27

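What exactly are ctr, Dist and Dim in the PCA summary in FactoMineR?

I am trying to run PCA and MCA on my dataset using the FactoMineR package.

I have a dataset and, after some preliminary cleaning, I applied the PCA() function to it. I am trying to understand the summary of the results.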

library(reshape) 
library(gridExtra) 
library(gdata) 
library(ggplot2) 
library(ggbiplot) 
library(FactoMineR) 

x <- read.csv('cars.csv',stringsAsFactors = FALSE) 
y <- na.omit(x) 

y <- y[,c(-8,-9)] 
s <- y[,-1] 
rownames(s) <- make.names(y[,1], unique = TRUE) 


res.pca <- PCA(s, quanti.sup = NULL, quali.sup=NULL,scale.unit = TRUE,ncp=2) 
summary(res.pca) 

This is what summary(res.pca) prints to my console:

Call: 
PCA(X = s, scale.unit = TRUE, ncp = 2, quanti.sup = NULL, quali.sup = NULL) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               4.788   0.729   0.258   0.125   0.063   0.036
% of var.             79.804  12.144   4.308   2.086   1.053   0.605
Cumulative % of var.  79.804  91.948  96.256  98.342  99.395 100.000

Individuals (the 10 first)
                             Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2
chevrolet.chevelle.malibu | 2.516 |  2.326  0.288  0.855 | -0.572  0.115  0.052 |
buick.skylark.320         | 3.307 |  3.206  0.548  0.940 | -0.683  0.163  0.043 |
plymouth.satellite        | 2.915 |  2.670  0.380  0.839 | -0.994  0.346  0.116 |
amc.rebel.sst             | 2.749 |  2.605  0.362  0.898 | -0.623  0.136  0.051 |
ford.torino               | 2.908 |  2.600  0.360  0.799 | -1.094  0.419  0.141 |
ford.galaxie.500          | 4.578 |  4.401  1.032  0.924 | -1.011  0.358  0.049 |
chevrolet.impala          | 5.210 |  4.920  1.289  0.892 | -1.368  0.655  0.069 |
plymouth.fury.iii         | 5.144 |  4.836  1.246  0.884 | -1.537  0.827  0.089 |
pontiac.catalina          | 5.165 |  4.910  1.285  0.904 | -1.041  0.379  0.041 |
amc.ambassador.dpl        | 4.406 |  4.056  0.876  0.847 | -1.668  0.974  0.143 |

Variables
                Dim.1    ctr   cos2    Dim.2    ctr   cos2
Cylinders    |  0.942 18.543  0.888 |  0.127  2.200  0.016 |
Displacement |  0.971 19.672  0.942 |  0.093  1.177  0.009 |
Horsepower   |  0.950 18.846  0.902 | -0.142  2.761  0.020 |
Weight       |  0.941 18.499  0.886 |  0.244  8.185  0.060 |
MPG          | -0.873 15.918  0.762 | -0.209  5.994  0.044 |
Acceleration | -0.639  8.522  0.408 |  0.762 79.683  0.581 |

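Although I understand everything else in this summary, I do not know what Dist, ctr and Dim mean for the data points, specifically in these rows: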

Individuals (the 10 first)
                             Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2
chevrolet.chevelle.malibu | 2.516 |  2.326  0.288  0.855 | -0.572  0.115  0.052 |
buick.skylark.320         | 3.307 |  3.206  0.548  0.940 | -0.683  0.163  0.043 |
plymouth.satellite        | 2.915 |  2.670  0.380  0.839 | -0.994  0.346  0.116 |
amc.rebel.sst             | 2.749 |  2.605  0.362  0.898 | -0.623  0.136  0.051 |

Answer


For illustration, let's look at the individuals summary table for the example dataset that ships with the package:

library(FactoMineR) 
data(decathlon) 
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13) 

> summary(res.pca) 
Call: 
PCA(X = decathlon, ncp = 5, quanti.sup = 11:12, quali.sup = 13) 
... 
Individuals (the 10 first)
          Dist   Dim.1   ctr  cos2   Dim.2   ctr  cos2   Dim.3   ctr
SEBRLE | 2.369 | 0.792 0.467 0.112 | 0.772 0.836 0.106 | 0.827 1.187
CLAY   | 3.507 | 1.235 1.137 0.124 | 0.575 0.464 0.027 | 2.141 7.960
KARPOV | 3.396 | 1.358 1.375 0.160 | 0.484 0.329 0.020 | 1.956 6.644
...

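Dist can be thought of as a composite measure of an individual across all relevant columns of the dataset: it is the Euclidean distance of the individual from the origin, calculated as sqrt(rowSums(X^2)), where X is the scaled version of the input dataset (computed after the supplementary variables have been trimmed out).

If the default PCA options are in place, i.e. scale.unit = TRUE, row.w = NULL, col.w = NULL, X should be equivalent to scale(as.matrix(<trimmed down dataset>)) * sqrt(n/(n-1)). I have not checked this for non-default options, because I find the intuitive explanation more important here than the detailed calculation.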

# verify the calculated values against summary table's Dist values 
> X <- scale(as.matrix(decathlon[,1:10])) * sqrt(nrow(decathlon)/(nrow(decathlon) - 1)) 
> sqrt(rowSums(X^2)) 
   SEBRLE      CLAY    KARPOV   BERNARD    YURKOV   WARNERS ZSIVOCZKY
 2.368839  3.507004  3.396399  2.762607  3.017906  2.427873  2.563128
... 

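Dim.X measures the projection onto principal component X of each individual's distance from the origin in the multi-dimensional space. To visualise this, use plot(res.pca, choix = "ind") to draw the individual factor map, adjust the xlim/ylim/axes parameters to zoom in on any particular individuals, and compare with the table values. See ?plot.PCA for more of the function's parameters.

# plot the individual factor map on the first two principal components
> plot(res.pca, axes = c(1, 2), choix = "ind")

# zoom in to check the coordinates of SEBRLE, CLAY and KARPOV
> plot(res.pca, axes = c(1, 2), choix = "ind", xlim = c(0, 2), ylim = c(0, 1))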

[Figure: individual factor map, zoomed in]
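To make the link between Dim.X and Dist concrete, here is a minimal base-R sketch. It does not call FactoMineR; it uses the built-in mtcars data (my choice for illustration, not the asker's cars.csv) and assumes the default options (scaled variables, uniform weights). The names dat, X, sv, coord, eig and dist_ind are hypothetical. The sign of a component is arbitrary, so columns may come out flipped relative to what PCA() reports.

```r
# Illustrative data: six numeric columns of the built-in mtcars dataset
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]
n <- nrow(dat)

# Standardise with the population standard deviation (divisor n)
X <- scale(as.matrix(dat)) * sqrt(n / (n - 1))

# Principal axes from the singular value decomposition of X
sv <- svd(X)
coord <- X %*% sv$v                      # coordinates of the individuals (Dim.X)
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))
eig <- sv$d^2 / n                        # eigenvalues (the "Variance" row)

# Distance of each individual from the origin (Dist)
dist_ind <- sqrt(rowSums(X^2))

head(round(cbind(Dist = dist_ind, coord[, 1:2]), 3))

# Squared coordinates over ALL components add up to the squared distance
all.equal(rowSums(coord^2), dist_ind^2)
```

In other words, each Dim.X value is one coordinate of the individual in the rotated space, and Dist is the length of the whole coordinate vector.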

ctr is the contribution of each individual to a given principal component, expressed as a percentage. You can get the full list of contributions from res.pca$ind$contrib. Each column sums to 100 (%).

# view each individual's contribution to each principal component
> head(res.pca$ind$contrib)
             Dim.1     Dim.2    Dim.3      Dim.4      Dim.5
SEBRLE  0.46715109 0.8359506 1.186888  3.1842186  1.7811617
CLAY    1.13695340 0.4635341 7.959744  0.2905893 13.8872052
KARPOV  1.37515734 0.3289363 6.643820  7.9543342  2.2523610
BERNARD 0.27693912 1.0740657 1.374952 11.3801552  0.4658144
YURKOV  0.25595504 6.3757577 2.605847  1.7611939  5.5775065
WARNERS 0.09494738 3.9862179 1.020117  0.8014610  3.5736432

# verify that the contributions to each principal component sum to 100%
> colSums(res.pca$ind$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
  100   100   100   100   100
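For the default case, the contribution can be reproduced by hand: with every individual weighted 1/n, the contribution of individual i to component k is 100 * (coordinate^2 / n) / eigenvalue. The sketch below shows this in base R on the built-in mtcars data (an illustrative choice); it is not FactoMineR's own code, and the names dat, X, sv, coord, eig and contrib are hypothetical. I have not worked this through for non-uniform row weights.

```r
# Illustrative data: six numeric columns of the built-in mtcars dataset
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]
n <- nrow(dat)

# Standardise with the population standard deviation (divisor n)
X <- scale(as.matrix(dat)) * sqrt(n / (n - 1))

sv <- svd(X)
coord <- X %*% sv$v          # coordinates of the individuals
eig <- sv$d^2 / n            # eigenvalue (variance) of each component

# Contribution in %, assuming each individual has weight 1/n:
# the individual's share of the component's total variance
contrib <- 100 * sweep(coord^2 / n, 2, eig, "/")
colnames(contrib) <- paste0("Dim.", seq_len(ncol(contrib)))

head(round(contrib[, 1:2], 3))

# Each column adds up to 100 (%)
colSums(contrib)
```

An individual with a large ctr on a component is one that pulls that axis strongly, which is why ctr is often scanned for outliers.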

cos2 is the squared cosine for each principal component, calculated as (Dim.X / Dist)^2. For a given principal component, the closer it is to 1, the better that component captures all the characteristics of the individual.

# verify the calculated values against summary table's cos2 values
> head((res.pca$ind$coord/res.pca$ind$dist)^2)
             Dim.1      Dim.2      Dim.3      Dim.4      Dim.5
SEBRLE  0.11167888 0.10610262 0.12183534 0.24588345 0.08911755
CLAY    0.12400941 0.02684265 0.37278712 0.01023775 0.31701007
KARPOV  0.15991886 0.02030911 0.33175306 0.29878849 0.05481905
BERNARD 0.04867778 0.10023262 0.10377289 0.64611132 0.01713585
YURKOV  0.03769960 0.49858212 0.16480554 0.08379015 0.17193305
WARNERS 0.02160805 0.48164324 0.09968563 0.05891525 0.17021193
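A useful consequence of that definition is that an individual's cos2 values add up to 1 when summed over all components, so each value reads as the share of the individual's squared distance explained by that axis. The base-R sketch below checks this on the built-in mtcars data (an illustrative choice, with hypothetical names dat, X, sv, coord, dist_ind and cos2). In a PCA() result only the first ncp components are kept, so the row sums there will generally be below 1.

```r
# Illustrative data: six numeric columns of the built-in mtcars dataset
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]
n <- nrow(dat)

# Standardise with the population standard deviation (divisor n)
X <- scale(as.matrix(dat)) * sqrt(n / (n - 1))

sv <- svd(X)
coord <- X %*% sv$v                  # coordinates on all components
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))
dist_ind <- sqrt(rowSums(X^2))       # distance from the origin

# Squared cosine: (coordinate / distance)^2, row by row
cos2 <- (coord / dist_ind)^2

head(round(cos2[, 1:2], 3))

# Over all components the squared cosines of an individual sum to 1
range(rowSums(cos2))

# Quality of representation on the first two axes only
head(round(rowSums(cos2[, 1:2]), 3))
```

The last line is the quantity to look at before interpreting an individual's position on a two-dimensional factor map.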

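For variables, the interpretation of Dim.X / ctr / cos2 is similar. The exact calculations are more involved, especially if you specify non-uniform weights for rows/columns. You can look at the code of PCA for the details.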
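For the default case only (scaled variables, uniform weights), the variable columns can be sketched as follows: the coordinate is the correlation between the variable and the component, cos2 is its square, and ctr is 100 * cos2 / eigenvalue. This is a base-R illustration on the built-in mtcars data with hypothetical names (dat, X, sv, eig, var_coord, var_cos2, var_contrib), not FactoMineR's implementation. It is consistent with the Variables table in the question, where for Cylinders 0.942^2 is roughly 0.888 and 100 * 0.888 / 4.788 is roughly 18.5.

```r
# Illustrative data: six numeric columns of the built-in mtcars dataset
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]
n <- nrow(dat)

# Standardise with the population standard deviation (divisor n)
X <- scale(as.matrix(dat)) * sqrt(n / (n - 1))

sv <- svd(X)
eig <- sv$d^2 / n                                 # eigenvalues

# Variable coordinates: loadings scaled by sqrt(eigenvalue)
var_coord <- sweep(sv$v, 2, sqrt(eig), "*")
dimnames(var_coord) <- list(colnames(X), paste0("Dim.", seq_along(eig)))

var_cos2 <- var_coord^2                           # quality of representation
var_contrib <- 100 * sweep(var_cos2, 2, eig, "/") # contribution in %

round(var_coord[, 1:2], 3)

# The coordinates equal the variable/component correlations
all.equal(var_coord, cor(X, X %*% sv$v), check.attributes = FALSE)

# Contributions to each component add up to 100 (%)
colSums(var_contrib)
```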