R: the aggregate function apparently causes a loss of column levels?

I just ran into a strange situation in RGui... I used the same script as always to get my data.frame into the right shape for ggplot2. My data look like this:

 time days treatment nucleic_acid habitat parallel disturbance  variable cellcounts  value 
1 1 2 control   dna water  1   none  Proteobacteria  batch  0.000000000 
2 2 22 control   dna water  1   none  Proteobacteria  batch  0.003586543 
3 1 2 treated   dna water  1   none  Proteobacteria  batch  0.000000000 
4 2 22 treated   dna biofilm  1   none  Proteobacteria  NA  0.000000000 

'data.frame': 185648 obs. of 10 variables: 
$ time  : int 5 5 5 5 5 5 6 6 6 6 ... 
$ days  : int 62 62 62 62 62 62 69 69 69 69 ... 
$ treatment : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ... 
$ parallel : int 1 2 3 1 2 3 1 2 3 1 ... 
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ... 
$ habitat  : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ... 
$ cellcounts : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ... 
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ... 
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ value  : num 0 0 0 0 0 0 0 0 0 0 ...

I want to use aggregate to calculate the mean of my 3 parallels:

df_mean<-aggregate(value~time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean)

Afterwards, the level "biofilm" of the column "habitat" is lost.

df_mean<-droplevels(df_mean) 

str(df_mean) 
'data.frame': 44608 obs. of 9 variables: 
$ time  : int 1 2 1 2 1 2 1 2 1 2 ... 
$ days  : int 2 22 2 22 2 22 2 22 2 22 ... 
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ... 
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ... 
$ habitat  : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ... 
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ... 
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ... 
$ cellcounts : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ value  : num 0 0.00359 0 0 0 ...

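It took me a lot of time to track this down (and I have only just realized that several other problems I had now all seem to be related to aggregate). When I remove the column "cellcounts", it works. Interestingly, the columns "cellcounts" and "habitat" always coincide in the "biofilm" case, so the information is redundant ("biofilm" always goes with "NA"). Is this the cause? But it always worked before, so I can't get my head around it. Has the base::aggregate function changed? Do you have an explanation for me? I'm using R-3.4.0, with the additional packages reshape, reshape2 and ggplot2.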

Thanks a lot, the confused crazysantaclaus

Source

2017-08-16 crazysantaclaus

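The problem comes from the NA values. Maybe your file was loaded differently in the past, so that these were stored as strings rather than as NA values? Here is a way to fix it by setting them to the string "NA":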

levels(df$cellcounts) <- c(levels(df$cellcounts),"NA") 
df$cellcounts[is.na(df$cellcounts)] <- "NA" 
df_mean <- aggregate(value ~ time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean,na.rm=TRUE) 
df_mean<-droplevels(df_mean) 
str(df_mean) 

'data.frame': 4 obs. of 9 variables: 
$ time  : int 1 2 1 2 
$ days  : int 2 22 2 22 
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2 
$ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1 
$ habitat  : Factor w/ 2 levels "biofilm","water": 2 2 2 1 
$ disturbance : Factor w/ 1 level "none": 1 1 1 1 
$ variable : Factor w/ 1 level "Proteobacteria": 1 1 1 1 
$ cellcounts : Factor w/ 2 levels "batch","NA": 1 1 1 2 
$ value  : num 0 0.00359 0 0
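
As a minimal sketch of why the rows disappear: the formula method of aggregate uses na.action = na.omit by default, so any row with an NA in one of the variables named in the formula is dropped before grouping. The toy data frame below is only modelled on the four rows from the question, and the replacement label "none" is a hypothetical choice, not something taken from the original data.

```r
# Toy data modelled on the question: the only "biofilm" row has cellcounts = NA
df <- data.frame(
  habitat    = factor(c("water", "water", "water", "biofilm")),
  cellcounts = factor(c("batch", "batch", "batch", NA)),
  value      = c(0, 0.003586543, 0, 0)
)

# Default behaviour: the row with the NA is removed before grouping,
# so no "biofilm" group can appear in the result
res_default <- droplevels(
  aggregate(value ~ habitat + cellcounts, data = df, FUN = mean)
)
levels(res_default$habitat)

# Workaround: replace the missing values with an ordinary, non-missing label
# ("none" is a hypothetical label chosen for this sketch)
df2 <- df
df2$cellcounts <- as.character(df2$cellcounts)
df2$cellcounts[is.na(df2$cellcounts)] <- "none"
df2$cellcounts <- factor(df2$cellcounts)

res_fixed <- aggregate(value ~ habitat + cellcounts, data = df2, FUN = mean)
res_fixed
```

Counting the affected rows first, for example with table(is.na(df$cellcounts), df$habitat), should show whether all "biofilm" rows are the ones being dropped.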

Data

df <- read.table(text="  time days treatment nucleic_acid habitat parallel disturbance  variable cellcounts  value 
    1 1 2 control   dna water  1   none  Proteobacteria  batch  0.000000000 
         2 2 22 control   dna water  1   none  Proteobacteria  batch  0.003586543 
         3 1 2 treated   dna water  1   none  Proteobacteria  batch  0.000000000 
         4 2 22 treated   dna biofilm  1   none  Proteobacteria  NA  0.000000000 

         ",header=T)

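Whether the text NA in a file ends up as a missing value or as an ordinary string depends on how the file is read. The sketch below uses a small made-up table (not the original file) to show how the na.strings argument of read.table changes this; whether the original data were ever loaded this way is only a guess.

```r
# Small made-up table; the second row contains the text NA
txt <- "habitat cellcounts value
water batch 0.1
biofilm NA 0.2"

# Default (na.strings = "NA"): the text NA becomes a real missing value
d1 <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
anyNA(d1$cellcounts)

# With na.strings = "" the text NA is kept as an ordinary factor level,
# so aggregate would not drop that row
d2 <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE,
                 na.strings = "")
anyNA(d2$cellcounts)
levels(d2$cellcounts)
```
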
Source

2017-08-16 14:52:23

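Hey Moody, that might actually be it... I suspected something similar, because I didn't change anything myself, but just opening and closing the file with R and Excel may have changed how R interprets it. I'll check next week whether it works and then accept your answer! Thanks – crazysantaclaus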

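Yes, opening an Excel file and saving it without changing anything can actually change things, at least dates, so it would not be surprising if it also messed up the NAs –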

Sorry to keep you waiting... but I have now had time to check your suggestion, and it works properly again. I'll keep that in mind ;-) – crazysantaclaus
