2017-08-16

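R: does the aggregate function cause the loss of factor levels in a column?

I just ran into a strange situation in RGui... I used the same script as always to get my data.frame into the right shape for ggplot2. My data looks like this: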

  time days treatment nucleic_acid habitat parallel disturbance       variable cellcounts       value
1    1    2   control          dna   water        1        none Proteobacteria      batch 0.000000000
2    2   22   control          dna   water        1        none Proteobacteria      batch 0.003586543
3    1    2   treated          dna   water        1        none Proteobacteria      batch 0.000000000
4    2   22   treated          dna biofilm        1        none Proteobacteria         NA 0.000000000

'data.frame': 185648 obs. of 10 variables: 
$ time  : int 5 5 5 5 5 5 6 6 6 6 ... 
$ days  : int 62 62 62 62 62 62 69 69 69 69 ... 
$ treatment : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ... 
$ parallel : int 1 2 3 1 2 3 1 2 3 1 ... 
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ... 
$ habitat  : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ... 
$ cellcounts : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ... 
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ... 
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ value  : num 0 0 0 0 0 0 0 0 0 0 ... 

I want to use aggregate to compute the mean over my 3 parallels (replicates):

df_mean<-aggregate(value~time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean) 

Afterwards, the level "biofilm" of the column "habitat" is gone.

df_mean<-droplevels(df_mean) 

str(df_mean) 
'data.frame': 44608 obs. of 9 variables: 
$ time  : int 1 2 1 2 1 2 1 2 1 2 ... 
$ days  : int 2 22 2 22 2 22 2 22 2 22 ... 
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ... 
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ... 
$ habitat  : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ... 
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ... 
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ... 
$ cellcounts : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ... 
$ value  : num 0 0.00359 0 0 0 ... 

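I spent a lot of time looking into this (in fact, I have only just realized that several other problems I have had recently all seem to be related to aggregate). When I remove the "cellcounts" column, it works. Interestingly, the "cellcounts" and "habitat" columns carry redundant information in the "biofilm" case: "biofilm" rows always have NA in "cellcounts". Is that the reason? It always worked before, though, so I cannot make sense of it. Has the base::aggregate function changed? Do you have an explanation for me? I am using R 3.4.0, along with the packages reshape, reshape2 and ggplot2.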

Thanks a lot, a confused crazysantaclaus

Answers

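The problem comes from the NAs. Maybe your file was loaded differently in the past, so that these values were stored as strings rather than as NA values? Here is a way to fix it by setting them to the string "NA":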

levels(df$cellcounts) <- c(levels(df$cellcounts),"NA") 
df$cellcounts[is.na(df$cellcounts)] <- "NA" 
df_mean <- aggregate(value ~ time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean,na.rm=TRUE) 
df_mean<-droplevels(df_mean) 
str(df_mean) 

'data.frame': 4 obs. of 9 variables: 
$ time  : int 1 2 1 2 
$ days  : int 2 22 2 22 
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2 
$ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1 
$ habitat  : Factor w/ 2 levels "biofilm","water": 2 2 2 1 
$ disturbance : Factor w/ 1 level "none": 1 1 1 1 
$ variable : Factor w/ 1 level "Proteobacteria": 1 1 1 1 
$ cellcounts : Factor w/ 2 levels "batch","NA": 1 1 1 2 
$ value  : num 0 0.00359 0 0 
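
To illustrate why the level disappears: the formula method of aggregate builds its model frame with na.action = na.omit by default, so rows with an NA in any formula variable, grouping variables included, are dropped before the means are computed. Below is a minimal sketch on a made-up data frame (toy and its values are illustrative assumptions, not the asker's data):

```r
# Toy data (hypothetical, not the asker's file): biofilm rows have NA in cellcounts
toy <- data.frame(
  habitat    = factor(c("water", "water", "biofilm", "biofilm")),
  cellcounts = factor(c("batch", "batch", NA, NA)),
  value      = c(1, 3, 5, 7)
)

# Count the missing values per column before aggregating
na_counts <- colSums(is.na(toy))

# With na.action = na.omit (the default of the formula method), every row
# with an NA in any variable of the formula is removed from the model frame
kept <- model.frame(value ~ habitat + cellcounts, data = toy, na.action = na.omit)

# Consequently only the water rows are aggregated
res_default <- aggregate(value ~ habitat + cellcounts, data = toy, FUN = mean)

# Leaving cellcounts out of the formula keeps the biofilm rows
res_no_cc <- aggregate(value ~ habitat, data = toy, FUN = mean)

# The answer's fix: turn NA into an ordinary level so na.omit ignores it
fixed <- toy
levels(fixed$cellcounts) <- c(levels(fixed$cellcounts), "NA")
fixed$cellcounts[is.na(fixed$cellcounts)] <- "NA"
res_fixed <- aggregate(value ~ habitat + cellcounts, data = fixed, FUN = mean)
```

This is also consistent with the asker's observation that removing cellcounts from the formula made the "biofilm" level reappear.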

Data

# stringsAsFactors = TRUE reproduces the factor columns shown above (no longer the default since R 4.0.0)
df <- read.table(text = "  time days treatment nucleic_acid habitat parallel disturbance  variable cellcounts  value
1 1 2 control   dna water  1   none  Proteobacteria  batch  0.000000000
2 2 22 control   dna water  1   none  Proteobacteria  batch  0.003586543
3 1 2 treated   dna water  1   none  Proteobacteria  batch  0.000000000
4 2 22 treated   dna biofilm  1   none  Proteobacteria  NA  0.000000000
", header = TRUE, stringsAsFactors = TRUE)

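Hey Moody, that could actually be it... I suspected something similar, because I had not changed anything myself; simply opening and closing the file with R and Excel may have changed how R interprets it. I will check next week whether it works and then accept your answer! Thanks – crazysantaclaus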

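Yes, opening an Excel file and saving it without changing anything can actually change things, dates at least, so it would not be surprising if it also messed up the NAs –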

Sorry for keeping you waiting... I have now had time to check your suggestion, and it works properly again. I will keep this in mind ;-) – crazysantaclaus
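
The explanation settled on in these comments, that the file may once have been read with NA as ordinary text, can be sketched with the na.strings argument of read.csv. The inline CSV below is made up for illustration, and whether the original file was ever read this way is an assumption:

```r
# Hypothetical CSV content standing in for the asker's file
csv_text <- "habitat,cellcounts,value
water,batch,1
water,batch,3
biofilm,NA,5
biofilm,NA,7"

# Default na.strings = "NA": the text NA is read as a missing value
d_default <- read.csv(text = csv_text, stringsAsFactors = TRUE)
n_missing <- sum(is.na(d_default$cellcounts))

# na.strings = "": only empty fields count as missing, so NA stays an ordinary level
d_string <- read.csv(text = csv_text, stringsAsFactors = TRUE, na.strings = "")
cc_levels <- levels(d_string$cellcounts)

# The same aggregate call keeps the biofilm group only in the second case
rows_default <- nrow(aggregate(value ~ habitat + cellcounts, data = d_default, FUN = mean))
rows_string  <- nrow(aggregate(value ~ habitat + cellcounts, data = d_string, FUN = mean))
```

Reading the file this way avoids patching the factor levels afterwards, at the cost of treating a literal NA as a real category in every column, so it only suits files where that is intended.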