2017-06-12

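Reduce the number of levels for each factor: dplyr approach

I am trying to reduce the number of levels of each factor variable in my data. I want to reduce the number of levels with 2 operations:

  1. If the number of levels is larger than a cut-off, replace the less frequent levels with a new level until the number of levels has reached the cut-off
  2. Replace levels in a factor that do not have enough observations with a new level

I wrote a function that works fine, but I don't like the code. It doesn't matter if the REMAIN level ends up without enough observations. I would prefer a dplyr approach.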

library(data.table)

ReplaceFactor <- function(data, max_levels, min_values_factor){
  # data must be a data.table (the := syntax below relies on it)
  # First make sure that there are not too many levels in a factor
  for(i in colnames(data)){
    if(is.factor(data[[i]])){
      if(nlevels(data[[i]]) > max_levels){
        # Keep the (max_levels - 1) most frequent levels, replace the rest
        levels_keep <- names(sort(table(data[[i]]), decreasing = TRUE))[1:(max_levels - 1)]
        data[!get(i) %in% levels_keep, (i) := "REMAIN"]
        # Rebuild the factor to drop the now-unused levels
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  # Now make sure that each level has enough observations
  for(i in colnames(data)){
    if(is.factor(data[[i]])){
      if(min(table(data[[i]])) < min_values_factor){
        levels_replace <- table(data[[i]])[table(data[[i]]) < min_values_factor]
        data[get(i) %in% names(levels_replace), (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  return(data)
}
# stringsAsFactors = TRUE is needed on R >= 4.0 to get factor columns
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E","E","E","E","E","E","E"),
                 stringsAsFactors = TRUE)
str(df) 
'data.frame': 9 obs. of 4 variables: 
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3 
$ B: int 1 2 3 4 5 6 7 8 9 
$ C: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 3 3 3 4 4 
$ D: Factor w/ 3 levels "A","B","E": 1 2 3 3 3 3 3 3 3 

dt2 <- ReplaceFactor(data = data.table(df), 
       max_levels = 3, 
       min_values_factor = 2) 
str(dt2) 
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables: 
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3 
$ B: int 1 2 3 4 5 6 7 8 9 
$ C: Factor w/ 3 levels "A","C","REMAIN": 1 1 3 3 2 2 2 3 3 
$ D: Factor w/ 2 levels "E","REMAIN": 2 2 1 1 1 1 1 1 1 
- attr(*, ".internal.selfref")=<externalptr> 
dt2 
   A B      C      D
1: A 1      A REMAIN
2: A 2      A REMAIN
3: B 3 REMAIN      E
4: B 4 REMAIN      E
5: C 5      C      E
6: C 6      C      E
7: C 7      C      E
8: C 8 REMAIN      E
9: C 9 REMAIN      E

I suggest you take a look at the 'forcats' package, which has good functions for this kind of task, e.g. http://forcats.tidyverse.org/reference/


'fct_lump' might be helpful

Answer

Using forcats

library(dplyr) 
library(forcats) 

max_levels <- 3 
min_values_factor <- 2 
df %>% 
    mutate_if(is.factor, fct_lump, n = max_levels, 
      other_level = "REMAIN", ties.method = "first") %>% 
    mutate_if(is.factor, fct_lump, prop = (min_values_factor - 1)/nrow(.), 
      other_level = "REMAIN") 

#   A B      C      D
# 1 A 1      A REMAIN
# 2 A 2      A REMAIN
# 3 B 3      B      E
# 4 B 4      B      E
# 5 C 5      C      E
# 6 C 6      C      E
# 7 C 7      C      E
# 8 C 8 REMAIN      E
# 9 C 9 REMAIN      E

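(Admittedly, I wasn't able to reproduce the exact behavior of your function, but you can probably get what you want by adjusting ties.method and subtracting 1 from max_levels.)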
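
The adjustment mentioned above could look roughly like the sketch below. It is not part of the original answer: it assumes a recent forcats (which provides fct_lump_n and fct_lump_min) and dplyr 1.0 or later (for across and where), and it swaps the prop trick for fct_lump_min, which lumps levels by absolute count. The name res is only illustrative.

```r
library(dplyr)
library(forcats)

# Same example data as in the question
# (stringsAsFactors = TRUE is needed on R >= 4.0 to get factor columns)
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E","E","E","E","E","E","E"),
                 stringsAsFactors = TRUE)

max_levels <- 3
min_values_factor <- 2

res <- df %>%
  # Step 1: keep the (max_levels - 1) most frequent levels and lump the
  # others into "REMAIN", so at most max_levels levels remain in total
  mutate(across(where(is.factor),
                ~ fct_lump_n(.x, n = max_levels - 1,
                             other_level = "REMAIN",
                             ties.method = "first"))) %>%
  # Step 2: lump levels with fewer than min_values_factor observations
  mutate(across(where(is.factor),
                ~ fct_lump_min(.x, min = min_values_factor,
                               other_level = "REMAIN")))

str(res)
```

Subtracting 1 from max_levels leaves room for the REMAIN level itself, which mirrors what ReplaceFactor does in its first loop. As far as I can tell, forcats leaves a factor untouched when only a single level would be lumped, so a factor that already has exactly max_levels levels should pass through step 1 unchanged; it is worth checking this against your installed version.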