2017-07-31

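Filtering multiple R data.table columns to remove outliers

I want to remove outliers that lie more than 2 standard deviations above or below the mean, for many variables with similar names (too many to specify individually in code).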

library(data.table) 

irisdt <- data.table(iris) 
myCols <- grep("Sepal", colnames(irisdt), value=TRUE) 

# This works if I specify one column, 
# but I have too many columns to specify, so need to use grep approach. 
irisdt[, Sepal.Length.Outlier := (scale(Sepal.Length) < -2 | scale(Sepal.Length) > 2)] 

# This does not work 
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(x) < -2 | scale(x) > 2)})] 

# This partially works, but changes in place 
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(irisdt[[x]]) < -2 | scale(irisdt[[x]]) > 2)})] 
# How do I make new variables, for example "Sepal.Length.Outlier"? 

myOutlierCols <- grep(".Outlier", colnames(irisdt), value=TRUE) 

# How do I select rows matching multiple columns (&)? 
irisdt[myOutlierCols=="FALSE"] # does not work 
irisdt[, hasOutlier := lapply(myCols, myCols==TRUE)] # does not work 
irisdt[hasOutlier=="FALSE"] # relies on line above, which doesn't work 

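Perhaps a function could take a data.table column and strip out the values above or below a z-score cutoff. That function could then be used with lapply.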

# This does not work 
removeOutliers <- function(myColumn, cutoff = 3) { 
    lapply(myColumn, function (x) { 
    if (scale(myColumn[[x]]) < -cutoff | scale(myColumn[[x]]) > cutoff) { 
     x <- NA #specify individual value instead of column? 
    } 
    }) 
} 
removeOutliers(irisdt[,Sepal.Length]) # for testing 
trimmedIrisdt <- irisdt[,lapply(.SD, removeOutliers(.SD)), .SDcols = myCols] # could do by = grouping variable 

# Once outliers are made NA, this would work: 
trimmedIrisdt <- complete.cases(trimmedIrisdt) 

Answers

2

I think this does what you are after:

irisdt[, keep := 
    as.logical(do.call(pmin, lapply(.SD, function(x) abs(scale(x)) <= 2))) 
, .SDcols = myCols] 

res = irisdt[(keep), !"keep"] 

     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
135:          6.7         3.0          5.2         2.3 virginica
136:          6.3         2.5          5.0         1.9 virginica
137:          6.5         3.0          5.2         2.0 virginica
138:          6.2         3.4          5.4         2.3 virginica
139:          5.9         3.0          5.1         1.8 virginica

This should also work if there is a grouping variable. I can't speak to whether it is statistically sound.
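For the grouped case, here is a minimal sketch. It assumes that Species is the grouping variable and that z-scores should be computed within each group rather than over the whole column; both are assumptions, since the question only mentions grouping in passing. The name res_by_group is made up for illustration, and as.vector() is added only to drop the one-column matrix that scale() returns.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# Same test as above, but scale() now sees one Species at a time,
# so each value is compared with its own group's mean and sd.
irisdt[, keep :=
    as.logical(do.call(pmin, lapply(.SD, function(x) abs(as.vector(scale(x))) <= 2)))
, by = Species, .SDcols = myCols]

# Keep the rows that passed in every Sepal column, then drop the helper column
res_by_group <- irisdt[(keep), !"keep"]
```

Because the mean and standard deviation differ by group, the rows kept here will generally not be the same rows kept by the ungrouped version.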


How it works:

  1. Test each cell with abs(scale(x)) <= 2.
  2. If the minimum result across the columns is TRUE, keep the row.
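The two steps above can be tried on a tiny hand-made example. This is only a sketch: the list flags and its values are invented for illustration. do.call(pmin, flags) is the same as calling pmin(flags[[1]], flags[[2]]), a row-wise (parallel) minimum. Since FALSE counts as 0 and TRUE as 1, the minimum is TRUE only when every column is TRUE, so it behaves like an element-wise AND.

```r
# Hypothetical per-column test results (TRUE = within 2 sd), one vector per column
flags <- list(c(TRUE, TRUE, FALSE), c(TRUE, FALSE, FALSE))

# do.call() passes each list element as a separate argument to pmin()
row_min <- as.logical(do.call(pmin, flags))

# An equivalent way to spell the row-wise AND
row_and <- Reduce(`&`, flags)

print(row_min)  # TRUE FALSE FALSE: only the first row passes in both columns
```

Reduce(`&`, ...) may read more directly as "all columns must pass"; the pmin form is the one used in the answer.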

To see how it works cell by cell...

library(data.table) 

mynewCols = paste0(myCols,"_outly") 
irisdt[, (mynewCols) := 
    lapply(.SD, function(x) replace(x, abs(scale(x)) <= 2, NA)) 
, .SDcols = myCols] 

Then browse with something like View(irisdt[rowSums(!is.na(irisdt[, ..mynewCols])) > 0])

+1

Thank you for the very concise, clear answer. This is much better than the way I was trying to do it!

+0

I tried to modify it to replace all values where abs(scale(x)) >= 2 with NA. This is my attempt (not working): irisdt[(myCols) := lapply(.SD, function(x) (if (as.logical(do.call(pmin, lapply(.SD, function(x) abs(scale(x)) <= 2)))) {NA} else {x})), .SDcols = myCols]

+0

And this doesn't work to replace cells either: irisdt[(myCols) := lapply(.SD, function(x) {if (abs(scale(x)) <= 2) {x} else {NA}}), .SDcols = myCols]. Could you explain do.call(pmin, ...)?
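On the cell-replacement attempts: if only looks at a single TRUE/FALSE, so it cannot choose values element by element down a column, whereas replace() (or ifelse()) is vectorised. A minimal sketch under that reading follows. The names na_outliers and trimmed are made up here, and the cutoff of 2 is taken from the question.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# Hypothetical helper: set values more than `cutoff` sd from the column mean to NA
na_outliers <- function(x, cutoff = 2) {
  z <- as.vector(scale(x))   # scale() returns a one-column matrix
  replace(x, abs(z) > cutoff, NA)
}

trimmed <- copy(irisdt)      # work on a copy so irisdt keeps its original values
trimmed[, (myCols) := lapply(.SD, na_outliers), .SDcols = myCols]

# Drop every row that now has an NA in one of the Sepal columns
trimmed <- trimmed[complete.cases(trimmed[, ..myCols])]
```

Adding by = Species to the := call would compute the z-scores within each group instead, if that is what is wanted.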