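Filter multiple R data.table columns to remove outliers
I want to remove outliers that lie more than 2 standard deviations above or below the mean, for many variables with similar names (too many to specify individually in code).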
library(data.table)
irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value=TRUE)
# This works if I specify one column,
# but I have too many columns to specify, so need to use grep approach.
irisdt[, Sepal.Length.Outlier := (scale(Sepal.Length) < -2 | scale(Sepal.Length) > 2)]
# This does not work
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(x) < -2 | scale(x) > 2)})]
# This partially works, but changes in place
irisdt[, (myCols) := lapply(myCols, function(x) {(scale(irisdt[[x]]) < -2 | scale(irisdt[[x]]) > 2)})]
# How do I make new variables, for example "Sepal.Length.Outlier"?
myOutlierCols <- grep(".Outlier", colnames(irisdt), value=TRUE)
# How do I select rows matching multiple columns (&)?
irisdt[myOutlierCols=="FALSE"] # does not work
irisdt[, hasOutlier := lapply(myCols, myCols==TRUE)] # does not work
irisdt[hasOutlier=="FALSE"] # relies on line above, which doesn't work
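One possible way to approach the two questions in the comments above (creating new ".Outlier" columns and keeping only rows with no flagged value) is sketched below. It assumes `.SD` with `.SDcols` is acceptable here; the names `outlierCols` and `cleanIrisdt` are made up for illustration and are not part of the original code.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# Illustrative naming scheme: append ".Outlier" to each source column name.
outlierCols <- paste0(myCols, ".Outlier")

# .SD holds the columns listed in .SDcols, so the function receives the column
# values. lapply(myCols, ...) passes the column *names* (strings) instead,
# which is why scale(x) failed in the attempt above.
# as.vector() drops the matrix attributes that scale() returns.
irisdt[, (outlierCols) := lapply(.SD, function(x) abs(as.vector(scale(x))) > 2),
       .SDcols = myCols]

# A row has an outlier if any of its flag columns is TRUE.
irisdt[, hasOutlier := Reduce(`|`, .SD), .SDcols = outlierCols]

# Keep only the rows where no Sepal column was flagged.
cleanIrisdt <- irisdt[hasOutlier == FALSE]
```

Because the flags go into new columns, the original Sepal values are left untouched.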
Perhaps a function could take a data.table column and strip out the values above or below a z-score cutoff. It could then be used with lapply.
# This does not work
removeOutliers <- function(myColumn, cutoff = 3) {
lapply(myColumn, function (x) {
if (scale(myColumn[[x]]) < -cutoff | scale(myColumn[[x]]) > cutoff) {
x <- NA #specify individual value instead of column?
}
})
}
removeOutliers(irisdt[,Sepal.Length]) # for testing
trimmedIrisdt <- irisdt[,lapply(.SD, removeOutliers(.SD)), .SDcols = myCols] # could do by = grouping variable
# Once outliers are made NA, this would work:
trimmedIrisdt <- trimmedIrisdt[complete.cases(trimmedIrisdt)] # complete.cases() returns a logical vector, so index with it
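A minimal sketch of how the `removeOutliers` idea might be made to work, assuming the goal is to set outlying values to NA and then drop incomplete rows. The main changes are that the function operates on the whole column at once (no inner lapply or `if`), and that the function itself is passed to lapply rather than the result of calling it on `.SD`.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# Takes one numeric column and returns it with every value whose z-score
# exceeds `cutoff` in absolute value replaced by NA.
removeOutliers <- function(myColumn, cutoff = 3) {
  z <- as.vector(scale(myColumn))          # z-scores as a plain vector
  myColumn[which(abs(z) > cutoff)] <- NA   # which() skips NA z-scores
  myColumn
}

# Pass the function itself; extra arguments such as cutoff follow it.
# A `by = ` grouping variable could be added here if needed.
trimmedIrisdt <- irisdt[, lapply(.SD, removeOutliers, cutoff = 2), .SDcols = myCols]

# complete.cases() returns a logical vector, so use it to index the rows.
trimmedIrisdt <- trimmedIrisdt[complete.cases(trimmedIrisdt)]
```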
Thank you for the very concise, clear answer. This is much better than the way I was trying to do it!
I tried to modify it to replace every value where abs(scale(x)) >= 2 with NA. This is my attempt (it does not work): irisdt[, (myCols) := lapply(.SD, function(x) (if (as.logical(do.call(pmin, lapply(.SD, function(x) abs(scale(x)) <= 2)))) {NA} else {x})), .SDcols = myCols]
And this does not work for replacing individual cells either: irisdt[, (myCols) := lapply(.SD, function(x) {if (abs(scale(x)) <= 2) {x} else {NA}}), .SDcols = myCols]. Can you explain do.call(pmin, ...)?
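On the two points raised in these comments, here is a hedged sketch. It assumes the expression in question has the form do.call(pmin, lapply(.SD, ...)) as quoted above. `do.call(pmin, lst)` calls `pmin()` with each element of the list as a separate argument, and `pmin()` takes the element-wise minimum. With TRUE/FALSE treated as 1/0, that minimum is 1 only when every column is TRUE for that row, so it acts as a row-wise AND. The cell-wise attempts most likely fail because `if` expects a single TRUE/FALSE while `abs(scale(x)) <= 2` yields one value per row; a vectorised `ifelse()` is one alternative. The names `within2`, `keepRow` and `naIrisdt` are illustrative only.

```r
library(data.table)

irisdt <- data.table(iris)
myCols <- grep("Sepal", colnames(irisdt), value = TRUE)

# One logical vector per column: TRUE where the value is within 2 SDs.
within2 <- lapply(irisdt[, myCols, with = FALSE],
                  function(x) abs(as.vector(scale(x))) <= 2)

# Same as pmin(within2[[1]], within2[[2]], ...): TRUE only for rows where
# every column is within range.
keepRow <- as.logical(do.call(pmin, within2))

# Row filtering: drop any row with at least one outlying value.
filteredIrisdt <- irisdt[keepRow]

# Cell-wise replacement: work on a copy so irisdt itself is not changed,
# and use ifelse() because the condition has one value per row.
naIrisdt <- copy(irisdt)
naIrisdt[, (myCols) := lapply(.SD, function(x)
  ifelse(abs(as.vector(scale(x))) <= 2, x, NA_real_)), .SDcols = myCols]
```

`Reduce(`&`, within2)` should give the same row filter as the `do.call(pmin, ...)` form and may be easier to read.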