Fast subsetting on many columns

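I have some code that identifies outliers in a data frame and then either removes or caps them. I am trying to speed up the removal step using an apply() function (or some other method).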

Example data

https://github.com/crossfitAL/so_ex_data/blob/master/subset 
# this is the contents of a csv file, you will need to load it into your R session. 

# set up an example decision-matrix 
# rm.mat is a {length(cols) x 4} matrix -- in this example 8 x 4 
# rm.mat[,1:2] - identify the values for min/max outliers, respectively. 
# rm.mat[,3:4] - identify if you wish to remove min/max outliers, respectively. 
cols <- c(1, 6:12) # specify the columns you wish to examine 
rm.mat <- matrix(nrow = length(cols), ncol= 4, 
       dimnames= list(names(fico2[cols]), 
       c("out.min", "out.max","rm outliers?", "rm outliers?"))) 

# add example decision criteria 
rm.mat[, 1] <- apply(fico2[, cols], 2, quantile, probs= .05) 
rm.mat[, 2] <- apply(fico2[, cols], 2, quantile, probs= .95) 
rm.mat[, 3] <- replicate(4, c(0,1)) 
rm.mat[, 4] <- replicate(4, c(1,0))

Here is my current subsetting code:

df2 <- fico2 # create a copy of the data frame 
cnt <- 1  # add a count variable 
for (i in cols) { 
# for each column of interest in the data frame. Determine if there are min/max 
# outliers that you wish to remove, remove them.   
    if (rm.mat[cnt, 3] == 1 & rm.mat[cnt, 4] == 1) { 
    # subset/remove min and max outliers 
    df2 <- df2[df2[, i] >= rm.mat[cnt, 1] & df2[, i] <= rm.mat[cnt, 2], ] 
    } else if (rm.mat[cnt, 3] == 1 & rm.mat[cnt, 4] == 0) { 
    # subset/remove min outliers 
    df2 <- df2[df2[, i] >= rm.mat[cnt, 1], ] 
    } else if (rm.mat[cnt, 3] == 0 & rm.mat[cnt, 4] == 1) { 
    # subset/remove max outliers 
    df2 <- df2[df2[, i] <= rm.mat[cnt, 2], ] 
    } 
    cnt <- cnt + 1 
}

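Proposed solution: I think I should be able to do this with an apply-type function, removing the for loop and vectorizing to speed up the code. The problem I am running into is that I want to apply a function if, and only if, the decision matrix says I should. That is, use the logical vectors rm.mat[,3] and rm.mat[,4] to determine whether the subsetting function "[" should be applied to the data frame df2.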
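One way to read the idea above is to turn each "do not remove" flag into an infinite bound, build one logical vector per column, and subset the data frame only once. The following is a minimal sketch under stated assumptions, not the asker's actual code: the data frame df is synthetic stand-in data (the original csv is not reproduced here), the flag columns are given the hypothetical names rm.min and rm.max, and the columns are assumed to contain no NA values.

```r
# Synthetic stand-in for fico2 (hypothetical data, not the original csv)
set.seed(1)
df   <- data.frame(a = rnorm(200), b = runif(200), c = rexp(200))
cols <- c(1, 3)                       # columns to examine

# Decision matrix in the same layout as rm.mat in the question
lims   <- sapply(df[cols], quantile, probs = c(.05, .95))
rm.mat <- cbind(out.min = lims[1, ], out.max = lims[2, ],
                rm.min  = c(1, 0),    rm.max  = c(1, 1))

# A disabled bound becomes -Inf / Inf, so the test is always TRUE for it
lower <- ifelse(rm.mat[, "rm.min"] == 1, rm.mat[, "out.min"], -Inf)
upper <- ifelse(rm.mat[, "rm.max"] == 1, rm.mat[, "out.max"],  Inf)

# One logical vector per column, combined with AND, then a single subset
keep.list <- Map(function(x, lo, hi) x >= lo & x <= hi, df[cols], lower, upper)
keep      <- Reduce(`&`, keep.list)
df.fast   <- df[keep, ]

# Reference result: the same rules applied one step at a time in a loop
df.loop <- df
for (k in seq_along(cols)) {
  i <- cols[k]
  if (rm.mat[k, "rm.min"] == 1) df.loop <- df.loop[df.loop[, i] >= rm.mat[k, "out.min"], ]
  if (rm.mat[k, "rm.max"] == 1) df.loop <- df.loop[df.loop[, i] <= rm.mat[k, "out.max"], ]
}
print(all.equal(df.fast, df.loop))
```

Because the thresholds are computed before any rows are dropped, filtering column by column and filtering once with the combined mask keep the same rows. The single subset avoids copying the data frame on every iteration.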

Any help would be greatly appreciated! Also, please let me know whether the example data and code are sufficient.

Source

2013-02-19 Alex W

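Alex, just a suggestion: I think it would be more helpful to post a sample of the data (or a simplified version of it) rather than how to clean it up. – 2013-02-19 19:25:50

@RicardoSaporta - That is not my actual data. It is some sample data from a Coursera class. My own data is large and obscure. I thought this would be simpler. – 2013-02-19 19:28:16

@Alex, I second RicardoSaporta's suggestion. It would be better if you focused on your problem without so much introduction. I am on my third attempt at reading this! There are no comments in your code. You expect people to look at the code and understand it... I don't think many people will try to answer. – Arun 2013-02-19 19:42:44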

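Here is a solution. It is only meant to clarify your code. Hopefully others can use it to provide a better solution.

So, if I understand correctly, you have a decision matrix that looks like this: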

rm.mat
                                      c1 c2 c3 c4
amount.funded.by.investors     27925.000 NA  0  1
monthly.income                 11666.670 NA  1  0
open.credit.lines                 18.000 NA  0  1
revolving.credit.balance       40788.750 NA  1  0
inquiries.in.the.last.6.months     3.000 NA  0  1
debt.to.inc                       28.299 NA  1  0
int.rate                          20.490 NA  0  1
fico.num                         775.000 NA  1  0

and you are trying to filter the values of a large matrix according to this matrix:

colnames(rm.mat) <- paste('c',1:4,sep='')  
rm.mat <- as.data.frame(rm.mat) 
apply(rm.mat,1,function(y){ 
    h <- paste(y['c3'],y['c4'],sep='') 
    switch(h, 
      '11'= apply(df2,2, function(x) 
           df2[x >= y['c1'] & x <= y['c2'],]), ## we never have this!! 
      '10'= apply(df2,2, function(x) 
           df2[x >= y['c1'] , ]), ## here we apply by columns! 
      '01'= apply(df2,2,function(x) 
           df2[x <= y['c2'], ])) ## c2 is NA!! so !!! 
} 
)
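For comparison, here is a minimal sketch of how the switch() idea could produce one filtered data frame, with each matrix row tested only against its own column. It is an illustration under assumptions, not the code above corrected line by line: keep_mask is a hypothetical helper, the data frame and thresholds are made up, both bounds c1 and c2 are assumed to be filled in (no NA), and the row names of rm.mat are assumed to match column names of df2.

```r
# Hypothetical helper: logical keep-mask for one column x, given one row y
# of the decision matrix (named c1..c4 as in the matrix above)
keep_mask <- function(x, y) {
  h <- paste0(y[["c3"]], y[["c4"]])
  switch(h,
         "11" = x >= y[["c1"]] & x <= y[["c2"]],  # drop min and max outliers
         "10" = x >= y[["c1"]],                   # drop min outliers only
         "01" = x <= y[["c2"]],                   # drop max outliers only
         rep(TRUE, length(x)))                    # "00": keep every row
}

# Made-up data and thresholds, for illustration only
set.seed(42)
df2    <- data.frame(u = rnorm(100), v = runif(100))
rm.mat <- rbind(u = c(c1 = -1,  c2 = 1,   c3 = 1, c4 = 1),
                v = c(c1 = 0.1, c2 = 0.9, c3 = 0, c4 = 1))

# One mask per row of the decision matrix, combined with AND
masks     <- lapply(rownames(rm.mat),
                    function(nm) keep_mask(df2[[nm]], rm.mat[nm, ]))
df2.clean <- df2[Reduce(`&`, masks), ]
```

Returning a logical mask from each switch() branch, instead of a subsetted data frame, lets the masks be combined and the data frame be subset a single time.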

Source

2013-02-19 21:08:34 agstudy

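Your 'rm.mat' is not what I have: 'rm.mat[, 1] <- apply(fico2[, cols], 2, quantile, probs = .05); rm.mat[, 2] <- apply(fico2[, cols], 2, quantile, probs = .95)'. I think your solution might work... but I will have to come back to it tonight. I am not familiar with using 'switch()'. Thanks! – 2013-02-19 21:28:30

@Alex My solution is really meant to show you that, before thinking about performance, you have to write CLEAN and READABLE code! Once you have that, you can tune it! switch really is a handy function when you have many conditions. – agstudy 2013-02-19 21:31:32

I agree with you. A) I am still improving my R coding; but B) I think this was more a matter of trying to extract a short and relatively simple example from a function with 200 lines of code... That said, I really appreciate that you and the others on this post, despite the frustration, still took the time to work through this. – 2013-02-19 21:48:45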
