2017-04-20

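(I have a feeling I'm going to feel very stupid once I get an answer, but I just can't figure this out.) In R, how do I perform an operation on a specific subset of a data.frame?

I have a data.frame with an empty column at the end. It will be mostly filled with NA, but I want to fill some of its rows with a value. This column represents a guess at data that is missing from another column of the data.frame.

My initial data.frame looks like this: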

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        | median(df$MaxPlayers[df$MinPlayers ==3,])
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        | median(df$MaxPlayers[df$MinPlayers ==2,])
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |

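Note that two rows have "N/A" for MaxPlayers. What I'm trying to do is use the information I do have to guess what MaxPlayers might be. If the median(MaxPlayers) of the 3-player games is 6, then MaxPlayersGuess should equal 6 for games with MinPlayers == 3 and MaxPlayers == N/A. (In the table above I tried to represent, in code, what value MaxPlayersGuess should get in this example.)

The resulting data.frame should look like this: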

Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A    | 6      | 3          | 6          |
B    | 7      | 3          | 7          |
C    | 6.5    | 3          | N/A        | 6
D    | 7      | 3          | 6          |
E    | 7      | 3          | 5          |
F    | 9.5    | 2          | 5          |
G    | 6      | 2          | 4          |
H    | 7      | 2          | 4          |
I    | 6.5    | 2          | N/A        | 4
J    | 7      | 2          | 2          |
K    | 7      | 2          | 4          |

Sharing the result of one attempt:

gld$MaxPlayersGuess <- ifelse(is.na(gld$MaxPlayers), median(gld$MaxPlayers[gld$MinPlayers,]), NA) 


Error in gld$MaxPlayers[gld$MinPlayers, ] : 
incorrect number of dimensions 

Answers

2

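Updated in response to the posted example.

Here's my take: sometimes it's easier to compute what you want first and then grab it when you need it, rather than working through all these logical contortions. You're trying to come up with a way to compute everything at once, and that's what makes it confusing, so break it into steps. You need to know the median of "MaxPlayer" for each possible "MinPlayer" group. Then you want to use that value whenever MaxPlayer is missing. So here's a simple way to do that.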

#generate fake data 
MinPlayer <- rep(3:2, each = 4) 
MaxPlayer <- rep(2:5, each = 2, times = 2) 

df <- data.frame(MinPlayer, MaxPlayer) 

#replace some values of MaxPlayer with NA 
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer) 

####STARTING DATA 
# > df 
# MinPlayer MaxPlayer 
# 1   3   2 
# 2   3   2 
# 3   3  NA 
# 4   3  NA 
# 5   2   4 
# 6   2   4 
# 7   2   5 
# 8   2   5 
# 9   3   2 
# 10   3   2 
# 11   3  NA 
# 12   3  NA 
# 13   2   4 
# 14   2   4 
# 15   2   5 
# 16   2   5 

####STEP 1 
#find the median of MaxPlayer for each group of MinPlayer (e.g., when MinPlayer == 1, 2 or whatever) 
#just add a column to the data frame that has the right median value for each subset of MinPlayer in it and grab that value to use later. 
library(plyr) #plyr is a great way to compute things across data subsets 
df <- ddply(df, c("MinPlayer"), transform, 
      median.minp = median(MaxPlayer, na.rm = TRUE)) #ignore NAs in the median 

####STEP 2 
#anytime that MaxPlayer == NA, grab the median value to replace the NA, otherwise keep the MaxPlayer value 
df$MaxPlayer <- ifelse(is.na(df$MaxPlayer), df$median.minp, df$MaxPlayer) 

####STEP 3 
#you had to compute an extra column you don't really want, so drop it now that you're done with it 
df <- df[ , !(names(df) %in% "median.minp")] 

####RESULT 
# > df 
# MinPlayer MaxPlayer 
# 1   2   4 
# 2   2   4 
# 3   2   5 
# 4   2   5 
# 5   2   4 
# 6   2   4 
# 7   2   5 
# 8   2   5 
# 9   3   2 
# 10   3   2 
# 11   3   2 
# 12   3   2 
# 13   3   2 
# 14   3   2 
# 15   3   2 
# 16   3   2 
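
For what it's worth, the same three steps can also be done without plyr. The following is only a minimal sketch in base R, using `ave()`, which applies a function within groups and returns a vector lined up with the original rows (so, unlike the ddply result above, the row order is not changed). The helper name `fill_median` is made up for this illustration.

```r
# Same fake data as above
MinPlayer <- rep(3:2, each = 4)
MaxPlayer <- rep(2:5, each = 2, times = 2)
df <- data.frame(MinPlayer, MaxPlayer)
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer)

# Hypothetical helper: replace the NAs in x with the median of its non-missing values
fill_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# ave() runs fill_median separately within each MinPlayer group
df$MaxPlayer <- ave(df$MaxPlayer, df$MinPlayer, FUN = fill_median)
```

Because nothing is merged back in, there is no extra median column to drop afterwards.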

Old answer below....

Please post a reproducible example!

#fake data 
this <- rep(1:2, each = 1, times = 2) 
that <- rep(3:2, each = 1, times = 2) 

df <- data.frame(this, that) 

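If you're just asking about basic indexing, for example finding the values that meet some condition, this returns the row indices of the values that match the condition (look up ?which):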

> which(df$this < df$that) 
[1] 1 3 

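This returns the values in the rows that meet the condition rather than the row indices. You just use the row indices returned by "which" to look up the corresponding values in the right column of the data frame (here, "this"):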

> df[which(df$this < df$that), "this"] 
[1] 1 1 

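If you want to apply some calculation when "this" is less than "that" and add the result to your data frame as a new column, just use "ifelse". Alternatively, create a logical vector that marks where the condition holds, then assign only to the elements where it is TRUE.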

#if "this" is < "that", multiply by 2 
df$result <- ifelse(df$this < df$that, df$this * 2, NA) 

> df
  this that result
1    1    3      2
2    2    2     NA
3    1    3      2
4    2    2     NA
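
The logical-vector route mentioned above isn't shown in the answer, so here is a rough sketch of what it might look like. It assumes the same fake data; the name `cond` is just an illustrative choice.

```r
# Same fake data as above
this <- rep(1:2, times = 2)
that <- rep(3:2, times = 2)
df <- data.frame(this, that)

# Logical vector: TRUE where the condition holds
cond <- df$this < df$that

# Start with an all-NA column, then fill only the rows where cond is TRUE
df$result <- NA_real_
df$result[cond] <- df$this[cond] * 2
```

This should end up with the same "result" column as the ifelse version; it is mostly a matter of taste which one reads better.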

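Without a reproducible example, I can't offer anything more specific.

Apologies. Since I didn't know how to even begin coding this, I didn't know how to provide a reproducible example. – Zelbinian

Thanks for trying to answer. By trying some of your suggestions, I was able to see the problem more clearly and work out how to post an example. – Zelbinian

@Zelbinian, so normally you would mark griffmer's as the answer. – Chris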

0

I think you already have everything you need in @griffmer's answer. But a less elegant, though perhaps more intuitive, approach might be a loop:

## Your data: 
df <- data.frame(
     Game = LETTERS[1:11], 
     Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7), 
     MinPlayers = c(rep(3,5), rep(2,6)), 
     MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)  
) 

## Loop over rows:
df$MaxPlayersGuess <- vapply(seq_len(nrow(df)), function(ii) {
    if (is.na(df$MaxPlayers[ii])) {
        ## Missing: use the median of the games with the same MinPlayers
        median(df$MaxPlayers[df$MinPlayers == df$MinPlayers[ii]],
               na.rm = TRUE)
    } else {
        ## Present: keep the known value
        df$MaxPlayers[ii]
    }
}, numeric(1))

This gives you:

df
#    Game Rating MinPlayers MaxPlayers MaxPlayersGuess
# 1     A    6.0          3          6               6
# 2     B    7.0          3          7               7
# 3     C    6.5          3         NA               6
# 4     D    7.0          3          6               6
# 5     E    7.0          3          5               5
# 6     F    9.5          2          5               5
# 7     G    6.0          2          4               4
# 8     H    7.0          2          4               4
# 9     I    6.5          2         NA               4
# 10    J    7.0          2          2               2
# 11    K    7.0          2          4               4

0

If you want to use dplyr, you can try:

Input:

df <- data.frame(
    Game = LETTERS[1:11], 
    Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7), 
    MinPlayers = c(rep(3,5), rep(2,6)), 
    MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)  
) 

Process:

library(dplyr) # provides %>%, group_by(), mutate() and if_else()

df %>%
    group_by(MinPlayers) %>%
    mutate(MaxPlayers = if_else(is.na(MaxPlayers), median(MaxPlayers, na.rm = TRUE), MaxPlayers))

This groups the data by MinPlayers, then assigns the group's median MaxPlayers to the rows where MaxPlayers is missing.

Output:

Source: local data frame [11 x 4]
Groups: MinPlayers [2]

     Game Rating MinPlayers MaxPlayers
   <fctr>  <dbl>      <dbl>      <dbl>
1       A    6.0          3          6
2       B    7.0          3          7
3       C    6.5          3          6
4       D    7.0          3          6
5       E    7.0          3          5
6       F    9.5          2          5
7       G    6.0          2          4
8       H    7.0          2          4
9       I    6.5          2          4
10      J    7.0          2          2
11      K    7.0          2          4
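
The answers above either overwrite MaxPlayers or fill the guess for every row, whereas the question asked for a MaxPlayersGuess column that stays NA except where MaxPlayers is missing. One possible way to get exactly that, sketched here in base R with `ave()`, is shown below. The intermediate name `group_median` is an assumption made for this illustration.

```r
df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6, 7, 6.5, 7, 7, 9.5, 6, 7, 6.5, 7, 7),
  MinPlayers = c(rep(3, 5), rep(2, 6)),
  MaxPlayers = c(6, 7, NA, 6, 5, 5, 4, 4, NA, 2, 4)
)

# Median of MaxPlayers within each MinPlayers group, repeated for every row
group_median <- ave(df$MaxPlayers, df$MinPlayers,
                    FUN = function(x) median(x, na.rm = TRUE))

# Keep the guess only where MaxPlayers is missing; NA everywhere else
df$MaxPlayersGuess <- ifelse(is.na(df$MaxPlayers), group_median, NA)
```

As for the "incorrect number of dimensions" error in the question: `gld$MaxPlayers` is a plain vector, so it takes a single subscript with no comma. The `[rows, ]` form only applies to two-dimensional objects such as the data frame itself.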