2016-11-14
0

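Create groups with dplyr's "group_by", then use stringr to find differences between groups

Using the example below, I want to group the data frame by CaseWorker, then by Client, and then determine for each Client group whether the list of tasks in "Task" is the same as the list in "Task2".

I'd be happy with a simple TRUE or FALSE, or better still, with every task that is in "Task2" but not in "Task" extracted and shown in a new column or data frame.

So basically I need to make sure that "Task" and "Task2" contain the same entries for each client.

If possible, I'd like to stick with dplyr and stringr, or at least stay within the tidyverse. I suspect there is an elegant way to do this using "group_by" and "str_detect" or some other stringr function.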

CaseWorker<-c("John","John","John","John","John","John","Melanie","Melanie","Melanie","Melanie","Melanie","Melanie") 
Client<-c("Chris","Chris","Chris","Tom","Tom","Tom","Valerie","Valerie","Valerie","Tim","Tim","Tim") 
Task<-c("Feed cat","Make dinner","Iron shirt","Make dinner","Do homework","Make lunch","Make dinner","Feed cat","Buy groceries","Do homework","Iron shirt","Make lunch") 
Task2<-c("Feed cat","Make dinner","Iron shirt","Make dinner","Do homework","Feed cat","Make dinner","Feed cat","Iron shirt","Do homework","Iron shirt","Make lunch") 
Df<-data.frame(CaseWorker,Client,Task,Task2) 

Answers

2

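See if this is what you're after.

First, check whether Task matches Task2 on each row. If it doesn't, return Task2 as a new variable. I store the result in a new data frame, df2.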

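library(dplyr)

# match: TRUE when Task and Task2 agree on a row; non_match: the Task2 value otherwise
df2 <- Df %>%
    mutate(match = Task == Task2,
      non_match = ifelse(!match, Task2, ""))
df2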

#    CaseWorker  Client          Task       Task2 match  non_match
#  1       John   Chris      Feed cat    Feed cat  TRUE
#  2       John   Chris   Make dinner Make dinner  TRUE
#  3       John   Chris    Iron shirt  Iron shirt  TRUE
#  4       John     Tom   Make dinner Make dinner  TRUE
#  5       John     Tom   Do homework Do homework  TRUE
#  6       John     Tom    Make lunch    Feed cat FALSE   Feed cat
#  7    Melanie Valerie   Make dinner Make dinner  TRUE
#  8    Melanie Valerie      Feed cat    Feed cat  TRUE
#  9    Melanie Valerie Buy groceries  Iron shirt FALSE Iron shirt
# 10    Melanie     Tim   Do homework Do homework  TRUE
# 11    Melanie     Tim    Iron shirt  Iron shirt  TRUE
# 12    Melanie     Tim    Make lunch  Make lunch  TRUE

Then summarise the result to see whether all entries match for each CaseWorker/Client pair.

df2 %>% 
    group_by(CaseWorker, Client) %>% 
    summarise(n = n(), 
      matches = sum(match), 
      all_match = n == matches) 

#   CaseWorker  Client     n matches all_match
#        <chr>   <chr> <int>   <int>     <lgl>
# 1       John   Chris     3       3      TRUE
# 2       John     Tom     3       2     FALSE
# 3    Melanie     Tim     3       3      TRUE
# 4    Melanie Valerie     3       2     FALSE

If you need the all_match variable in your original data set, you can of course merge this back into your data frame.
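One way that merge could look is sketched below. It is only an illustration: the names group_summary and Df_flagged are made up for this example, the data frame is rebuilt with stringsAsFactors = FALSE so the columns are plain character vectors, and the per-group flag is computed with all() rather than by counting matches.

```r
library(dplyr)

# Question data, rebuilt with character columns (assumption: no factors)
Df <- data.frame(
  CaseWorker = rep(c("John", "Melanie"), each = 6),
  Client     = rep(c("Chris", "Tom", "Valerie", "Tim"), each = 3),
  Task  = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Make lunch",
            "Make dinner", "Feed cat", "Buy groceries",
            "Do homework", "Iron shirt", "Make lunch"),
  Task2 = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Feed cat",
            "Make dinner", "Feed cat", "Iron shirt",
            "Do homework", "Iron shirt", "Make lunch"),
  stringsAsFactors = FALSE
)

# One row per CaseWorker/Client pair: TRUE if every row of the pair matches
# (group_summary is a hypothetical name)
group_summary <- Df %>%
  group_by(CaseWorker, Client) %>%
  summarise(all_match = all(Task == Task2)) %>%
  ungroup()

# Attach the per-group flag to every original row
# (Df_flagged is a hypothetical name)
Df_flagged <- left_join(Df, group_summary, by = c("CaseWorker", "Client"))
Df_flagged
```

left_join keeps the rows of Df in their original order, so each row simply gains the flag of the group it belongs to.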

+0

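Thanks for your answer! I've posted a "Part 2" of this question, a more complex but similar problem, in case you're interested in answering that one too. It is posted under the same question title, but with "Part 2" at the start. – Mike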

1

You can do this simply with dplyr, making use of %in%:

Df %>% 
    group_by(CaseWorker,Client) %>% 
    mutate(Check = Task %in% Task2) 
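If a per-group verdict is wanted instead of a per-row flag, the same grouping can be combined with the set functions setequal() and setdiff(). The sketch below is one possible reading of the question, not part of the answer above; the names task_check, same_tasks and only_in_task2 are invented for illustration, and the data is rebuilt with character columns.

```r
library(dplyr)

# Question data, rebuilt with character columns (assumption: no factors)
Df <- data.frame(
  CaseWorker = rep(c("John", "Melanie"), each = 6),
  Client     = rep(c("Chris", "Tom", "Valerie", "Tim"), each = 3),
  Task  = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Make lunch",
            "Make dinner", "Feed cat", "Buy groceries",
            "Do homework", "Iron shirt", "Make lunch"),
  Task2 = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Feed cat",
            "Make dinner", "Feed cat", "Iron shirt",
            "Do homework", "Iron shirt", "Make lunch"),
  stringsAsFactors = FALSE
)

# task_check, same_tasks and only_in_task2 are hypothetical names
task_check <- Df %>%
  group_by(CaseWorker, Client) %>%
  summarise(
    # TRUE when both columns hold the same set of tasks, in any order
    same_tasks = setequal(Task, Task2),
    # tasks listed in Task2 that never appear in Task for this client
    only_in_task2 = paste(setdiff(Task2, Task), collapse = ", ")
  ) %>%
  ungroup()
task_check
```

Unlike a row-by-row comparison, this ignores the order of tasks within a client, so two lists containing the same tasks in a different order still count as equal.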

This relies on exact, case-sensitive matching. If that is a concern, you can use the following instead:

Df %>% 
    group_by(CaseWorker,Client) %>% 
    rowwise() %>% 
    mutate(Check = grepl(Task, Task2, ignore.case = TRUE)) 

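But you have to call rowwise before mutate, because grepl is vectorised only over the text being searched: given a vector of patterns it uses just the first element. Note that rowwise replaces the grouping, so this compares Task with Task2 on the same row instead of within the whole client group.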
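If the goal is to keep the grouped %in% check but ignore case, lower-casing both columns first is one option that avoids rowwise altogether. This is a sketch on a small made-up sample (sample_df and checked are hypothetical names, and the mixed-case values are invented to show the effect), not the approach from the answer above.

```r
library(dplyr)

# Hypothetical sample with differences in letter case, for illustration only
sample_df <- data.frame(
  CaseWorker = c("John", "John", "John"),
  Client     = c("Tom", "Tom", "Tom"),
  Task       = c("Make dinner", "Do homework", "Make lunch"),
  Task2      = c("make dinner", "Feed cat", "Do Homework"),
  stringsAsFactors = FALSE
)

checked <- sample_df %>%
  group_by(CaseWorker, Client) %>%
  # compare lower-cased copies so "Do homework" matches "Do Homework"
  mutate(Check = tolower(Task) %in% tolower(Task2)) %>%
  ungroup()
checked
```

Here Check is evaluated against every Task2 value in the client's group, so "Do homework" is found even though its counterpart sits on a different row.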

0

If you'd like to use the stringr package, the following may also work for you.

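library(dplyr)
library(stringr)

# str_detect pairs elements up, so each Task is tested against the Task2 on the same row
Df %>%
    group_by(CaseWorker,Client) %>%
    mutate(Check=str_detect(as.character(Task),as.character(Task2)))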
0

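Maybe I'm just misunderstanding the question, but I think you may be overcomplicating this, if all you want is the records where Task doesn't match Task2.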

> Df[which(Df$Task != Df$Task2),] 

=== ========== ======= ============= ==========
\   CaseWorker Client  Task          Task2
=== ========== ======= ============= ==========
6   John       Tom     Make lunch    Feed cat
9   Melanie    Valerie Buy groceries Iron shirt
=== ========== ======= ============= ==========