2016-09-15
1

Delete rows of a data frame in R based on matches across multiple rows

I have a data frame that looks like this:

content                                                ChatPosition
This is a start line                                   START
This is a middle line                                  MIDDLE
This is a middle line                                  MIDDLE
This is the last line                                  END
This is a start line with a subsequent middle or end   START
This is another start line without a middle or an end  START
This is a start line                                   START
This is a middle line                                  MIDDLE
This is the last line                                  END

content <- c("This is a start line" , "This is a middle line" , "This is a  middle line" ,"This is the last line" , 
     "This is a start line with a subsequent middle or end" , "This is  another start line without a middle or an end" , 
     "This is a start line" , "This is a middle line" , "This is the last line") 
ChatPosition <- c("START" , "MIDDLE" , "MIDDLE" , "END" , "START" ,"START" , "START" ,"MIDDLE" , "END") 
df <- data.frame(content, ChatPosition) 

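I want to delete every row that contains START, but only if the next row does not contain MIDDLE or END in the ChatPosition column. The result should look like this: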

content                ChatPosition
This is a start line   START
This is a middle line  MIDDLE
This is a middle line  MIDDLE
This is the last line  END
This is a start line   START
This is a middle line  MIDDLE
This is the last line  END

# Loop over every row except the last, which has no following row to compare.
for (jjj in seq_len(nrow(df) - 1)) {
    # Check for a START that is immediately followed by another START.
    if (df$ChatPosition[jjj] == "START" && df$ChatPosition[jjj + 1] == "START") {
        print(df$content[jjj])
    }
}

With the code above I can print the two rows I want to delete. What would be the most elegant way to delete these rows?

Also, is a nested for/if approach the right way to go here, or is there a library that makes this type of thing easier?

Regards, Jonathan

Answers

2

This should work for you.

df[!(as.character(df$ChatPosition) == "START" & 
    c(tail(as.character(df$ChatPosition), -1), "END") == "START"), ] 

                 content ChatPosition
1   This is a start line        START
2  This is a middle line       MIDDLE
3 This is a  middle line       MIDDLE
4  This is the last line          END
7   This is a start line        START
8  This is a middle line       MIDDLE
9  This is the last line          END

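The first argument to [] is a logical vector that tells R which rows to keep. I use tail(, -1) to get the next observation of df$ChatPosition for the comparison. Note that when df$ChatPosition is a factor (the default for string columns in data.frame() before R 4.0.0), it has to be converted to character in the second line so that "END" can be appended at the final position.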
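To make the one-liner easier to follow, the same logic can be spelled out with named intermediate vectors. This is only a sketch on the example data from the question: the names `pos`, `next_pos`, `drop_row` and `result` are illustrative and do not come from the answer, and `stringsAsFactors = TRUE` is set explicitly to reproduce the factor column the answer refers to.

```r
# Example data from the question, built with factor columns on purpose.
content <- c("This is a start line", "This is a middle line", "This is a  middle line",
             "This is the last line",
             "This is a start line with a subsequent middle or end",
             "This is  another start line without a middle or an end",
             "This is a start line", "This is a middle line", "This is the last line")
ChatPosition <- c("START", "MIDDLE", "MIDDLE", "END", "START", "START",
                  "START", "MIDDLE", "END")
df <- data.frame(content, ChatPosition, stringsAsFactors = TRUE)

# Marker of every row, as plain character strings.
pos <- as.character(df$ChatPosition)

# Marker of the following row: drop the first element and pad the end with
# "END" so the last row is never treated as "START followed by START".
next_pos <- c(tail(pos, -1), "END")

# Rows to drop: a START whose next row is another START.
drop_row <- pos == "START" & next_pos == "START"

result <- df[!drop_row, ]
which(drop_row)
# [1] 5 6
```

A stricter reading of the question ("the next row is neither MIDDLE nor END") could be written as `pos == "START" & !(next_pos %in% c("MIDDLE", "END"))`, which selects the same two rows for this data.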


Thanks Imo, that is a very nice and elegant (very R) solution. I have tried it and it gives the desired result. Many thanks.

3

Use grep. You can compare the speed of this solution against the loop on your real dataset.

# Row numbers of all START and END markers
start_indices = grep("START", df$ChatPosition)
end_indices = grep("END", df$ChatPosition)

# For each END, find the closest START that precedes it
match_indices = sapply(end_indices, function(x) tail(start_indices[(start_indices - x) < 0], 1))
match_indices
# [1] 1 7
# STARTs that are not matched to any END are the rows to delete
del_indices = setdiff(start_indices, match_indices)
del_indices
# [1] 5 6
DF_subset = df[-del_indices, ]
DF_subset
#                  content ChatPosition
# 1   This is a start line        START
# 2  This is a middle line       MIDDLE
# 3 This is a  middle line       MIDDLE
# 4  This is the last line          END
# 7   This is a start line        START
# 8  This is a middle line       MIDDLE
# 9  This is the last line          END
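
One thing to watch with negative indexing: if there is no unmatched START, `del_indices` is empty, and indexing with `-integer(0)` returns zero rows instead of the full data frame. Below is a minimal sketch of a guard. The small data frame `clean` is made up for the illustration and `drop_rows` is a hypothetical helper name; neither comes from the answer.

```r
# Hypothetical helper: remove the rows listed in `idx`, keeping everything
# when `idx` is empty.
drop_rows <- function(data, idx) {
  # A logical mask avoids the empty negative-index pitfall.
  data[!(seq_len(nrow(data)) %in% idx), , drop = FALSE]
}

# Made-up data in which every START is matched by an END.
clean <- data.frame(content = c("a", "b", "c"),
                    ChatPosition = c("START", "MIDDLE", "END"),
                    stringsAsFactors = FALSE)

start_indices <- grep("START", clean$ChatPosition)
end_indices <- grep("END", clean$ChatPosition)
match_indices <- sapply(end_indices,
                        function(x) tail(start_indices[start_indices < x], 1))
del_indices <- setdiff(start_indices, match_indices)

length(del_indices)                  # 0: nothing to delete
nrow(clean[-del_indices, ])          # 0: negative indexing drops every row
nrow(drop_rows(clean, del_indices))  # 3: the mask keeps every row
```

The logical mask costs one extra vector of length `nrow(data)`, which is usually negligible next to the data frame itself.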

Thanks Osssan, this is also a useful solution using grep. Cheers

1

Another option:

library(dplyr) 
filter(df, !(ChatPosition == "START" & lead(ChatPosition) == "START")) 

which gives:

#                 content ChatPosition
#1   This is a start line        START
#2  This is a middle line       MIDDLE
#3 This is a  middle line       MIDDLE
#4  This is the last line          END
#5   This is a start line        START
#6  This is a middle line       MIDDLE
#7  This is the last line          END
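
`lead()` shifts the column up by one position and fills the last slot with `NA`. The sketch below imitates that shift in base R so that the edge case at the final row is visible. `shift_up` is a hypothetical stand-in written for this illustration, not a dplyr function, and the three-element vector is made up.

```r
# Hypothetical base R stand-in for a one-step lead: shift values up by one
# and fill the last position with `fill`.
shift_up <- function(x, fill = NA) {
  c(x[-1], fill)
}

# Made-up markers that end on an unmatched START.
pos <- c("START", "END", "START")

# With NA padding, the comparison for the last row is NA rather than FALSE.
drop_na <- pos == "START" & shift_up(pos) == "START"
drop_na
# [1] FALSE FALSE    NA

# Padding with a real marker gives a definite answer for the last row.
drop_end <- pos == "START" & shift_up(pos, fill = "END") == "START"
drop_end
# [1] FALSE FALSE FALSE
```

In dplyr, `filter()` drops rows whose condition evaluates to `NA`, so a START in the very last row would be removed by the call above. `lead()` accepts a `default` argument that plays the role of `fill` here, should that row need to be kept.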