2015-03-13

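Remove multiple patterns from a text vector in R

I want to remove multiple patterns from multiple character vectors. Currently I am doing this: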

a.vector <- gsub("@\\w+", "", a.vector) 
a.vector <- gsub("http\\w+", "", a.vector) 
a.vector <- gsub("[[:punct:]]", "", a.vector)

and so on.

This is painful. I looked at this question and answer: "R: gsub, pattern = vector and replacement = vector", but it does not solve my problem.

Neither mapply nor mgsub worked. I created these vectors:

remove <- c("@\\w+", "http\\w+", "[[:punct:]]") 
substitute <- c("") 

Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.

a.vector looks like this:

[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"                                                                                                                                             
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg" 

I want:

[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"

Answers

1

Try combining your sub-patterns with |. For example:

> s <- "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

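However, this can become a problem if you have a large number of patterns, or if the result of applying one pattern creates matches for another.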
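To make the second caveat concrete, here is a small sketch with a contrived input string (not taken from the question's data). It compares a single pass over a combined pattern with two sequential passes: removing the punctuation first turns "ht.tpexample" into "httpexample", which the URL pattern then matches.

```r
# Contrived input, invented for illustration only.
s <- "visit ht.tpexample now"

# Single pass with alternation: the "." is removed, but the text is not
# rescanned, so the newly formed "httpexample" stays in place.
single_pass <- gsub("http\\w+|[[:punct:]]", "", s)

# Sequential passes: once the "." is gone, "http\\w+" matches and
# removes "httpexample" in the second call.
sequential <- gsub("http\\w+", "", gsub("[[:punct:]]", "", s))

print(single_pass)
print(sequential)
```

Which behavior is wanted depends on the data, so the order of sequential passes is worth choosing deliberately.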

Consider creating your remove vector as you suggested, and then looping over it:

> s1 <- s
> remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

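The example above uses a single string, but because gsub() is vectorized over its input, the same loop also works on a whole character vector. If you put it into a function that returns the final result, you should also be able to pass it to one of the apply variants, for example to clean several columns of a table.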
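A minimal sketch of that idea follows. The function name remove_patterns is hypothetical, and the two input strings are the ones shown in the question. Since gsub() is vectorized over its x argument, the loop can be applied to the whole vector in one call; the Reduce() line is just an alternative way of writing the same fold.

```r
# Hypothetical helper: apply each pattern in turn to a character vector.
remove_patterns <- function(x, patterns) {
  for (p in patterns) {
    x <- gsub(p, "", x)  # gsub() processes every element of x
  }
  x
}

a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

cleaned <- remove_patterns(a.vector, remove)
print(cleaned)

# The same left fold written with Reduce(), using a.vector as the
# initial value and the patterns as the sequence to fold over.
cleaned2 <- Reduce(function(x, p) gsub(p, "", x), remove, a.vector)
```

For a data frame, something like lapply(df[cols], remove_patterns, patterns = remove) should work in the same way, where df and cols stand for your own table and column names.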

0

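If the patterns you are looking for are fixed and do not change from case to case, you can consider concatenating them into a single regular expression that combines all of the patterns.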

For the example you provided, you can try:

removePat <- "(@\\w+)|(http\\w+)|([[:punct:]])" 

a.vector <- gsub(removePat, "", a.vector)
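If the patterns already live in a vector, the combined pattern can be built with paste() rather than typed by hand. The sketch below also makes two assumptions that go beyond the answer above: because [[:punct:]] includes "#", it is swapped for a negated class that spares "#" so the hashtags in the desired output survive, and trimws() (available in R 3.2.0 and later) is used to drop the leftover leading and trailing spaces.

```r
# Third pattern is an assumption: any character that is not alphanumeric,
# whitespace, or "#", used here in place of [[:punct:]] to keep hashtags.
remove <- c("@\\w+", "http\\w+", "[^[:alnum:][:space:]#]")

# Join the sub-patterns into one alternation.
removePat <- paste(remove, collapse = "|")

a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# One pass over the whole vector, then trim surrounding whitespace.
cleaned <- trimws(gsub(removePat, "", a.vector))
print(cleaned)
```

With the two sample strings from the question this should reproduce the desired output, though other tweets may contain characters that need a different character class.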