How do I remove parenthesized words with the tm package?

Say I have a file in which part of the text looks like this:

"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

I want to remove "(API)", and this has to be done before running

corpus <- tm_map(corpus, removePunctuation)

Once "(API)" has been removed, the text should look like this:

"Other segment comprised of our active pharmaceutical ingredient business,which..."

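I have searched for a long time, but all I could find were answers that remove only the parentheses. The word inside them would still be left in the corpus, and I don't want it there either.

I really need someone to give me some hints, please.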

Source

2015-10-16 John Chou

You could use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE removes the parentheses automatically.

quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"         "pharmaceutical"
## [8] "ingredient"     "API"            "business"       "which"

Added:

If you want to tokenise the text first, then you need to lapply a gsub over the tokens until we add a regular-expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this will work:

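# tokenized version
library(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
# drop tokens that are entirely a parenthesized expression;
# !grepl() (unlike -grep()) still returns every token when nothing matches
toks[!grepl("^\\(.*\\)$", toks)]
## [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"
## [7] "pharmaceutical" "ingredient"     "business,which..."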
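
The lapply-plus-gsub idea mentioned above can also be sketched in base R alone. This is only an illustration, not quanteda's API: a whitespace split with strsplit() stands in for a real tokeniser, and the helper name drop_bracketed_tokens is hypothetical.

```r
# Hypothetical helper: drop tokens that consist entirely of a parenthesized
# expression such as "(API)". Expects a list of character vectors,
# one vector of tokens per document.
drop_bracketed_tokens <- function(token_list) {
  lapply(token_list, function(tokens) tokens[!grepl("^\\(.*\\)$", tokens)])
}

txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Assumption: a plain whitespace split stands in for a real tokeniser.
toks <- strsplit(txt, "\\s+")
cleaned <- drop_bracketed_tokens(toks)
print(cleaned[[1]])
```

Logical indexing with !grepl() is used so that a document without any bracketed token comes back unchanged.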

If you just want to remove the parenthetical expression as in the question, then you don't need either tm or quanteda:

# exactly as in the question 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt) 
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..." 

# with added punctuation 
txt2 <- "ingredient (API), business,which..." 
txt3 <- "ingredient (API). New sentence..." 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2) 
## [1] "ingredient, business,which..." 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3) 
## [1] "ingredient. New sentence..."

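The longer regular expression also catches the cases where the parenthetical expression ends a sentence or is followed by additional punctuation (such as a comma).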
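
If the parenthetical can contain more than one word, \\w* will no longer match it. One possible variant, offered as a sketch rather than as part of the answer above, removes everything up to the next closing parenthesis together with the whitespace in front of it. The function name strip_parentheticals is hypothetical, and nested parentheses are not handled.

```r
# Hypothetical helper: remove "(...)" and any whitespace directly before it.
# [^)]* matches any run of characters that does not contain ")".
strip_parentheticals <- function(x) {
  gsub("\\s*\\([^)]*\\)", "", x)
}

examples <- c(
  "ingredient (API) business,which...",
  "ingredient (API), business,which...",
  "ingredient (API). New sentence...",
  "ingredient (active pharmaceutical ingredient) business"
)

result <- strip_parentheticals(examples)
print(result)
```

Because the pattern consumes only the leading whitespace, whatever follows the closing parenthesis (a space, a comma, a full stop) is left in place.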

Source

2015-10-16 15:51:28

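Thanks for your answer, but I need to remove more than just the parentheses. The word inside needs to be removed too.

OK, I have revised my answer, see above.

Thanks for your help, this works too!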

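If it is only ever a single word, how about this (untested):

removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)})
# assign the result back, and run this before removePunctuation
corpus <- tm_map(corpus, removeBracketed)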
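
To show where this fits in the pipeline from the question, here is a minimal end-to-end sketch. It assumes the tm package is installed and builds a one-document corpus from the example sentence; the variable names after_brackets and after_punct are only for illustration. The pattern also consumes the whitespace in front of the parenthesis, a small change from the snippet above, so that no double space is left behind.

```r
library(tm)

txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
corpus <- VCorpus(VectorSource(txt))

# Remove a single parenthesized word plus the whitespace before it.
removeBracketed <- content_transformer(function(x) gsub("\\s*\\(\\w+\\)", "", x))

# Step 1: strip the bracketed word while the parentheses are still present.
corpus <- tm_map(corpus, removeBracketed)
after_brackets <- content(corpus[[1]])

# Step 2: only now remove punctuation.
corpus <- tm_map(corpus, removePunctuation)
after_punct <- content(corpus[[1]])

print(after_brackets)
print(after_punct)
```

The order matters: if removePunctuation ran first, "(API)" would already have become "API" and the pattern would have nothing to match.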

Source

2015-10-16 11:15:29 dash2

Thank you so much! It really works!
