2015-10-16 71 views
2

How can I remove a word in parentheses with the tm package?

Say I have a document that contains text like this:

"Other segment comprised of our active pharmaceutical ingredient (API) business,which..." 

I want to remove "(API)", and this has to be done before running

corpus <- tm_map(corpus, removePunctuation) 

After "(API)" is removed, the text should look like this:

"Other segment comprised of our active pharmaceutical ingredient business,which..." 

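I searched for a long time, but all I could find were answers that remove only the parentheses. I don't want the word inside them to appear in the corpus either.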

I would really appreciate some hints, please.

Answers

1

You could use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE automatically removes the parentheses.

quanteda::tokenize(txt, removePunct = TRUE) 
## tokenizedText object from 1 document. 
## Component 1 : 
## [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"         "pharmaceutical" 
## [8] "ingredient"     "API"            "business"       "which" 

Added:

If you want to tokenise the text first, then you need to lapply a gsub over the tokens until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this will work:

# tokenized version 
require(quanteda) 
toks <- tokenize(txt, what = "fasterword", simplify = TRUE) 
toks[!grepl("^\\(.*\\)$", toks)]  # grepl() keeps all tokens when nothing matches 
## [1] "Other"    "segment"   "comprised"   "of"    "our"    "active"   
## [7] "pharmaceutical" "ingredient"  "business,which..." 
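For several documents, where the tokens come as a list with one character vector per document, the lapply idea mentioned above might look like the following sketch. It uses base R strsplit() as a stand-in for a whitespace tokeniser rather than quanteda, and the names docs, toks and cleaned are made up for illustration.

```r
# Two made-up documents; only the first contains a parenthesized token.
docs <- c(d1 = "ingredient (API) business", d2 = "no brackets here")

# Assumption: splitting on whitespace stands in for a "fasterword"-style tokeniser.
toks <- strsplit(docs, "\\s+")

# Drop every token that is wrapped entirely in parentheses.
# grepl() returns a logical vector, so a document without any match is left unchanged.
cleaned <- lapply(toks, function(tk) tk[!grepl("^\\(.*\\)$", tk)])

print(cleaned)
```

Filtering with a logical index matters here: negative indexing with grep() would return an empty vector for d2, because grep() finds nothing to index by.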

If you just want to remove the parenthetical expressions as in the question, then you need neither tm nor quanteda:

# exactly as in the question 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt) 
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..." 

# with added punctuation 
txt2 <- "ingredient (API), business,which..." 
txt3 <- "ingredient (API). New sentence..." 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2) 
## [1] "ingredient, business,which..." 
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3) 
## [1] "ingredient. New sentence..." 

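The longer regular expression also catches the cases where the parenthetical expression ends a sentence or is followed by additional punctuation, such as a comma.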
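To see which cases that pattern covers, here is a small sketch that wraps it in a helper and runs it over a few inputs. The helper name strip_parenthetical and the test strings are hypothetical and only there for illustration.

```r
# Hypothetical helper wrapping the pattern used above.
strip_parenthetical <- function(x) {
  gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", x)
}

cases <- c(
  "ingredient (API) business",   # followed by a space
  "ingredient (API), business",  # followed by a comma
  "ingredient (API). New",       # followed by a full stop
  "ingredient (API)"             # nothing after the closing parenthesis
)

result <- strip_parenthetical(cases)
print(result)
```

The first three strings lose " (API)" and keep the character that followed it. The last one comes back unchanged, because the pattern requires one more character (whitespace or punctuation) after the closing parenthesis. Since \\w* does not match spaces, a multi-word parenthetical would also be left in place.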

+0

Thanks for your answer, but I need to remove more than just the parentheses. The word inside them needs to be removed as well.

+0

OK, I have revised my answer; see above.

+0

Thanks for your help, this works too!

1

If it is only single words, how about (untested):

removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)}) 
tm_map(corpus, removeBracketed) 
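The pattern can be checked on a plain string without building a corpus. The sketch below is base R only: strip_punct is an assumed stand-in for tm's removePunctuation (it simply deletes [[:punct:]] characters), and remove_bracketed, right_order and wrong_order are illustrative names. It shows why the bracketed word has to go before the punctuation does.

```r
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Same pattern as in the transformer above, applied to a character vector.
remove_bracketed <- function(x) gsub("\\(\\w+\\)", "", x)

# Assumption: a rough stand-in for tm's removePunctuation.
strip_punct <- function(x) gsub("[[:punct:]]", "", x)

# Brackets first, then punctuation: "API" is gone.
right_order <- strip_punct(remove_bracketed(txt))

# Punctuation first: the parentheses vanish, so "API" can no longer be found.
wrong_order <- remove_bracketed(strip_punct(txt))

# Removing "(API)" leaves two adjacent spaces; collapse them afterwards.
tidy <- gsub("\\s+", " ", right_order)

print(tidy)
print(wrong_order)
```

On a real corpus, the double space left behind would usually be handled by adding tm_map(corpus, stripWhitespace) after the other steps.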
+0

Thank you very much! It really works!
