You could use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE will remove the parentheses automatically.
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other" "segment" "comprised" "of" "our" "active" "pharmaceutical"
## [8] "ingredient" "API" "business" "which"
Added: If you want to tokenise the text first, then you need an lapply with a gsub, at least until we add a regular-expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this would work:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[-grep("^\\(.*\\)$", toks)]
## [1] "Other" "segment" "comprised" "of" "our" "active"
## [7] "pharmaceutical" "ingredient" "business,which..."
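The lapply-plus-gsub route mentioned above can be sketched in base R alone (using grepl to drop whole tokens rather than gsub, since the goal is to remove each parenthesised token entirely; `txt` is assumed to hold the example sentence from the question):

```r
# assumed example input from the question
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# tokenize on whitespace: one character vector per document
toks <- strsplit(txt, "\\s+")

# drop any token that is fully wrapped in parentheses, per document
toks <- lapply(toks, function(x) x[!grepl("^\\(.*\\)$", x)])

toks[[1]]
## [1] "Other" "segment" "comprised" "of" "our"
## [6] "active" "pharmaceutical" "ingredient" "business,which..."
```

This generalises to a list of many documents, which is why the lapply is needed: the regex is applied to each document's token vector in turn.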
If you simply want to remove the parenthetical expressions, as in the question, then you don't need tm or quanteda at all:
# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."
# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API). New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient. New sentence..."
The longer regular expression also catches the cases in which the parenthetical expression ends a sentence or is followed by additional punctuation such as a comma.
Thanks for your answer, but I need to remove more than just the parentheses; the word inside them needs to go too. –
OK, I've revised my answer, see above. –
Thanks for your help, this works too! –