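Insert a special character in R based on existing words
I have been looking for an intuitive solution to my problem. I have a huge list of words in which I need to insert a special character based on some conditions. If a two- or three-letter word appears in a cell, I want to add "+" to its left and right.
Example: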
global b2b banking
would be converted to global +b2b+ banking
how to finance commercial ale estate
would be converted to how +to+ finance commercial +ale+ estate
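A minimal sketch of the wrapping rule described above, assuming a "word" is a run of letters or digits delimited by word boundaries. The helper name `wrap_short_words` is hypothetical. Note that under this rule "how" is itself a three-letter word, so it is wrapped as well, unlike in the hand-written example.

```r
# Wrap every standalone 2- or 3-character alphanumeric token in "+".
# Assumption: tokens are separated by spaces or other non-word characters.
wrap_short_words <- function(x) {
  gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", x, perl = TRUE)
}

wrap_short_words("global b2b banking")
# [1] "global +b2b+ banking"

wrap_short_words("how to finance commercial ale estate")
# [1] "+how+ +to+ finance commercial +ale+ estate"
```

If a leading short word should stay unwrapped, a follow-up step can strip the leading "+", as the edit further down does.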
Here is a sample dataset:
sample <- c("commercial funding",
"global b2b banking",
"how to finance commercial ale estate",
"opening a commercial account",
"international currency account",
"miami imports banking",
"hsbc supply chain financing",
"international business expansion",
"grow business in Us banking",
"commercial trade Asia Pacific",
"business line of credits hsbc",
"Britain commercial banking",
"fx settlement hsbc",
"W Hotels")
data <- data.frame(sample)
Also, is it possible to remove rows that contain a word of length 1? Example:
W Hotels
For all the single-letter words, I tried to remove them with gsub:
gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample)
Such rows should be removed from the dataset entirely.
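One possible way to drop whole rows rather than just the single letter is to build a logical index with `grepl` and subset with it. This is only a sketch: the variable names `x`, `has_single_letter`, `kept`, `df` and `df_kept` are made up for illustration, and the pattern treats any standalone letter (including "a") as a single-letter word.

```r
x <- c("opening a commercial account", "W Hotels", "fx settlement hsbc")

# TRUE for every element that contains a standalone single letter.
has_single_letter <- grepl("\\b[[:alpha:]]\\b", x, perl = TRUE)

# Keep only the elements without such a word.
kept <- x[!has_single_letter]
kept
# [1] "fx settlement hsbc"

# The same logical index can be used to filter the rows of a data frame.
df <- data.frame(sample = x, stringsAsFactors = FALSE)
df_kept <- df[!has_single_letter, , drop = FALSE]
```

`gsub` rewrites text inside each element, so it cannot remove an element; subsetting with a logical vector is what changes the number of rows.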
Any help is highly appreciated.
Edit 1
Thanks for the help. I have added a few more lines:
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)]
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample)
sample <- gsub(" ",",",sample)
sample <- gsub("+,","+",sample)
sample <- gsub(",+","+",sample)
sample <- tolower(sample)
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample)
data <- data.frame(sample)
data
sample
1 commercial++funding
2 global+++b2b+++banking
3 how++++to+++finance++commercial+++ale+++estate
4 international++currency++account
5 miami++imports++banking
6 hsbc++supply++chain++financing
7 international++business++expansion
8 grow++business+++in++++us+++banking
9 commercial++trade++asia++pacific
10 business++line+++of+++credits++hsbc
11 britain++commercial++banking
12 fx+++settlement++hsbc
Somehow I am unable to replace "+," with "+" using gsub. What am I doing wrong? For example, "fx+,settlement,hsbc" should become "fx+settlement,hsbc", but it is also replacing the "," with "++".
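A likely cause is that "+" is a quantifier in regular expressions, so the pattern ",+" means "one or more commas" rather than a comma followed by a plus sign, and every comma gets replaced. A sketch of one way around this, assuming the intent is a literal match, is to pass `fixed = TRUE` (or to escape the plus as "\\+"). The input string below is the intermediate value for the "fx settlement hsbc" row.

```r
x <- "+fx+,settlement,hsbc"

# fixed = TRUE treats the pattern as a literal string, not a regex.
x <- gsub("+,", "+", x, fixed = TRUE)
# "+fx+settlement,hsbc"

x <- gsub(",+", "+", x, fixed = TRUE)
# unchanged here, because no ",+" sequence is left

# Drop a leading "+"; here the plus is escaped because this is a regex.
x <- sub("^\\+", "", x)
x
# [1] "fx+settlement,hsbc"
```

With `fixed = TRUE` no escaping is needed; in regex mode the equivalent patterns would be "\\+," and ",\\+".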
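So you mean you want to remove any item that contains a whole word made up of just a single letter?
Yes. If a row has several words and any one of them has length one, I want to remove that row. For the remaining rows, I want to add the special character "+" before and after every two-letter and three-letter word. – PSraj
OK, so what have you tried?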