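Insert special characters in R based on existing words
2017-03-02

I have been looking for an intuitive solution to my problem. I have a huge list of words into which I have to insert a special character based on some conditions. If a two- or three-letter word appears in a cell, I want to add "+" to the left and right of it.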

global b2b banking would become global +b2b+ banking

how to finance commercial ale estate would become how +to+ finance commercial +ale+ estate

Here is a sample dataset:

sample <- c("commercial funding", 
"global b2b banking" 
"how to finance commercial ale estate" 
"opening a commercial account", 
"international currency account", 
"miami imports banking", 
"hsbc supply chain financing", 
"international business expansion", 
"grow business in Us banking", 
"commercial trade Asia Pacific", 
"business line of credits hsbc", 
"Britain commercial banking", 
"fx settlement hsbc", 
"W Hotels") 
data <- data.frame(sample) 

Also, is it possible to remove the rows that contain a word of length 1? Example:

W Hotels 

For all the single-letter words, I tried removing them with gsub:

gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample) 

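Rows like this should be removed from the dataset entirely.

Any help is highly appreciated.

Edit 1

Thanks for your help. I added a few more lines to it: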

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels") 
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)] 
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample) 
sample <- gsub(" ",",",sample) 
sample <- gsub("+,","+",sample) 
sample <- gsub(",+","+",sample) 
sample <- tolower(sample) 
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample) 
data <- data.frame(sample) 
data 




              sample 
1        commercial++funding 
2       global+++b2b+++banking 
3 how++++to+++finance++commercial+++ale+++estate 
4    international++currency++account 
5       miami++imports++banking 
6     hsbc++supply++chain++financing 
7    international++business++expansion 
8    grow++business+++in++++us+++banking 
9    commercial++trade++asia++pacific 
10   business++line+++of+++credits++hsbc 
11     britain++commercial++banking 
12       fx+++settlement++hsbc 

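Somehow I am unable to replace "+," with "+" using gsub. What am I doing wrong? "fx+,settlement,hsbc" should become "fx+settlement,hsbc", but instead the commas are being replaced with additional "+" characters.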

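So, you mean you want to remove any item that contains a whole word consisting of just one letter? –

Yes. If any word in a row has length one, I want to remove that row, even if the row has several words. Then, in the remaining rows, I want to add the special character "+" before and after the two-letter and three-letter words. – PSraj

OK, so what have you tried? –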

Answer

You need to do this in 2 steps: remove the items that contain a 1-character whole word, then wrap the 2-3 character whole words in +.

Use

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels") 
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)] 
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) 
data <- data.frame(sample) 
data 

R demo

sample[!grepl("\\b[[:alnum:]]\\b",sample)]删除包含单词边界(\b),信([[:alnum:]])和字边界模式的项目。

gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)行代替所有2-3个字母的整个单词,这些单词用+括起来。

结果:

                                       sample
1                          commercial funding
2                        global +b2b+ banking
3  +how+ +to+ finance commercial +ale+ estate
4              international currency account
5                       miami imports banking
6                 hsbc supply chain financing
7            international business expansion
8             grow business +in+ +Us+ banking
9               commercial trade Asia Pacific
10            business line +of+ credits hsbc
11                 Britain commercial banking
12                       +fx+ settlement hsbc

Note that W Hotels and opening a commercial account get filtered out.
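As a minimal sketch of the same two steps on a smaller vector (the helper name wrap_short_words is hypothetical, introduced here only for illustration; the two patterns are the ones used above):

```r
# Hypothetical helper, not part of the original answer:
# step 1 drops items containing a one-character whole word,
# step 2 wraps every 2-3 character whole word in "+".
wrap_short_words <- function(x) {
  kept <- x[!grepl("\\b[[:alnum:]]\\b", x)]
  gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", kept)
}

small <- c("global b2b banking", "W Hotels", "fx settlement hsbc")
wrapped <- wrap_short_words(small)
print(wrapped)
# "W Hotels" is dropped; the other two items become
# "global +b2b+ banking" and "+fx+ settlement hsbc"
```

Filtering before wrapping matters here: once the "+" signs are inserted, the first pattern would still work, but keeping the steps in this order makes the intent easier to follow.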

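Answer to the edit

You added some replacement operations to the code, but you are replacing literal strings, so you just need to pass the fixed=TRUE argument: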

sample <- gsub(" ",",",sample, fixed=TRUE) 
sample <- gsub("+,","+",sample, fixed=TRUE) 
sample <- gsub(",+","+",sample, fixed=TRUE) 

Otherwise, + is treated as a regex quantifier and has to be escaped to match a literal plus sign.
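To illustrate the point, here is a small sketch on the string from the question. The variable names are made up for the example; the three calls are alternative ways of matching a literal "+," and should give the same result:

```r
x <- "fx+,settlement,hsbc"

# Literal matching: the pattern is not interpreted as a regex at all.
fixed_result <- gsub("+,", "+", x, fixed = TRUE)

# Regex matching with the plus escaped by a backslash.
escaped_result <- gsub("\\+,", "+", x)

# Regex matching with the plus inside a bracket expression.
bracket_result <- gsub("[+],", "+", x)

print(fixed_result)
# "fx+settlement,hsbc"
```

fixed=TRUE is probably the simplest choice when the pattern contains no regex syntax you actually want to use.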

Also, if you need to remove all the + characters from the start of the string, use

sample <- sub("^\\++", "", sample) 
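A quick sketch of what that pattern does, using made-up input strings: "^\\++" matches one or more literal plus signs anchored at the start, so strings that do not begin with + are left alone and a plus later in the string is not touched.

```r
# Illustrative inputs only; they are not taken from the dataset above.
x <- c("+fx+,settlement,hsbc", "++how+,to", "global,+b2b+,banking")

stripped <- sub("^\\++", "", x)
print(stripped)
# "fx+,settlement,hsbc" "how+,to" "global,+b2b+,banking"
```

Compared with the ifelse(substr(...)) line in the question, this needs no explicit condition and also handles more than one leading +.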
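
If 'b2b' is to become '+b2b+', you need to include '[:digit:]' in the pattern. – coletl

I replaced all '[[:alpha:]]' (letters only) with '[[:alnum:]]' (letters + digits). Let the OP decide what to filter with and what to wrap with. –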
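A small sketch of the difference the two character classes make on the 'b2b' example (variable names are illustrative only):

```r
x <- "global b2b banking"

# Letters only: "b2b" is not a run of 2-3 letters, so nothing is wrapped.
alpha_result <- gsub("\\b([[:alpha:]]{2,3})\\b", "+\\1+", x)

# Letters and digits: "b2b" is a 3-character whole word, so it is wrapped.
alnum_result <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", x)

print(alpha_result)
# "global b2b banking"
print(alnum_result)
# "global +b2b+ banking"
```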


Your solution works great. There is just one last thing I am stuck on: I am unable to gsub "+," to just "+". Can you help? – PSraj