2017-06-16 89 views
2

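Extract text from the innermost nested parentheses

From the character vector below, I am trying to extract a specific substring from each element. Starting from the vector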

string <- c("(Intercept)", "scale(AspectCos_30)", "scale(CanCov_500)", 
      "scale(DST50_30)", "scale(Ele_30)", "scale(NDVI_Tin_250)", "scale(Slope_500)", 
      "I(scale(Slope_500)^2)", "scale(SlopeVar_30)", "scale(CanCov_1000)", 
      "scale(NDVI_Tin_1000)", "scale(Slope_1000)", "I(scale(Slope_1000)^2)", 
      "scale(log(SlopeVar_30 + 0.001))", "scale(CanCov_30)", "scale(Slope_30)", 
      "I(scale(Slope_30)^2)") 

A good result would return the central text without any special characters, as shown below.

Good <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "Slope", 
      "SlopeVar", "CanCov", "NDVI", "Slope", "Slope", "SlopeVar", "CanCov", "Slope", "Slope") 

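Preferably, however, the resulting strings would also account for the ^2 and log associated with 'Slope' and 'SlopeVar', respectively. Specifically, every string containing ^2 would be converted to 'SlopeSq', and every string containing log would be converted to 'SlopeVarPs', as shown below.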

Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq", 
      "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov", "Slope", "SlopeSq") 

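I also have a long, ugly, and inefficient sequence of code that gets me barely halfway to the Good result, and I would appreciate any suggestions.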

Answer

3

As a not-so-efficient coder, I like to chain several regular expressions to get the result (the comment on each line says what that regex does):

library(stringr) 
library(dplyr) 
outcome <- string %>% 
    str_replace_all(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps") %>% # deal with the "log" entry 
    str_replace_all(".*\\((.*?\\))", "\\1") %>% # delete everything up to and including the last "(" 
    str_replace_all("(_\\d+)?\\)\\^2", "Sq") %>% # take care of ^2 
    str_replace_all("(_.+)?\\)?", "") # remove extra characters at the end (e.g. "_30" and ")") 


Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq", 
      "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov","Slope", "SlopeSq") 
all(outcome == Best) 
## TRUE 
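
To make the chain easier to follow, here is a minimal sketch that traces the same four patterns on the two trickiest inputs and keeps each intermediate result. It is an illustration rather than part of the pipeline above: it assumes base R `gsub()` with `perl = TRUE` behaves like `str_replace_all()` for these patterns, and the names `x` and `s1` to `s4` are made up for the example.

```r
# Trace the four substitutions step by step, using base R only.
x <- c("I(scale(Slope_500)^2)", "scale(log(SlopeVar_30 + 0.001))")

# Step 1: collapse the "log" entry to its variable name plus "Ps".
s1 <- gsub(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps", x, perl = TRUE)

# Step 2: drop everything up to and including the last "(".
s2 <- gsub(".*\\((.*?\\))", "\\1", s1, perl = TRUE)

# Step 3: turn an optional numeric suffix followed by ")^2" into "Sq".
s3 <- gsub("(_\\d+)?\\)\\^2", "Sq", s2, perl = TRUE)

# Step 4: strip any remaining "_..." suffix and closing ")".
s4 <- gsub("(_.+)?\\)?", "", s3, perl = TRUE)

# One row per step, one column per input.
print(rbind(s1, s2, s3, s4))
```

The order matters: the "log" entry is rewritten first so that the later, more general patterns leave "SlopeVarPs" alone, and the "^2" step has to run before the final cleanup, which would otherwise delete the ")^2" marker along with the suffix.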

Much appreciated. Clear and informative! I also didn't know you could use the pipe operator with stringr. Cool.


The pipe actually comes from 'dplyr'. I have edited my answer.