gsub returns more than the regex match

I have a character vector like this:

x = c(
  "blahblah, blah blah, Plate 3, blah blah",
  "blah blah, blah_Plate 2_blah, blah",
  "blah, blah, blah blah, blah plate_3",
  "blah blah, blah, plate 5.txt"
)

I want to extract the plate number from each of these file names, so I tested my regex match with:

gsub("\\<Plate\\>.[0-9]","\\1",workdf_nums_plats$Bioplex_Files)

so that I can eventually do something like this:

workdf_nums_plats$plat <- ifelse(grepl("\\<Plate\\>.[0-9]", workdf_nums_plats$Bioplex_Files), gsub("\\<Plate\\>.[0-9]","\\1",workdf_nums_plats$Bioplex_Files), NA)

I am getting the entire string back. I tried using \b to set word boundaries, but that did not help.

Any advice would be much appreciated!

Source

2017-08-16 AwesomeeExpress

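It looks like you found the answers to this question helpful. Consider accepting the one that helped you most (the check mark to the left of the answer). – CPak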

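You need to define a capturing group inside the pattern, and match plate case-insensitively rather than as a whole word, because you also need to match it right after _ (which is a word character, too):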

workdf_nums_plats$plat <- sub(".*?Plate.([0-9]+).*","\\1", workdf_nums_plats$Bioplex_Files, ignore.case=TRUE)

See the regex demo and the R demo below:

Bioplex_Files <- c("blahblah, blah blah, Plate 3, blah blah", "blah blah, blah_Plate 2_blah, blah", "blah, blah, blah blah, blah plate_3", "blah blah, blah, plate 5.txt") 
plat <- sub(".*?Plate.([0-9]+).*","\\1", Bioplex_Files, ignore.case=TRUE) 
plat 
## => [1] "3" "2" "3" "5"

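Pattern details

.*? - any 0+ characters, as few as possible
Plate - the substring plate (case-insensitive due to ignore.case=TRUE)
. - any single character
([0-9]+) - Group 1 (referred to with the \1 backreference in the replacement pattern), matching one or more digits
.* - any 0+ characters, up to the end of the string.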
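One thing to keep in mind with this pattern: when a string contains no plate number, sub finds nothing to replace and returns the input unchanged. A minimal sketch of combining it with the grepl/ifelse idea from the question, so that such entries become NA instead. The helper name extract_plate and the sample file names are made up for illustration.

```r
# Hypothetical helper (not from the answer): plate number, or NA when absent.
extract_plate <- function(x) {
  pattern <- ".*?Plate.([0-9]+).*"
  # TRUE where the pattern matches somewhere in the string
  hit <- grepl(pattern, x, ignore.case = TRUE)
  # Keep the captured digits for matches, NA for everything else
  ifelse(hit, sub(pattern, "\\1", x, ignore.case = TRUE), NA_character_)
}

files <- c("blah, Plate 3, blah", "blah plate_12.txt", "no number here")
plat <- extract_plate(files)
plat
## => [1] "3"  "12" NA
```

Without the grepl check, the third element would come back as the full original string.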

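If you want to match Plate as a whole word, you can prepend Plate with the (?:_|\b) pattern: ".*?(?:_|\\b)Plate.([0-9]+).*". Here, (?:_|\b) is a non-capturing group (i.e., it does not create $2 or $1 etc.) that matches either _ or a word boundary.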
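To see what the extra (?:_|\b) buys you, here is a small sketch comparing the two patterns on a made-up file name where "plate" is only part of a longer word. Note that perl = TRUE is an assumption added here to make the regex engine explicit; the answer's own call does not use it.

```r
# Sketch only: perl = TRUE and the "Template 7" sample are additions, not from the answer.
files <- c("blah blah, blah_Plate 2_blah, blah", "Template 7")

# Pattern without the boundary check
loose <- sub(".*?Plate.([0-9]+).*", "\\1", files,
             ignore.case = TRUE, perl = TRUE)

# Pattern that requires _ or a word boundary right before Plate
strict <- sub(".*?(?:_|\\b)Plate.([0-9]+).*", "\\1", files,
              ignore.case = TRUE, perl = TRUE)

loose   # "2" "7": the "plate" inside "Template" also matches
strict  # "2" "Template 7": no match for the second element, so it is returned unchanged
```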

Another solution is to match just the value you need, and stringr is convenient for that purpose:

> str_extract(Bioplex_Files, "(?i)(?<=Plate.)[0-9]+") 
[1] "3" "2" "3" "5"

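Here, (?i) is the case-insensitive flag, and (?<=Plate.) is a positive lookbehind asserting that Plate plus any one character appears immediately before [0-9]+, i.e. 1 or more digits. Since a lookbehind is a zero-width assertion (i.e., it does not add text to the match value), only the digits are returned.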
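If stringr is not available, the same lookbehind idea can be sketched in base R with regexpr and regmatches. This is an alternative I am assuming is acceptable here, not something from the answer itself; perl = TRUE is required because lookbehinds are PCRE syntax.

```r
files <- c("blahblah, blah blah, Plate 3, blah blah",
           "blah blah, blah_Plate 2_blah, blah",
           "blah, blah, blah blah, blah plate_3",
           "blah blah, blah, plate 5.txt")

# Position of the first run of digits preceded by "plate" + one character
m <- regexpr("(?i)(?<=Plate.)[0-9]+", files, perl = TRUE)

# regmatches() silently drops non-matching elements, so fill a vector of NAs
# to keep the result aligned with the input
plat <- rep(NA_character_, length(files))
plat[m != -1] <- regmatches(files, m)
plat
## => [1] "3" "2" "3" "5"
```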

Source

2017-08-16 18:50:58

Thank you for helping me understand my own solution and for providing an alternative! – AwesomeeExpress

@AwesomeeExpress Just so you know, you can also use 'str_match(Bioplex_Files, "(?i)(?:_|\\b)Plate.([0-9]+)")[,2]'. Glad to help. –

One approach is to use regmatches and regexec to return the captured subexpression.

regmatches(test, regexec("[Pp]late.?([0-9]+)", test)) 
[[1]] 
[1] "Plate 3" "3"  

[[2]] 
[1] "Plate 2" "2"  

[[3]] 
[1] "plate_3" "3"  

[[4]] 
[1] "plate 5" "5"

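Here, "[Pp]" matches either "P" or "p", "late" matches itself literally, ".?" matches 0 or 1 of any character, and "()" captures the desired value, "[0-9]+", i.e. one or more digits.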

Since this returns a list, you will want to use sapply to pull the second element out of each list item, as follows.

sapply(regmatches(test, regexec("[Pp]late.?([0-9]+)", test)), "[", 2) 
[1] "3" "2" "3" "5"
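A side note on strings with no plate number: regmatches gives an empty character vector for them, and taking element 2 of that yields NA, so the result stays aligned with the input. A small sketch using vapply, which is a stricter variant of sapply that I am substituting here; the sample vector is made up.

```r
# Made-up input: the second element has no plate number
files <- c("blah, blah, Plate 3", "nothing to see here")

matches <- regmatches(files, regexec("[Pp]late.?([0-9]+)", files))

# vapply enforces that every result is a single character value;
# v[2] is the captured group, or NA when there was no match
plat <- vapply(matches, function(v) v[2], character(1))
plat
## => [1] "3" NA
```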

Data

test <- 
c("blahblah, blah blah, Plate 3, blah blah", "blah blah, blah_Plate 2_blah, blah", 
"blah, blah, blah blah, blah plate_3", "blah blah, blah, plate 5.txt")

Source

2017-08-16 18:58:04 lmo

Thank you! Your solution is more elegant than mine, and the regex is very... insightful! – AwesomeeExpress
