Regular expressions to extract text from a data frame and insert it into new columns

-1

I've been hunting through all the regex posts, but I can't seem to get any of them to work for me.

Example (some words have been redacted or changed):

df$text: "CommonWord #79 - EVENT type for 1200 seconds [Objects] xxx.xxx.xxx.xxx/## xxx.xxx.xxx.xxx/## Port: ##"

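1. I want to extract the number after the # and place it in a new column. I tried: df$number <- sub("\\#([0-9]{2,4}).*", "\\1", df$text)

The result is "CommonWord 79". I can't seem to find the right regex to remove the first word.
2. With the next regex I want to pull "EVENT type" into another column. Both "EVENT" and "type" can change, so I need to pull the text after the "-" and before "for".
3. The last two regexes I need are for the IP address and subnet mask, and then for the port number (digits only). I need all of these in new columns.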

Sorry for the long-winded question. I've been banging my head against this one.

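Solved part 1, the event type, and the port:

df$number <- sub(".*\\#(\\d{1,4}).*", "\\1", df$text)
df$attackType <- sub(".*\\-.(\\w+\\s\\w+).*","\\1", df$text)
df$port <- as.numeric(sub(".*\\:(\\d{1,})?","\\1", df$text))

I'm still having trouble finding the IP address: I only get the last digit of the first group of numbers. For example, the actual IP is 127.0.0.1/28, but I get 7.0.0.1/28 back. Once I've worked out how to get the IP address/mask, I need to figure out how to find multiple matches in the text. The regex is long-winded; I expect to optimize it later.

df$IPs <- sub(".*(+\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/\\d{2,}).*","\\1", df$text)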

Source

2016-11-29 user3192046

-1

You just had to add ".*" at the start to match any number of characters before the #:

sub(".*\\#([0-9]{2,4}).*", "\\1", x)

To create a new column:

df$new_col <- as.numeric(sub(".*\\#([0-9]{2,4}).*", "\\1", df$text))
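
As a minimal sketch of why the leading ".*" matters, the snippet below runs both versions of the pattern on a small made-up data frame. The sample rows and IP addresses are assumptions for illustration only; they are not the asker's data.

```r
# Hypothetical sample data; the values are invented for illustration.
df <- data.frame(
  text = c(
    "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/28 10.0.0.1/24 Port: 80",
    "OtherWord #1234 - LOGIN failure for 30 seconds [Objects] 127.0.0.1/28 Port: 443"
  ),
  stringsAsFactors = FALSE
)

# Without a leading ".*", sub() only replaces from "#" onwards,
# so everything before the "#" (here "CommonWord ") is kept.
partial <- sub("#([0-9]{2,4}).*", "\\1", df$text)

# With ".*" on both sides the pattern spans the whole string,
# so the replacement leaves only the captured digits.
df$new_col <- as.numeric(sub(".*#([0-9]{2,4}).*", "\\1", df$text))
```

Note that sub() returns the input unchanged when the pattern does not match, so rows without a "#" followed by 2 to 4 digits would become NA (with a warning) after as.numeric.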

Source

2016-11-29 04:55:22

Thanks for as.numeric! I got sub working as you suggested, with the regex ".*\\#(\\d{1,4}).*". Now I need to work out the other requirements. Thanks again – user3192046

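Are those x's supposed to represent digits? Some sample values would help, especially since IP addresses don't exactly follow that pattern.

Anyway, I've added a few things to search for. I like to use the rex package together with stringr::str_view_all to test regex patterns. The matches are highlighted in the viewer pane.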

text <- "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/## xxx.xxx.xxx.xxx/## Port: 80" 
library(stringr) 
library(rex) 

# show matches where at least one digit follows # 
str_view_all(text, rex(at_least(digit, 1) %if_prev_is% "#")) 

# show matches where characters are after - and before 'for' 
str_view_all(text, rex((prints %if_prev_is% "-") %if_next_is% "for")) 

# show matches where four dot-separated groups of 1-3 digits are followed by /
str_view_all(text, rex(between(digit, 1, 3), dot, 
         between(digit, 1, 3), dot, 
         between(digit, 1, 3), dot, 
         between(digit, 1, 3), "/")) 

# show matches where digits follow 'Port:' 
str_view_all(text, rex(digits %if_prev_is% "Port: "))
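
The calls above only highlight matches. To actually pull the values out, one possible base R sketch uses regexpr and regmatches with PCRE lookarounds that mirror the rex patterns. The helper name first_match and the sample mask "/28" are assumptions introduced here for illustration.

```r
text <- "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/28 Port: 80"

# Hypothetical helper: first match of a PCRE pattern per element, NA if none.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.vector(m) != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)  # regmatches() returns only the elements that matched
  out
}

# Digits that directly follow "#".
number <- as.numeric(first_match("(?<=#)\\d+", text))

# Text after "- " and before " for " (lazy, so it stops at the first " for ").
event <- first_match("(?<=- ).+?(?= for )", text)

# Digits that directly follow "Port: ".
port <- as.numeric(first_match("(?<=Port: )\\d+", text))
```

Because the helper is vectorised, each result can be assigned straight to a column, for example df$event <- first_match("(?<=- ).+?(?= for )", df$text).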

Source

2016-11-29 04:55:36

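The x's do represent digits; I redacted them for privacy reasons. Assume each one is an IP address, meaning four groups of numbers ranging from 1 to 254, for example 127.0.0.1, and so on – user3192046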
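
Given that description, here is a hedged sketch of one way to get every address/mask pair. A leading ".*" in sub() is greedy, so it consumes as many characters as it can and leaves only a single digit for the first group, which is why 127.0.0.1/28 comes back as 7.0.0.1/28. Extracting with gregexpr and regmatches avoids the leading ".*" and also returns multiple matches. The sample strings are made up, and the pattern only checks the shape of the address, not that each group is within a valid range.

```r
# Hypothetical sample strings; the addresses are invented for illustration.
text <- c(
  "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/24 10.0.0.1/8 Port: 80",
  "OtherWord #80 - EVENT type for 60 seconds [Objects] 127.0.0.1/28 Port: 22"
)

# Four dot-separated groups of 1-3 digits, then "/" and a 1-2 digit mask.
ip_mask <- "\\d{1,3}(\\.\\d{1,3}){3}/\\d{1,2}"

# Reproduces the problem: the greedy ".*" swallows the leading digits.
greedy <- sub(paste0(".*(", ip_mask, ").*"), "\\1", text[2], perl = TRUE)

# Extract every match per string; the result is a list of character vectors.
all_ips <- regmatches(text, gregexpr(ip_mask, text, perl = TRUE))

# One way to store them in a single column: join the matches with a space.
ips_column <- vapply(all_ips, paste, character(1), collapse = " ")
```

A list column (df$IPs <- all_ips) would keep the matches separate instead of collapsing them into one string.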
