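Extract multiple strings from ill-defined user input data

I want to create a lookup table from data in which the entries in a column (user_entry) come in different formats and may contain multiple instances per row.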

# create example dataframe. 
id <- c(1111,1112,1113,1114) 
user_entry <- c("999/1001","1002;1003","999/1004\n999/1005","9991006 9991007") 
df <- data.frame(id,user_entry) 

> df 
    id   user_entry 
1 1111   999/1001 
2 1112   1002;1003 
3 1113 999/1004\n999/1005 
4 1114 9991006 9991007

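I am only interested in the 4-digit codes, which may or may not be preceded by a 3-digit location code and/or a separator such as "/" or a space. Each entry may contain more than one 4-digit code, and I would like each code listed separately in the final lookup table (see lookup below).

The code below does what I am looking for, but it is really inelegant: it nests one loop inside another and grows a data frame inside the loop. Is there a neater way to do this?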

library(dplyr)
library(stringr)

# use stringr package to extract only digits
df <- df %>%
  mutate(entries = str_extract_all(user_entry, "[[:digit:]]+")) %>%
  select(-user_entry)

# initialise lookup dataframe
lookup <- df[FALSE, ]
for (record in 1:nrow(df)) {
  entries <- df$entries[[record]]
  for (element in 1:length(entries)) {
    # only interested in 4 digit codes
    if (nchar(entries[element]) > 3) {
      # remove 3 digit code if it is still attached
      lookup_entry <- gsub('.*?(\\d{4})$', '\\1', entries[element])
      lookup <- rbind(lookup, data.frame(id = df$id[[record]], entries = lookup_entry))
    }
  }
}

> lookup 
    id entries 
1 1111 1001 
2 1112 1002 
3 1112 1003 
4 1113 1004 
5 1113 1005 
6 1114 1006 
7 1114 1007

Source

2017-04-18 lapsel

Maybe you could extract the last 4-digit sequence of each number? [`str_extract_all(user_entry, "\\d{4}\\b")`](https://regex101.com/r/Hm20nm/1)?

Using base R,

matches <- regmatches(user_entry, gregexpr("(\\d{4})\\b", user_entry)) 

data.frame(
    id = rep(id, lengths(matches)), 
    entries = unlist(matches), 
    stringsAsFactors = FALSE 
) 
#  id entries 
# 1 1111 1001 
# 2 1112 1002 
# 3 1112 1003 
# 4 1113 1004 
# 5 1113 1005 
# 6 1114 1006 
# 7 1114 1007
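
To see why `rep(id, lengths(matches))` lines each id up with its codes, it can help to inspect the intermediate values. The following is a small sketch on the same example data, using only base R; it simply prints the pieces the answer above combines.

```r
id <- c(1111, 1112, 1113, 1114)
user_entry <- c("999/1001", "1002;1003", "999/1004\n999/1005", "9991006 9991007")

# gregexpr() locates every match in each element;
# regmatches() returns the matched text as a list, one element per entry
matches <- regmatches(user_entry, gregexpr("\\d{4}\\b", user_entry))
str(matches)

# lengths() counts the codes found in each entry ...
lengths(matches)

# ... and rep() repeats each id that many times, so ids and codes align
rep(id, lengths(matches))
```

Because the ids are repeated once per match, an entry with no 4-digit code would contribute no rows at all to the lookup table.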

Source

2017-04-18 15:49:12 r2evans

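This assumes that the 999 will always come before the 4 digits. I don't know whether that will always be the case. If the last entry were 1007999, the regex would return 7999. – Kristofersen

Other than that, it is a much cleaner solution than mine. I figured I would post mine anyway for the OP's benefit. I'm not sure how exactly he should handle the 999s. – Kristofersen

The patterns suggest that (regardless of the "999") the 4-digit codes of interest are always on the right, which is sufficient for the example. The risk of reducing an SO question to "minimal/reproducible" is over-simplifying and not providing enough variety. \*shrug\* – r2evans

Not very elegant, but I think it should work for your case: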
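
The edge case raised in the comments above is easy to check. This sketch uses a hypothetical entry, "1007999", which does not appear in the question's data, and compares it with a well-formed entry under the same pattern.

```r
# Hypothetical entry in which the 3-digit code trails the 4-digit code
trailing <- "1007999"
# Entry in the format shown in the question
leading <- "9991007"

pattern <- "\\d{4}\\b"

# The pattern keeps the LAST four digits before a word boundary
trailing_match <- regmatches(trailing, gregexpr(pattern, trailing))[[1]]
leading_match <- regmatches(leading, gregexpr(pattern, leading))[[1]]

trailing_match  # "7999", not the intended "1007"
leading_match   # "1007"
```

Whether this matters depends on the real data: if the 4-digit code is always on the right, as in the example, the pattern is sufficient.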

library("tidyverse") 
df1 <- df %>% 
    separate_rows(user_entry, sep = '(/|;|\\n|\\s)') 

extract <- str_extract(df1$user_entry,"(?=\\d{3})\\d{4}$") 
df1$extract <- extract 
df2 <- df1[!is.na(df1$extract),] 
df2 


> df2 
    id user_entry extract 
#1111  1001 1001 
#1112  1002 1002 
#1112  1003 1003 
#1113  1004 1004 
#1113  1005 1005 
#1114 9991006 1006 
#1114 9991007 1007

Source

2017-04-18 16:26:32 PKumar
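
For readers who prefer to stay in the tidyverse, the regex from the base R answer can also be combined with a list column and `tidyr::unnest()`. This is only a sketch, not something posted in the thread; it assumes tidyr 1.0 or later (for the `cols` argument) and, like the base R answer, assumes the 4-digit code is always the rightmost part of each number.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  id = c(1111, 1112, 1113, 1114),
  user_entry = c("999/1001", "1002;1003", "999/1004\n999/1005", "9991006 9991007"),
  stringsAsFactors = FALSE
)

lookup <- df %>%
  # list column: every run of four digits that ends at a word boundary
  mutate(entries = str_extract_all(user_entry, "\\d{4}\\b")) %>%
  select(-user_entry) %>%
  # one row per extracted code; entries with no match are dropped
  unnest(cols = entries)

lookup
```

This avoids both the nested loops and the repeated `rbind()` from the question, at the cost of returning a tibble rather than a plain data frame.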
