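Here is an example that uses agrep on an added field ("match"), which is a concatenation of the fields you want to use to identify duplicates (add other fields as needed). In this example, the list index corresponds to the row of the data.frame.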
# make a mock data.frame
df <- read.csv(textConnection("
id,FNAME,LNAME
1,Aaron,Golding
2,Aroon,Golding
3,Aaron,Golding
4,John,Bold
5,Markus,M.
6,John,Bald
"))
# string together the fields that might be matching and add to data.frame
df$match <- paste0(trimws(as.character(df$FNAME)),
                   trimws(as.character(df$LNAME)))
# make an empty list to fill in
possibleDups <- list()
# loop through each row and find matching strings
for(i in seq_len(nrow(df))){
  # the row itself is always among the hits here, so drop i from the result
  dups <- agrep(df$match[i], df$match)
  if(length(dups) != 1){
    possibleDups[[i]] <- dups[dups != i]
  } else {
    possibleDups[[i]] <- NA
  }
}
# proof - print the list of possible duplicates
print(possibleDups)
> [[1]]
> [1] 2 3
> [[2]]
> [1] 1 3
> [[3]]
> [1] 1 2
> [[4]]
> [1] 6
> [[5]]
> [1] NA
> [[6]]
> [1] 4
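How loosely agrep matches is controlled by its max.distance argument (default 0.1, a fraction of the pattern length that is rounded up to a whole number of edits). The following is a minimal sketch of that effect, assuming the same concatenated keys as above; the names 'keys', 'loose' and 'strict' are only for illustration.

```r
# Illustrative keys, taken from the first three rows of the mock data
keys <- c("AaronGolding", "AroonGolding", "AaronGolding")

# Default tolerance (max.distance = 0.1): the one-letter variant also matches
loose <- agrep(keys[1], keys)

# max.distance = 0 allows no edits, so only exact matches remain
strict <- agrep(keys[1], keys, max.distance = 0)

print(loose)
print(strict)
```

Keep in mind that agrep looks for the pattern approximately anywhere inside each string, so a short key can also match inside a longer one. Tightening max.distance is one way to reduce false positives if that happens.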
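If you just want the possible duplicates as a comma-separated string, you can use this loop instead of the previous one and drop the line that creates the empty list.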
for(i in seq_len(nrow(df))){
  dups <- agrep(df$match[i], df$match)
  if(length(dups) != 1){
    # collapse the other matching row numbers into one string
    df$possibleDups[i] <- paste(dups[dups != i], collapse = ',')
  } else {
    df$possibleDups[i] <- NA
  }
}
print(df)
>   id  FNAME   LNAME        match possibleDups
> 1  1  Aaron Golding AaronGolding          2,3
> 2  2  Aroon Golding AroonGolding          1,3
> 3  3  Aaron Golding AaronGolding          1,2
> 4  4   John    Bold     JohnBold            6
> 5  5 Markus      M.     MarkusM.         <NA>
> 6  6   John    Bald     JohnBald            4
I think the OP wants vectors as elements of the data.frame rather than comma-separated strings, so you could add 'unlist(possibleDups, recursive = FALSE)' as a new column (untested)
Also untested, but possibly a way to avoid the loop: 'df$possible_duplicates <- Map(setdiff, lapply(df$match, agrep, df$match), 1:nrow(df))'
@Moody_Mudskipper is 'Map' from the purrr library? – jdbcode
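For what it's worth, Map is part of base R (it is a wrapper around mapply), so no extra package is needed. Below is a minimal sketch of the loop-free idea from the comment above, assuming the same mock data as in the answer; the column name 'possibleDups' and the helper name 'hits' are only illustrative. It keeps the matches as integer vectors in a list column, and rows without a match get an empty vector rather than NA.

```r
# Mock data, rebuilt here so the sketch runs on its own
df <- data.frame(
  id    = 1:6,
  FNAME = c("Aaron", "Aroon", "Aaron", "John", "Markus", "John"),
  LNAME = c("Golding", "Golding", "Golding", "Bold", "M.", "Bald"),
  stringsAsFactors = FALSE
)

# Concatenate the fields used to identify duplicates
df$match <- paste0(trimws(df$FNAME), trimws(df$LNAME))

# For every key, collect the rows that agrep considers a fuzzy match
hits <- lapply(df$match, function(p) agrep(p, df$match))

# Remove each row's own index; base::Map pairs hits[[i]] with i
df$possibleDups <- Map(setdiff, hits, seq_len(nrow(df)))

print(df$possibleDups)
```

Whether a list column or a comma-separated string is more convenient depends on what you do next: the list column is easier to index programmatically, while the string is easier to read or export to CSV.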