Finding the indices of rows with near-duplicate values

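I am trying to find near-duplicate rows in a data set. For my data, I have to add a "POSSIBLE_DUPLICATES" column that contains the indices of the possible duplicates for each row. The data contains not only the fields FNAME and LNAME but also some other information that could be used to find duplicates.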

| id | FNAME | LNAME | POSSIBLE_DUPLICATES | 
|----|--------|---------|---------------------| 
| 1 | Aaron | Golding | 2,3     | 
| 2 | Aroon | Golding | 1,3     | 
| 3 | Aaron | Golding | 2,1     | 
| 4 | John | Bold | 6     | 
| 5 | Markus | M.  |      | 
| 6 | John | Bald | 4     |

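I tried to find the indices with the agrep() function, but I don't quite understand how to call it on multiple columns or how to concatenate the indices of all matching rows. Any help would be appreciated.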

Source

2017-08-02 Евгений М

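Here is an example that uses agrep on an added field ("match"), which is a concatenation of the fields selected for identifying duplicates (add other fields as needed). In this example, the list indices correspond to the rows of the data.frame.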

# make a mock data.frame
df <- read.csv(textConnection("
id,FNAME,LNAME
1,Aaron,Golding
2,Aroon,Golding
3,Aaron,Golding
4,John,Bold
5,Markus,M.
6,John,Bald
"))

# string together the fields that might be matching and add to data.frame
df$match <- paste0(trimws(as.character(df$FNAME)),
                   trimws(as.character(df$LNAME)))

# make an empty list to fill in
possibleDups <- list()

# loop through each row and find matching strings
for (i in seq_len(nrow(df))) {
  # agrep() returns the indices of all approximate matches, including row i itself
  dups <- agrep(df$match[i], df$match)
  if (length(dups) != 1) {
    possibleDups[[i]] <- dups[dups != i]
  } else {
    possibleDups[[i]] <- NA
  }
}

# proof - print the list of possible duplicates
print(possibleDups)

> [[1]] 
> [1] 2 3 

> [[2]] 
> [1] 1 3 

> [[3]] 
> [1] 1 2 

> [[4]] 
> [1] 6 

> [[5]] 
> [1] NA 

> [[6]] 
> [1] 4
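The same idea can be wrapped in a function so that the columns used for matching are a parameter, which is one way to handle "multiple columns". This is only a sketch: `find_possible_dups` and its `cols` argument are hypothetical names, not something from the answer above, and rows without a match get an empty vector here instead of NA.

```r
# Mock data matching the example above.
df <- data.frame(
  id    = 1:6,
  FNAME = c("Aaron", "Aroon", "Aaron", "John", "Markus", "John"),
  LNAME = c("Golding", "Golding", "Golding", "Bold", "M.", "Bald"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one key per row from the chosen columns,
# then fuzzy-match every key against all keys with agrep().
find_possible_dups <- function(data, cols, max.distance = 0.1) {
  parts <- lapply(data[cols], function(x) trimws(as.character(x)))
  key <- do.call(paste0, unname(parts))
  lapply(seq_along(key), function(i) {
    hits <- agrep(key[i], key, max.distance = max.distance)
    hits[hits != i]  # drop the row's match with itself
  })
}

dups <- find_possible_dups(df, c("FNAME", "LNAME"))
print(dups)
```

To include further fields, add their names to `cols`. Keep in mind that a fractional `max.distance` scales with the length of the key, so longer keys tolerate more edits.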

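If you just want the duplicates as a comma-separated string, you can use this loop instead of the previous one and remove the line that creates the empty list.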

# pre-allocate the new column; rows without a match stay NA
df$possibleDups <- NA_character_

for (i in seq_len(nrow(df))) {
  dups <- agrep(df$match[i], df$match)
  if (length(dups) != 1) {
    df$possibleDups[i] <- paste(dups[dups != i], collapse = ",")
  }
}

print(df)

>   id  FNAME   LNAME        match possibleDups
> 1  1  Aaron Golding AaronGolding          2,3
> 2  2  Aroon Golding AroonGolding          1,3
> 3  3  Aaron Golding AaronGolding          1,2
> 4  4   John    Bold     JohnBold            6
> 5  5 Markus      M.     MarkusM.         <NA>
> 6  6   John    Bald     JohnBald            4
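One thing to be aware of is that agrep() looks for an approximate match of the pattern anywhere inside each string, so a short key can match a much longer one. If whole-string comparison is preferred, a possible alternative (a different technique from the one above, not the answer's method) is to compute pairwise edit distances with adist() from base R. The cutoff `max_edits` below is an assumed value chosen for this toy data.

```r
# Mock data matching the example above.
df <- data.frame(
  id    = 1:6,
  FNAME = c("Aaron", "Aroon", "Aaron", "John", "Markus", "John"),
  LNAME = c("Golding", "Golding", "Golding", "Bold", "M.", "Bald"),
  stringsAsFactors = FALSE
)
key <- paste0(trimws(df$FNAME), trimws(df$LNAME))

# Pairwise generalized Levenshtein distances between whole keys.
d <- adist(key)

# Assumed cutoff: keys at most 2 edits apart count as possible duplicates.
max_edits <- 2

df$possibleDups <- vapply(seq_along(key), function(i) {
  hits <- setdiff(which(d[i, ] <= max_edits), i)  # exclude the row itself
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ",")
}, character(1))

print(df)
```

Unlike the default `max.distance` of agrep(), which is a fraction of the pattern length, this cutoff is an absolute number of edits, so it may need tuning for longer keys.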

Source

2017-08-02 23:32:36 jdbcode

I think the OP wants vectors as elements of the data.frame, not comma-separated strings, so you could add 'unlist(possibleDups, recursive = FALSE)' as a new column (untested)

Also untested, possibly a way to avoid the loop: 'df$possible_duplicates <- Map(setdiff, lapply(df$match, agrep, df$match), 1:nrow(df))'

@Moody_Mudskipper is 'Map' from the purrr library? – jdbcode
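For reference, Map() is part of base R, so no extra package is needed. The loop-free suggestion from the comments can be sketched as follows, assuming the same mock data as in the answer. The result is a list column in which each element is an integer vector of row indices, empty when no possible duplicate was found.

```r
# Mock data matching the example in the answer.
df <- data.frame(
  id    = 1:6,
  FNAME = c("Aaron", "Aroon", "Aaron", "John", "Markus", "John"),
  LNAME = c("Golding", "Golding", "Golding", "Bold", "M.", "Bald"),
  stringsAsFactors = FALSE
)
df$match <- paste0(trimws(df$FNAME), trimws(df$LNAME))

# For each key, agrep() returns the indices of all fuzzy matches,
# including the row itself.
hits <- lapply(df$match, agrep, df$match)

# setdiff() removes each row's own index; Map() pairs hits with row numbers.
df$possible_duplicates <- Map(setdiff, hits, seq_len(nrow(df)))

str(df$possible_duplicates)
```

Individual entries can then be read with, for example, `df$possible_duplicates[[1]]`.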
