Merging through fuzzy matching of variables in R

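I have two data frames (x & y) where the IDs are student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random whitespace, etc.), about 60% of the values do not align, even though I can eyeball the data and see that they should. Is there a way to reduce the level of non-matching so that manual editing at least becomes feasible? The data frames have about 700K observations.

R would be best. I know a little Python and some basic Unix tools. P.S. I read up on agrep(), but I don't understand how it can work on actual datasets, especially when the match is over more than one variable.

Update (data for the posted bounty):

Here are two example data frames, sites_a and sites_b. They could be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename, or c) both.

You can source the file test_sites.R, which is posted as a gist.

Ideally the answer would end with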

merge(sites_a, sites_b, by = **magic**)

Source

2011-08-19 user702432

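Can you provide a small portion of your data (or give us some fake data)? –

@RomanLuštrik Although this was not originally my question, I have a similar problem, so I created some example data and offered a bounty. –

@David Have you tried `merge(sites_a, sites_b, by = c("lon", "lat"))`? In your case, if you want to merge by name, you will have to put in more effort to make the names in the two data.frames match (good luck, heh). In the example –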

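The agrep function (part of base R), which does approximate string matching using the Levenshtein edit distance, is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution. But here is a suggestion: it records the matches in a separate list (if there are several equally good matches, those are recorded as well). Let's say your data.frame is called df: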

l <- vector('list', nrow(df))
matches <- list(mother = l, father = l)
for (i in seq_len(nrow(df))) {
    ## first look for a unique exact match
    father_id <- with(df, which(student_name[i] == father_name))
    if (length(father_id) == 1) {
        matches[['father']][[i]] <- father_id
    } else {
        old_father_id <- NULL
        ## tighten the allowed distance step by step to find the closest match
        for (m in 10:1) { ## m is the maximum distance
            father_id <- with(df, agrep(student_name[i], father_name, max.distance = m))
            if (length(father_id) == 1) {
                ## if we find a unique match, then stop
                matches[['father']][[i]] <- father_id
                break
            } else if (length(father_id) == 0 && length(old_father_id) > 0) {
                ## if we can't do better than multiple matches, then record them anyway
                matches[['father']][[i]] <- old_father_id
                break
            } else if (length(father_id) == 0) {
                ## if the nearest match is more than 10 different from the current pattern, then stop
                break
            } else if (m == 1) {
                ## last round and still several matches: record them all
                matches[['father']][[i]] <- father_id
                break
            }
            ## remember this round's matches for the next, stricter round
            old_father_id <- father_id
        }
    }
}

The code for mother_name would be basically the same. You could even put both in one loop, but this example is only meant as an illustration.
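As a more compact illustration of the same idea, the sketch below uses `adist()` from the base `utils` package to compute the generalized Levenshtein distance between every pair of names, and keeps the closest candidate only when it lies within a cutoff. The toy data frames `x` and `y`, the helper name `closest_match` and the cutoff of 2 edits are assumptions made up for illustration. On 700K rows the full distance matrix would be far too large, so the comparison would have to be done in blocks.

```r
# Hypothetical toy data: the same students, with typos in one table
x <- data.frame(student_name = c("Tom Smith", "Anna Lee", "Raj Patel"),
                stringsAsFactors = FALSE)
y <- data.frame(student_name = c("Ton Smith", "Anna  Lee", "Maria Gomez"),
                score = c(90, 85, 70),
                stringsAsFactors = FALSE)

# For each pattern, return the index of the closest candidate,
# or NA when even the closest one is more than max_edits away.
closest_match <- function(patterns, candidates, max_edits = 2) {
    d <- adist(patterns, candidates)        # rows: patterns, cols: candidates
    best <- apply(d, 1, which.min)          # closest candidate per pattern
    best_dist <- d[cbind(seq_along(best), best)]
    ifelse(best_dist <= max_edits, best, NA_integer_)
}

idx <- closest_match(x$student_name, y$student_name)

# Attach the matched score; unmatched rows get NA
merged <- data.frame(x, score = y$score[idx])
```

Rows without a close enough partner stay in `merged` with `NA`, which keeps them easy to pull out for manual review.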

Source

2011-08-19 10:42:17 nullglob

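Thanks, nullglob. Could you explain this using the data frames (x, y) and the variables (student_name etc.)? – user702432

Sorry @user702432, I didn't read your question carefully. You had already found `agrep`. I have added a suggestion on how to use it with a data.frame. – nullglob

This takes a list of common column names, matches using agrep on all of those columns combined, and then, if all.x or all.y is TRUE, appends the non-matching records, filling the missing columns with NA. Unlike merge, the column names to match on must be the same in each data frame. The challenge seems to be setting the agrep options correctly to avoid spurious matches.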

agrepMerge <- function(df1, df2, by, all.x = FALSE, all.y = FALSE, 
    ignore.case = FALSE, value = FALSE, max.distance = 0.1, useBytes = FALSE) { 

    df1$index <- apply(df1[,by, drop = FALSE], 1, paste, sep = "", collapse = "") 
    df2$index <- apply(df2[,by, drop = FALSE], 1, paste, sep = "", collapse = "") 

    matches <- lapply(seq_along(df1$index), function(i, ...) { 
     agrep(df1$index[i], df2$index, ignore.case = ignore.case, value = value, 
      max.distance = max.distance, useBytes = useBytes) 
    }) 

    df1_match <- rep(1:nrow(df1), sapply(matches, length)) 
    df2_match <- unlist(matches) 

    df1_hits <- df1[df1_match,] 
    df2_hits <- df2[df2_match,] 

    df1_miss <- df1[setdiff(seq_along(df1$index), df1_match),] 
    df2_miss <- df2[setdiff(seq_along(df2$index), df2_match),] 

    remove_cols <- colnames(df2_hits) %in% colnames(df1_hits) 

    df_out <- cbind(df1_hits, df2_hits[, !remove_cols, drop = FALSE]) 

    if(all.x) { 
     missing_cols <- setdiff(colnames(df_out), colnames(df1_miss)) 
     df1_miss[missing_cols] <- NA 
     df_out <- rbind(df_out, df1_miss) 
    } 
    if(all.y) { 
     missing_cols <- setdiff(colnames(df_out), colnames(df2_miss)) 
     df2_miss[missing_cols] <- NA 
     df_out <- rbind(df_out, df2_miss) 
    } 
    df_out[,setdiff(colnames(df_out), "index")] 
}
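Since many of the mismatches described in the question come from stray whitespace and letter case, it may help to normalise the key columns before calling any fuzzy matcher, so that `max.distance` can stay small and spurious matches become less likely. A minimal sketch using only base R; `normalise_name` is a hypothetical helper name and the sample strings are invented:

```r
# Hypothetical helper: lower-case, trim, and collapse runs of whitespace
normalise_name <- function(s) {
    s <- tolower(trimws(s))
    gsub("[[:space:]]+", " ", s)
}

a <- c("  Anna   Lee", "TOM SMITH ")
b <- c("anna lee", "tom snith")

a_clean <- normalise_name(a)

# After cleaning, the first pair can be matched exactly
exact_hit <- a_clean[1] == b[1]

# The second pair differs by a single substitution ("m" vs "n"),
# so a limit of one edit is enough for agrep to find it
fuzzy_hit <- agrep(a_clean[2], b, max.distance = 1L)
```

The same cleaning step could be applied to the `by` columns of both data frames before they are pasted together into the matching index.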

Source

2013-04-27 21:54:32 SchaunW

Put this in a Gist: https://gist.github.com/enricoferrero/0e41549d437aeda4d5f2f95116316c00 – enricoferrero
