2014-11-24 78 views
4

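Merging two data frames of different sizes by matching columns

I am trying to "merge" column V of one data frame into another when columns X and Y are equal. The match has to work in both directions: dOne.X == dTwo.X & dOne.Y == dTwo.Y as well as dOne.X == dTwo.Y & dOne.Y == dTwo.X. I solved this with a for loop, but it is slow when the data frame dOne is large (on my machine it takes 25 minutes when length(dOne.X) == 500000). I would like to know whether there is a way to solve this with faster "vectorized" operations.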

Data Frame ONE 
X Y V 
a b 2 
a c 3 
a d 0 
a e 0 
b c 2 
b d 3 
b e 0 
c d 2 
c e 0 
d e 0 

Data Frame TWO 
X Y V 
a b 1 
a c 1 
a d 1 
b c 1 
b d 1 
c d 1 
e d 1 

Expected Data Frame after the columns are merged 
X Y V V2 
a b 2 1 
a c 3 1 
a d 0 1 
a e 0 0 
b c 2 1 
b d 3 1 
b e 0 0 
c d 2 1 
c e 0 0 
d e 0 1 

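Above is an example of what I want to do. This is the code I have been using so far; it is slow on large data frames (hundreds of thousands of rows):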

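copyadjlistValueColumn <- function(dOne, dTwo) {
    # Rows of dOne with no match in dTwo keep V2 = 0
    dOne$V2 <- 0

    # Assumes X and Y are factors (the data.frame() default before R 4.0);
    # levels() returns NULL for character columns
    lv <- union(levels(dOne$Y), levels(dOne$X))

    # Give all four columns the same levels so they can be compared with ==
    dTwo$X <- factor(dTwo$X, levels = lv)
    dTwo$Y <- factor(dTwo$Y, levels = lv)
    dOne$X <- factor(dOne$X, levels = lv)
    dOne$Y <- factor(dOne$Y, levels = lv)

    # One full scan of dOne per row of dTwo: this is the slow part
    for(i in 1:nrow(dTwo)) {
        row <- dTwo[i,]
        dOne$V2[dOne$X == row$X & dOne$Y == row$Y] <- row$V
        dOne$V2[dOne$X == row$Y & dOne$Y == row$X] <- row$V
    }
    dOne
}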

Here is a test case covering what I expect (using the data frames above):

test_that("Copy V column to another Data Frame", { 
    dfOne <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"), 
         Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"), 
         V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0)) 

    dfTwo <- data.frame(X=c("a", "a", "a", "b", "b", "c", "e"), 
         Y=c("b", "c", "d", "c", "d", "d", "d"), 
         V=c(1, 1, 1, 1, 1, 1, 1)) 

    lv <- union(levels(dfTwo$Y), levels(dfTwo$X)) 
    dfExpected <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"), 
          Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"), 
          V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0), 
          V2=c(1, 1, 1, 0, 1, 1, 0, 1, 0, 1)) 
    dfExpected$X <- factor(dfExpected$X, levels = lv) 
    dfExpected$Y <- factor(dfExpected$Y, levels = lv) 

    dfMerged <- copyadjlistValueColumn(dfOne, dfTwo) 

    expect_identical(dfMerged, dfExpected) 
}) 

Any suggestions?

Thanks a lot :)

+0

Possible duplicate of [How to join data frames in R (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right) – 2014-11-24 13:17:01

+0

'merge(dOne, dTwo, by = c("X", "Y"), all.x = TRUE)'? Although for some reason it does not give exactly the output you want – 2014-11-24 13:19:12

+1

Hey David, I think that is because I have to match it in a "two-way" manner: 'dOne.X == dTwo.X & dOne.Y == dTwo.Y' and 'dOne.X == dTwo.Y & dOne.Y == dTwo.X' – alfakini 2014-11-24 13:21:26

Answers

2

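Do two merges, reversing the order of the matching columns in the second one, to get the "two-way" match. Then you can use e.g. rowSums to combine the two created columns into one.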

d1 <- merge(dfOne, dfTwo, by.x = c("X", "Y"), by.y = c("X", "Y"), all.x = TRUE) 
d2 <- merge(d1, dfTwo, by.x = c("X", "Y"), by.y = c("Y", "X"), all.x = TRUE) 
cbind(dfOne, V2 = rowSums(cbind(d2$V.y, d2$V), na.rm = TRUE)) 


# X Y V V2 
# 1 a b 2 1 
# 2 a c 3 1 
# 3 a d 0 1 
# 4 a e 0 0 
# 5 b c 2 1 
# 6 b d 3 1 
# 7 b e 0 0 
# 8 c d 2 1 
# 9 c e 0 0 
# 10 d e 0 1 

For faster alternatives to merge, check the data.table and dplyr alternatives here: stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right/
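
Staying in base R, the two-way match can also be written as a single lookup. The sketch below is not part of the answer above: it builds an order-independent key for each pair with pmin()/pmax() and looks it up with match(). The helper name pair_key and the "|" separator are hypothetical. It assumes that "|" never occurs in X or Y and that each unordered pair appears at most once in dfTwo (match() returns only the first hit).

```r
dfOne <- data.frame(X = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                    Y = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                    V = c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0),
                    stringsAsFactors = FALSE)

dfTwo <- data.frame(X = c("a", "a", "a", "b", "b", "c", "e"),
                    Y = c("b", "c", "d", "c", "d", "d", "d"),
                    V = c(1, 1, 1, 1, 1, 1, 1),
                    stringsAsFactors = FALSE)

# Hypothetical helper: the same key for (x, y) and (y, x).
# Assumes "|" does not occur in the values themselves.
pair_key <- function(x, y) {
    x <- as.character(x)
    y <- as.character(y)
    paste(pmin(x, y), pmax(x, y), sep = "|")
}

# Position in dfTwo of each dfOne pair, NA where there is no match
idx <- match(pair_key(dfOne$X, dfOne$Y), pair_key(dfTwo$X, dfTwo$Y))

# Unmatched pairs get 0, as in the expected output of the question
dfOne$V2 <- ifelse(is.na(idx), 0, dfTwo$V[idx])
dfOne
```

Because match() returns positions in dfOne's own row order, this does not depend on dfOne being sorted by X and Y. The cbind() step in the merge version does rely on that, since merge() sorts its result on the key columns by default.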

+1

Thanks a lot, Henrik :) – alfakini 2014-11-25 00:53:20

+0

@alf., please note that I made a small change at the beginning to get rid of the 'rename'. – Henrik 2014-11-25 08:04:36

2

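Here is a possible approach with the data.table package. It should be especially efficient for a big data set like yours:

First, convert both to data.table objects and add keys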

library(data.table) 
setkey(setDT(dfOne), X, Y) 
setkey(setDT(dfTwo), X, Y) 

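Then perform the join on the X & Y combination - the join is done by matching the key columns X,Y of dfOne with the key columns X,Y of dfTwo respectively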

dfOne[dfTwo, V2 := i.V] 

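Now perform the join on the Y & X combination - the join is done by matching the key columns X,Y of dfOne with the key columns Y,X of dfTwo respectively.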

setkey(dfTwo, Y, X) 
dfOne[dfTwo, V2 := i.V][] 

The result (I would keep the unmatched rows as NAs rather than zeros, because it makes more sense that way):

#  X Y V V2 
# 1: a b 2 1 
# 2: a c 3 1 
# 3: a d 0 1 
# 4: a e 0 NA 
# 5: b c 2 1 
# 6: b d 3 1 
# 7: b e 0 NA 
# 8: c d 2 1 
# 9: c e 0 NA 
# 10: d e 0 1 
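
If zeros are wanted for the unmatched rows, as in the expected output of the question, the NAs can be overwritten in place afterwards. The sketch below is a variation rather than the code above: it assumes a data.table version that supports the on= argument, so the join columns are named directly instead of re-keying dfTwo, and it builds the tables with data.table() so that it is self-contained.

```r
library(data.table)

dfOne <- data.table(X = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                    Y = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                    V = c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0))

dfTwo <- data.table(X = c("a", "a", "a", "b", "b", "c", "e"),
                    Y = c("b", "c", "d", "c", "d", "d", "d"),
                    V = c(1, 1, 1, 1, 1, 1, 1))

# Forward match: dfOne.X == dfTwo.X and dfOne.Y == dfTwo.Y
dfOne[dfTwo, on = c("X", "Y"), V2 := i.V]

# Reverse match: dfOne.X == dfTwo.Y and dfOne.Y == dfTwo.X
dfOne[dfTwo, on = c(X = "Y", Y = "X"), V2 := i.V]

# Replace the remaining NAs (pairs found in neither direction) with 0
dfOne[is.na(V2), V2 := 0]
dfOne[]
```

Since no key is set here, dfOne keeps its original row order.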
+1

Very nice update.. – Arun 2014-11-24 17:56:42

+0

@alf. Did you try this approach? I have not had any feedback from you. – 2014-11-25 08:28:38

1

With dplyr:

library(dplyr) 

left_join(dfOne, dfTwo, by = c("X", "Y")) %>% 
    left_join(dfTwo, by = c("X" = "Y", "Y" = "X")) %>% 
    mutate(V2 = ifelse(is.na(V.y), V, V.y)) %>% 
    select(X, Y, V = V.x, V2) %>% 
    do(replace(., is.na(.), 0)) 
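
The same pipeline can be written a little more directly with coalesce(), which picks the first non-missing value across its arguments. This is only a sketch under assumptions: it needs a dplyr version that provides coalesce(), and the column names V_fwd and V_rev are made up here to avoid the automatic .x/.y suffixes.

```r
library(dplyr)

dfOne <- data.frame(X = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                    Y = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                    V = c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0),
                    stringsAsFactors = FALSE)

dfTwo <- data.frame(X = c("a", "a", "a", "b", "b", "c", "e"),
                    Y = c("b", "c", "d", "c", "d", "d", "d"),
                    V = c(1, 1, 1, 1, 1, 1, 1),
                    stringsAsFactors = FALSE)

result <- dfOne %>%
    # Forward match: X to X, Y to Y (V_fwd is a hypothetical name)
    left_join(rename(dfTwo, V_fwd = V), by = c("X", "Y")) %>%
    # Reverse match: dfOne.X to dfTwo.Y, dfOne.Y to dfTwo.X
    left_join(rename(dfTwo, V_rev = V), by = c("X" = "Y", "Y" = "X")) %>%
    # First non-missing of forward, reverse, then 0 for no match at all
    mutate(V2 = coalesce(V_fwd, V_rev, 0)) %>%
    select(X, Y, V, V2)

result
```

left_join() keeps the rows of dfOne in their original order, so no re-sorting is needed afterwards.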
+0

Thanks junkka! I will stick with Henrik's answer because it is easier to understand. – alfakini 2014-11-25 01:09:56
