Manipulating two data frames based on strings of different lengths

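I asked a question here, "Finding the index based on two data frames of strings", and got a perfect answer. Now I am facing another problem that I cannot solve. If my second data set has multiple columns, I can handle it with the following: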

setDT(strs)[, c('colids1','colids2') := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][]

This is fine as long as all columns of my second data set (strs) have the same length, but if the lengths differ, it does not work and gives me an error.

So let's say my first data set is

lut <- structure(list(V1 = c("O75663", "O95400", "O95433", NA, NA), 
    V2 = c("O95456", "O95670", NA, NA, NA), V3 = c("O75663", 
    "O95400", "O95433", "O95456", "O95670"), V4 = c("O95456", 
    "O95670", "O95801", "P00352", NA), V1 = c("O75663", "O95400", 
    "O95433", NA, NA), V2 = c("O95456", "O95670", NA, NA, NA), 
    V3 = c("O75663", "O95400", "O95433", "O95456", "O95670"), 
    V4 = c("O95456", "O95670", "O95801", "P00352", NA)), .Names = c("V1", 
"V2", "V3", "V4", "V1", "V2", "V3", "V4"), row.names = c(NA, 
-5L), class = "data.frame")

and my second data set is

strs <- structure(list(strings = structure(c(2L, 3L, 4L, 5L, 6L, 7L, 
1L, 1L), .Label = c("", "O75663", "O95400", "O95433", "O95456", 
"O95670", "O95801"), class = "factor"), strings2 = structure(c(4L, 
2L, 6L, 5L, 3L, 1L, 1L, 1L), .Label = c("", "O75663", "O95433", 
"O95456", "P00352", "P00492"), class = "factor"), strings3 = structure(c(4L, 
6L, 7L, 8L, 2L, 3L, 5L, 1L), .Label = c("", "O75663", "O95400", 
"O95456", "O95670", "O95801", "P00352", "P00492"), class = "factor"), 
    strings4 = structure(c(2L, 5L, 3L, 4L, 1L, 1L, 1L, 1L), .Label = c("", 
    "O95400", "O95456", "O95801", "P00492"), class = "factor"), 
    strings5 = structure(c(8L, 2L, 7L, 1L, 3L, 6L, 5L, 4L), .Label = c("O75663", 
    "O95400", "O95433", "O95456", "O95670", "O95801", "P00352", 
    "P00492"), class = "factor")), .Names = c("strings", "strings2", 
"strings3", "strings4", "strings5"), class = "data.frame", row.names = c(NA, 
-8L))

This is what I tried to do

df<- setDT(strs)[, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][]

It works if the columns of strs all have the same length, but it does not work when the lengths vary, as in the example I give here.

Source

2016-07-06 nik

The error is pretty clear. Try this 'strs[c(1:3,5)] <- lapply(strs[c(1:3,5)], as.character)' and then run your 'data.table' statement. Is the resulting 'df' what you expect? – Sumedh

@Sumedh Thanks for your message, but it does not solve the problem. I did what you said and then ran df <- setDT(strs)[, paste0('colids_', seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm = TRUE) > 0))), by = 1:nrow(strs)][] and got the same error. – nik

@Sumedh I have been trying every suggestion I could find on the web, but I don't know why it is not working! – nik

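Converting the factor variables in strs to character variables can also easily be done with data.table. Supposing your strs data set is already a data.table, you should do: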

strs[, names(strs) := lapply(.SD, as.character)]

If strs is not a data.table yet, you should use:

setDT(strs)[, names(strs) := lapply(.SD, as.character)]
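As a quick check that the conversion took effect, something like the following sketch could be used. The table `dt_demo` and its column names are made up for illustration and are not part of the question's data; the sketch assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical two-column table whose factor columns have different level sets
dt_demo <- data.table(
  s1 = factor(c("O75663", "O95400", "")),
  s2 = factor(c("P00352", "", ""))
)

# Convert every column to character by reference
dt_demo[, names(dt_demo) := lapply(.SD, as.character)]

# Inspect the column classes after the conversion
sapply(dt_demo, class)
```

If any column still reports "factor" here, the conversion step was not applied to it.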

After that, you can perform the operation you asked for. With everything chained together, it looks like:

setDT(strs)[, lapply(.SD, as.character) 
      ][, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), 
       by = 1:nrow(strs)][]
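To see what the inner expression computes for a single cell, here is a minimal base R sketch. The small table `lut_demo` and the helper name `find_cols` are hypothetical and only serve as illustration; the `which(colSums(lut == x, na.rm = TRUE) > 0)` pattern itself is taken from the question.

```r
# Hypothetical lookup table, much smaller than the lut in the question
lut_demo <- data.frame(
  V1 = c("O75663", "O95400", NA),
  V2 = c("O95456", NA, NA),
  V3 = c("O75663", "O95456", "O95670"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: indices of the columns of lut that contain the string x
find_cols <- function(x, lut) {
  x <- as.character(x)                         # a factor is turned into its label first
  hits <- colSums(lut == x, na.rm = TRUE) > 0  # TRUE for each column with a match
  toString(which(hits))
}

find_cols("O75663", lut_demo)          # present in columns 1 and 3
find_cols(factor("O95456"), lut_demo)  # present in columns 2 and 3
find_cols("", lut_demo)                # no match, so the result is an empty string
```

The explicit `as.character()` inside the helper is a design choice for the sketch, so that it behaves the same whether it is handed a factor or a character value.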

Source

2016-07-06 15:37:08 Jaap

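Thank you very much for your valuable comment. I have already upvoted your answer, it is great! Would it be possible to have a look at my real data? Once you have looked, I can remove it from the web. Thanks – nik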

@nik I am already looking ;-) – Jaap

Great, thanks bro. I have also accepted your answer because it is very informative and I learned a lot from it. Thanks again – nik

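I learned this from @scentoni: rapply is a recursive version of lapply. Here it converts all the factor vectors to character. If it is set to replace (how = "replace"), each element of the list which is not itself a list and has a class included in classes is replaced by the result of applying the function, which is as.character here, to the element.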

strs <- rapply(strs, as.character, classes="factor", how="replace")

Then execute

df<- setDT(strs)[, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][]
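The effect of `classes` and `how = "replace"` can be seen on a small list. This is only a sketch in base R; the list `x` and its element names are made up for illustration.

```r
# Hypothetical mixed list: two factors and one numeric vector
x <- list(
  a = factor(c("O75663", "")),
  b = factor(c("P00352", "O95400")),
  n = c(1.5, 2)
)

# Only elements whose class is "factor" are passed to as.character;
# with how = "replace" every other element is left in place unchanged
y <- rapply(x, as.character, classes = "factor", how = "replace")

# Inspect the class of each element after the replacement
vapply(y, class, character(1))
```

The numeric element `n` is untouched because its class is not listed in `classes`.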

Source

2016-07-06 13:45:50 Learner

This one works too! Could you comment on this function? – nik

Thanks, it works – nik
