按行应用自定义功能data.table返回值

我有点新data.tables的量不正确，我有一个包含表DNA基因组坐标是这样的：按行应用自定义功能data.table返回值

 chrom pause strand coverage 
    1:  1 3025794  +  1 
    2:  1 3102057  +  2 
    3:  1 3102058  +  2 
    4:  1 3102078  +  1 
    5:  1 3108840  -  1 
    6:  1 3133041  +  1

我写了一个定制函数，我想应用到我的大约2百万行表中的每一行，它使用GenomicFeatures的mapToTranscripts以字符串和新坐标的形式检索两个相关的值。我想将它们添加到我的表在两个新列，就像这样：

 chrom pause strand coverage  transcriptID CDS 
    1:  1 3025794  +  1 ENSMUST00000116652 196 
    2:  1 3102057  +  2 ENSMUST00000116652 35 
    3:  1 3102058  +  2 ENSMUST00000156816 888 
    4:  1 3102078  +  1 ENSMUST00000156816 883 
    5:  1 3108840  -  1 ENSMUST00000156816 882 
    6:  1 3133041  +  1 ENSMUST00000156816 880

的功能如下：

get_feature <- function(dt){ 

     coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) 
     hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) 
     tx_id <- tx_names[as.character(seqnames(hit))] 
     cds_coordinate <- sapply(ranges(hit), '[[', 1) 

     if(length(tx_id) == 0 || length(cds_coordinate) == 0) { 
     out <- list('NaN', 0) 
     } else { 
     out <- list(tx_id, cds_coordinate) 
     } 

     return(out) 
    }

然后，我做的：

counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]

我得到此错误，指示该函数返回两个比原始表格长度更短的列表，而不是每行一个新元素：

Warning messages: 
    1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"), ... : 
     Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items). 
    2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"), ... : 
     Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).

我假定使用.I运算符将按行应用函数并返回每行一个值。如果使用语句，我还确认该函数未使用返回空值。

然后我试图函数的这个版本的模拟：

get_feature <- function(dt) { return('I should be returned once for each row') }

并把它称为是这样的：

new.table <- counts[, get_feature(.SD), by = .I]

它使一个1行数据表，而不是一个原始长度。所以我得出结论：我的功能，或者我称之为的方式，是以某种方式崩溃了结果向量的元素。我究竟做错了什么？

更新（用溶液）：作为@StatLearner指出的那样，它在this answer说明的是，如?data.table解释的，.I仅旨在用于j（如在DT[i,j,by=]）。因此，等同于by=NULL，正确的语法是by=1:nrow(dt)，以便按行编号并逐行应用该函数。

不幸的是，对于我的特殊情况，这是完全没有效率的，我计算了100行的执行时间为20秒。我的3600万行数据集需要3个月才能完成。

就我而言，我不得不放弃并在整个桌子上使用mapToTranscripts函数，这需要几秒钟的时间，显然是预期的用途。

get_features <- function(dt){ coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinate hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map it to a transcript tx_hit <- as.character(seqnames(hits)) # get transcript number tx_id <- tx_names[tx_hit] # get transcript name from translation table return(data.table('transcriptID'= tx_id, 'CDS_coordinate' = start(hits)) } density <- counts[, get_features(.SD)]

然后映射回用mapFromTranscripts从GenomicFeatures包，所以我可以用一个data.tables连接来得到从原始表，这是什么，我要怎样做预期目的信息的基因组。

来源

2017-01-10 Antonio Martínez

当我需要为数据中的每一行应用函数时，我这样做。表由行号分组它：

counts[, get_feature(.SD), by = 1:nrow(counts)]

如所解释的在this answer，.I不旨在用于使用by因为它应该返回由分组产生行索引的序列。 by = .I不会引发错误的原因是data.table在data.table命名空间中创建的对象.I等于NULL，因此by = .I等效于by = NULL。

注意，使用by=1:nrow(dt)组由行号，可以让你的函数从data.table访问只有一行：

require(data.table) 
counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE), 
        pause = sample((3 * 10^6):(3.2 * 10^6), size = 100), 
        strand = sample(c('-','+'), size = 100, replace = TRUE), 
        coverage = sample.int(3, size = 100, replace = TRUE)) 

get_feature <- function(dt){ 
    coordinate <- data.frame(dt$chrom, dt$pause, dt$strand) 
    rowNum <- nrow(coordinate) 
    return(list(text = 'Number of rows in dt', rowNum = rowNum)) 
} 

counts[, get_feature(.SD), by = 1:nrow(counts)]

会产生data.table具有相同的行数，如counts，但coordinate将包含从counts

nrow     text rowNum 
1: 1 Number of rows in dt  1 
2: 2 Number of rows in dt  1 
3: 3 Number of rows in dt  1 
4: 4 Number of rows in dt  1 
5: 5 Number of rows in dt  1

一个单行而by = NULL将供应整个data.table的功能：

counts[, get_feature(.SD), by = NULL] 

        text rowNum 
1: Number of rows in dt 100

这是by工作的预期方式。

来源

2017-01-10 08:55:01 StatLearner

不错的答案@StatLearner。欢迎来到SO！ – rosscova

确实如此，@StatLearner，我用'by = NULL'检查，结果是一样的。使用'by = 1：NROW（dt）'按照我想要的方式应用函数，但速度非常慢，所以我不得不寻找另一种解决方法。我不能按照我想要的方式使用这个函数，但今天我学到了很多关于'data.tables'的知识，非常感谢！ –

PS：有趣的是，我从[这个其他答案]得到'by = .I'主意（http://stackoverflow.com/questions/25431307/r-data-table-apply-function-to-rows -using-columns-as-arguments），这是第一个通过谷歌搜索_“将函数应用于每行数据表”_出现的。我会把你提到的答案连接起来，以防有人得到和我一样的想法。 –

按行应用自定义功能data.table返回值

回答

相关问题