2016-12-06 50 views

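Merging consecutive rows into one when the ranges are contiguous

I have a BED file of genomic coordinates loaded into R as a data frame. It looks like this: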

chrom start end 
chrX 400 600 
chrX 800 1000 
chrX 1000 1200 
chrX 1200 1400 
chrX 1600 1800 
chrX 2000 2200 
chrX 2200 2400 

There is no need to keep all of the rows; it would be better to collapse them into something like this:

chrom start end 
chrX 400 600 
chrX 800 1400 
chrX 1600 1800 
chrX 2000 2400 

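How can I do this?

I tried using dplyr, without success. group_by does not work on its own, because I don't know how to collapse each block of contiguous rows into a single row that takes the start coordinate of the first row and the end coordinate of the last row, and there are many such blocks.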

Answer

Use the GenomicRanges package from Bioconductor, which is built specifically for BED files and similar data:

library(GenomicRanges) 

# Example data 
gr <- GRanges(
    seqnames = Rle("chr1", 6), 
    ranges = IRanges(start = c(400, 800, 1200, 1400, 1800, 2000), 
        end = c(600, 1000, 1400, 1600, 2000, 2200))) 
gr 
# GRanges object with 6 ranges and 0 metadata columns: 
#  seqnames  ranges strand 
#   <Rle> <IRanges> <Rle> 
# [1]  chr1 [ 400, 600]  * 
# [2]  chr1 [ 800, 1000]  * 
# [3]  chr1 [1200, 1400]  * 
# [4]  chr1 [1400, 1600]  * 
# [5]  chr1 [1800, 2000]  * 
# [6]  chr1 [2000, 2200]  * 
# ------- 
# seqinfo: 1 sequence from an unspecified genome; no seqlengths 

# Merge contiguous (overlapping or adjacent) ranges into one using reduce(): 
reduce(gr) 
# GRanges object with 4 ranges and 0 metadata columns: 
#  seqnames  ranges strand 
#   <Rle> <IRanges> <Rle> 
# [1]  chr1 [ 400, 600]  * 
# [2]  chr1 [ 800, 1000]  * 
# [3]  chr1 [1200, 1600]  * 
# [4]  chr1 [1800, 2200]  * 
# ------- 
# seqinfo: 1 sequence from an unspecified genome; no seqlengths 

# EDIT: if the BED file is loaded as a data.frame (here called df), 
# it can be converted to a GRanges object: 
gr <- GRanges(seqnames = Rle(df$chrom), 
    ranges = IRanges(start = df$start, 
        end = df$end)) 
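
Since the question mentions dplyr, here is a rough sketch of how the same merge might be done without GenomicRanges. It is not part of the original answer. It assumes the data frame is called df with the columns chrom, start and end shown in the question; the names block and merged are made up for illustration. The idea is to sort the rows, start a new block whenever a row's start lies beyond the largest end seen so far on that chromosome, and then take the first start and last end of each block.

```r
library(dplyr)

# Example data copied from the question (df is an assumed name)
df <- data.frame(
  chrom = "chrX",
  start = c(400, 800, 1000, 1200, 1600, 2000, 2200),
  end   = c(600, 1000, 1200, 1400, 1800, 2200, 2400)
)

merged <- df %>%
  arrange(chrom, start) %>%
  group_by(chrom) %>%
  # A row opens a new block when its start is past every end seen so far;
  # touching rows (start == previous end) stay in the same block.
  # 'block' is a hypothetical helper column.
  mutate(block = cumsum(start > lag(cummax(end), default = first(end)))) %>%
  group_by(chrom, block) %>%
  # Collapse each block to its outermost coordinates
  summarise(start = min(start), end = max(end), .groups = "drop") %>%
  select(-block)

merged
```

Unlike reduce(), which also merges ranges that are merely adjacent (for example one ending at 1000 and the next starting at 1001), this sketch only merges rows that touch or overlap, which matches the BED-style coordinates in the question.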