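How to read a file with multiple headers?

I have a file that contains multiple headers, and I need the headers as well. How can I read such a multi-header file?

The head of my file: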

>\>1 Len = 254 

>13 112 1 18 

>15 112 1 30 

>22 11 3 25 

>\>1 Reverse Len = 254 

>14 11 1 15 

>\>2 Len = 186 

>19 15 2 34 

>25 11 3 25 

>....

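How can I read this file and import the values into an R object (such as a data frame)?

Also, it would be good if someone could help remove the header lines and add a column indicating the table number (or marking that a row is the first row of another table).

I would prefer not to read it in as strings and parse it.

If it helps, the data is a report from the mummer package.

I have also uploaded an example here: http://m.uploadedit.com/ba3c/1429271308686.txt

Source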

2015-04-17 ameerosein

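The usual approach is to load the file with 'readLines' and then convert each line to character or numeric as needed. Search around and you will find several questions similar to yours. –

Could you upload/link to a shortened version of the actual data file (.txt, .dat ...) so that we can have a go? – steinbock

Basically you have to write your own parser, if nobody has done so for this file format. – Roland

In the end I parsed the data with a few lines of code and imported it into R.

I merged all the tables into one table and added a new column holding the name of each table ...

Here it is: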

lns <- readLines("filename.txt")  # read the file as character lines

idx <- grepl("^>", lns)  # TRUE for header lines, i.e. lines that start with ">"

df <- read.table(text = lns[!idx])  # parse every non-header line into one table

wd <- diff(c(which(idx), length(idx) + 1)) - 1  # number of data rows under each header

df$label <- rep(lns[idx], wd)  # label each row with the header of its table
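To make the steps above concrete, here is a small sketch that runs them on the sample lines from the question, held in a character vector instead of a file, and then pulls the table number out of the header. The column name `table` and the regular expression are my own illustrative choices, not part of the original answer.

```r
# Sample lines from the question, kept in memory instead of read from a file
lns <- c(">1 Len = 254",
         "13 112 1 18",
         "15 112 1 30",
         "22 11 3 25",
         ">1 Reverse Len = 254",
         "14 11 1 15",
         ">2 Len = 186",
         "19 15 2 34",
         "25 11 3 25")

idx <- grepl("^>", lns)                          # header lines
df  <- read.table(text = lns[!idx])              # all data rows in one table
wd  <- diff(c(which(idx), length(idx) + 1)) - 1  # data rows under each header
df$label <- rep(lns[idx], wd)                    # header repeated per data row

# Illustrative extra step: the number right after ">" as an integer column
df$table <- as.integer(sub("^> *([0-9]+).*$", "\\1", df$label))
```

Because the label column keeps the full header text, rows from a "Reverse" table can still be told apart from the forward table with the same number.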

Another way to handle this particular case is a Perl one-liner that someone suggested to me on another forum. I do not know how it works, but it does:

https://support.bioconductor.org/p/66724/#66767

Thanks to everyone for the helpful answers and comments that led me to this answer :)

Source

2015-04-19 10:51:48 ameerosein

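It is not really easy to do this without reading the whole thing in as strings and parsing it, but you can easily wrap those steps in a function, as I have done with the read.mtable function in my "SOfun" package.

This works on your sample data: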

## library(devtools) 
## install_github("mrdwab/SOfun") 

library(SOfun) 
X <- read.mtable("http://m.uploadedit.com/ba3c/1429271308686.txt", ">") 
X <- X[!grepl("Reverse", names(X))] 

names(X) 
# [1] "> 1 Len = 354" "> 2 Len = 127" "> 3 Len = 109" "> 4 Len = 52" 
# [5] "> 5 Len = 1189" "> 6 Len = 1007" "> 7 Len = 918" "> 10 Len = 192" 
# [9] "> 11 Len = 169" "> 13 Len = 248" "> 14 Len = 2500" 
X[1] 
# $`> 1 Len = 354` 
#        V1  V2  V3  V4
# 1  203757   1   1  35
# 2  122132   1   1  87
# 3  203756   1   1 354
# 4       1   1   1 354
# 5   42364  12   1  89
# 6  203757  37  37  91
# 7  122132  90  90  38
# 8   42364 102  91  37
# 9  203757 129 129 168
# 10  42364 140 129 212
# 11 122132 129 129 212
# 12 203757 298 298  43

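As you can see, it creates a list of data.frames, each named after its "Len =" header line.

The two arguments used here are the file location (here, a URL) and the chunkID, which can be set to a regular expression or a fixed pattern that you want to match. Here, we want to match any line that starts with ">" to indicate where a new dataset begins.

Source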
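For readers who would rather not install a package, the same idea can be sketched in base R. This is only an illustration of splitting on a chunk pattern; `read_chunks` is a hypothetical helper of mine and is not the implementation of read.mtable. The last header in the sample is made up to show an empty table.

```r
# Hypothetical helper: one data.frame per header line (not SOfun's code)
read_chunks <- function(lines, chunk_pattern = "^>") {
  lines  <- lines[nzchar(trimws(lines))]   # drop blank lines
  is_hdr <- grepl(chunk_pattern, lines)    # TRUE where a new chunk starts
  grp    <- cumsum(is_hdr)                 # chunk number of every line
  chunks <- split(lines[!is_hdr],
                  factor(grp[!is_hdr], levels = seq_len(sum(is_hdr))))
  names(chunks) <- trimws(lines[is_hdr])   # name each chunk by its header
  lapply(chunks, function(x) {
    # a header with no data rows becomes an empty data.frame
    if (length(x) == 0) data.frame() else read.table(text = x)
  })
}

lines <- c(">1 Len = 254", "13 112 1 18", "15 112 1 30", "22 11 3 25",
           ">1 Reverse Len = 254", "14 11 1 15",
           ">2 Len = 186", "19 15 2 34", "25 11 3 25",
           ">2 Reverse Len = 186")         # made-up empty table

X <- read_chunks(lines)
names(X)
X[[">1 Reverse Len = 254"]]
```

Using a factor with one level per header is what keeps empty tables in the result instead of silently dropping them.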

2015-04-17 12:38:24 A5C1D2H2I1M1N2O1R2T1

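Is there any constraint on the length of the file or anything else? The sample file I applied the function to is a small part of my file, and it works fine there, but when I use it on the original file I get an error saying: Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 2 did not have 6 elements – ameerosein

@ameerosein, there should not be a file length limit. The error you mention is related to 'read.table' not being able to guess your columns correctly. Is whitespace the only delimiter? – A5C1D2H2I1M1N2O1R2T1

Yes... As you can see in the message, it is about the second line... The second line of my example and of the original file are the same... I copied the starting lines as the example... so the only difference is the size of the file... – ameerosein

Or, if you want a long-winded, cumbersome approach...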
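One way to investigate an error like this is to count the fields on every data line before parsing. read.table decides the number of columns from the first five lines of its input, so lines of differing width can produce a "did not have N elements" error. The sketch below uses made-up lines, since the original file is not shown here, and the `V1`... column names are just placeholders.

```r
# Made-up lines for illustration: the last data line has 6 fields, not 4
lns <- c(">1 Len = 254",
         "13 112 1 18",
         "15 112 1 30",
         "22 11 3 25 7 9")

dat <- lns[!grepl("^>", lns)]     # data lines only

con <- textConnection(dat)
n   <- count.fields(con)          # number of fields on each line
close(con)

table(n)                          # how many lines have each width
which(n != n[1])                  # lines whose width differs from the first

# fill = TRUE pads short rows with NA; col.names fixes the column count
df <- read.table(text = dat, fill = TRUE,
                 col.names = paste0("V", seq_len(max(n))))
```

If every data line reports the same count, the delimiter or quoting is a more likely culprit than the file size.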

# if you just want the data and not the header information 

x<-read.table("1429271308686.txt",comment.char=">") 

# in case all else fails, my somewhat cumbersome solution... 
x<-scan("1429271308686.txt",what="character") # read every whitespace-separated token as a string

# extract the lengths, ind1 has all the lengths 
ind1<-x=="=" 
ind1<-c(ind1[length(ind1)],ind1[-length(ind1)]) # take the value that comes after "=" 
cumsum(ind1) 
lengths<-as.numeric(x[ind1])[c(TRUE,FALSE)] # only want one of the lengths 

# remove the unwanted characters 
ind2<-x==">" 
ind2<-c(ind2[length(ind2)],ind2[-length(ind2)]) # take the value that comes after ">" 

ind3<-x==">"|x=="Len"|x=="="|x=="Reverse" 
dat<-as.numeric(x[!(ind1|ind2|ind3)]) # remove the unwanted 

# arrange as matrix 
mat<-matrix(dat,length(dat)/4,4,byrow=TRUE) 

# the number of rows for each block 
block<-(c(1:length(x))[duplicated(cumsum(!ind2))][c(FALSE,TRUE)]-c(1:length(x))[duplicated(cumsum(!ind2))][c(TRUE,FALSE)]-5)/4 

# the number for each block 
id<-as.numeric(x[ind2])[c(TRUE,FALSE)] 

# new vector 
mat<-cbind(rep(id,block),mat) # note, this assumes that the last line is again "> Reverse"

Source

2015-04-17 13:00:55 steinbock

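Thanks, but there is a problem! We also need the reverse tables. In the example I uploaded, all the reverse tables are empty, but some of them do have records! I have updated the small example in my question, please take a look – ameerosein

The second problem is that if any reverse table is not empty, all the indices from that point to the end come out wrong... – ameerosein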
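A possible way around the index drift described in these comments is to assign every data row to the most recent header with cumsum, which does not depend on whether a reverse table is empty. This is a sketch on the question's sample lines, not a fix to the code above; the `id` and `reverse` column names are my own, and the final empty header is made up.

```r
lns <- c(">1 Len = 254", "13 112 1 18", "15 112 1 30", "22 11 3 25",
         ">1 Reverse Len = 254", "14 11 1 15",
         ">2 Len = 186", "19 15 2 34", "25 11 3 25",
         ">2 Reverse Len = 186")                 # empty table (made up)

is_hdr <- grepl("^>", lns)
block  <- cumsum(is_hdr)[!is_hdr]                # header index of each data row
hdr    <- lns[is_hdr][block]                     # header text of each data row

mat <- cbind(id      = as.numeric(sub("^> *([0-9]+).*$", "\\1", hdr)),
             reverse = as.numeric(grepl("Reverse", hdr, fixed = TRUE)),
             as.matrix(read.table(text = lns[!is_hdr])))
```

Rows from a non-empty reverse table get `reverse` equal to 1, and an empty table simply contributes no rows.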
