Parsing a directory of XML files with R

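I am parsing a directory of XML files downloaded from ClinicalTrials.gov and cannot extract the data. I can do this for a single file (NCT00006435.xml below), but I can't figure out how to do it for multiple files.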

library(XML) 
# Download ct.gov query and extract xml files 
ct<-tempfile() 
dir.create("ctdir") 
url<-"https://clinicaltrials.gov/search?term=neurofibromatosis-type-1&studyxml=true" 
download.file(url, ct) 
unzip(ct, exdir="ctdir") 
files<-list.files("ctdir") 
# Change the working directory so we don't have to worry about the filepath 
setwd("ctdir") 

# Extract data from one file and get it into a data frame 
#xmlfile<-xmlTreeParse("NCT00006435.xml") 
#xmltop<-xmlRoot(xmlfile) 
#tags<-xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue)) 
#tags_df<-data.frame(t(tags),row.names=NULL) 

# Extract data from each file and get it into a data frame 
xmlfiles<-lapply(files,function(x) xmlTreeParse(x)) 
xmltop<-lapply(xmlfiles,function(x) xmlRoot(x)) 
tags<-???

How do I run through the list of files, looping over every tag in each file?

Source

2016-03-03 user1357079

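You need to actually download the individual files. 'xmlTreeParse()' operates on a local file to extract the XML. Currently, I believe 'files' just contains a list of matching file names as they appear on the server.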

Also, 'xmlTreeParse()' does not automatically convert to a data frame; that requires 'xmlToDataFrame()'. Posting a sample of the XML would be helpful. – Parfait

Arrgh. 'object.size(xmltop) # 40 196 696 bytes'. Can we have a "minimal" example? And what is your understanding of what 'tags' means?

The top of str(xmltop) looks like:

List of 107 
$ :List of 40 
    ..$ comment    : Named list() 
    .. ..- attr(*, "class")= chr [1:5] "XMLCommentNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    ..$ required_header  :List of 3 
    .. ..$ download_date:List of 1 
    .. .. ..$ text: Named list() 
    .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass" 
    .. ..$ link_text :List of 1 
    .. .. ..$ text: Named list() 
    .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass" 
    .. ..$ url   :List of 1 
    .. .. ..$ text: Named list() 
    .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass" 
    .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass" 
    ..$ id_info    :List of 4 
    .. ..$ org_study_id:List of 1 
    .. .. ..$ text: Named list() 
    .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"

So it is a list, and you can "loop" over its top level with a simple lapply. If you want to reuse your single-node code, it is just:

tags<-lapply(xmltop, function(x) xmlSApply(x, xmlValue)) 
object.size(tags) 
1618008 bytes

That is still a rather unwieldy object. I repeat my suggestion that you find a more manageable example.
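A more manageable example along these lines might look like the sketch below. It is not the original poster's data: the two inline documents, the root tag 'study', and their contents are made up for illustration, and 'asText = TRUE' is used so that no files are needed. It only shows the same lapply pattern on something small enough to inspect.

```r
library(XML)

# Two tiny made-up documents standing in for the downloaded files
# (hypothetical content, much smaller than real ClinicalTrials.gov records)
docs <- c(
  "<study><id_info><org_study_id>A-1</org_study_id></id_info><keyword>NF1</keyword></study>",
  "<study><id_info><org_study_id>B-2</org_study_id></id_info><keyword>glioma</keyword></study>"
)

# Same three steps as in the question, applied to every document
xmlfiles <- lapply(docs, function(x) xmlTreeParse(x, asText = TRUE))
xmltop   <- lapply(xmlfiles, xmlRoot)

# One named character vector per document: the text of each top-level child
tags <- lapply(xmltop, function(x) xmlSApply(x, xmlValue))

str(tags)
```

With only two or three top-level children per document, 'str(tags)' stays readable, which makes it easier to decide which tags are actually needed before running the full set of files.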

Source

2016-03-03 02:13:44

Just wrap your code in a function.

tags_df <- function(file){ 
    message("Loading ", file) 
    #your code 
    xmlfile<-xmlTreeParse(file) 
    xmltop<-xmlRoot(xmlfile) 
    tags_l<-xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue)) 
    tags<-data.frame(t(tags_l),row.names=NULL) 
    tags 
} 

tags<- lapply(files, tags_df)

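Because tags such as location and keyword are one-to-many, combining the data.frames returns the mess below, with 260 columns including location.1 through location.120. I would replace your code with a few specific XPath queries that pull just the tags you actually want into an understandable format.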

library(plyr)  # provides ldply()
x <- ldply(tags, "data.frame")
names(x)
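The XPath suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the two mock files, the directory name 'ct_sketch', the root tag 'clinical_study', and the helper names 'collapse_or_na' and 'parse_study' are all hypothetical, and the XPath expressions would need to be adjusted to the tags actually present in the downloaded files. It uses 'xmlParse()' rather than 'xmlTreeParse()' because XPath queries need an internal document.

```r
library(XML)

# Write two mock studies to a temporary directory (hypothetical content)
xml_dir <- file.path(tempdir(), "ct_sketch")
dir.create(xml_dir, showWarnings = FALSE)
writeLines(
  "<clinical_study><id_info><org_study_id>A-1</org_study_id></id_info><keyword>NF1</keyword><keyword>glioma</keyword></clinical_study>",
  file.path(xml_dir, "study1.xml")
)
writeLines(
  "<clinical_study><id_info><org_study_id>B-2</org_study_id></id_info></clinical_study>",
  file.path(xml_dir, "study2.xml")
)

# Hypothetical helper: collapse all matches of an XPath query into one string,
# or return NA when the tag is absent, so every file yields exactly one row
collapse_or_na <- function(doc, path) {
  vals <- xpathSApply(doc, path, xmlValue)
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = "; ")
}

# Hypothetical helper: one file in, one single-row data frame out
parse_study <- function(file) {
  doc <- xmlParse(file)  # internal document, as needed for XPath
  data.frame(
    file         = basename(file),
    org_study_id = collapse_or_na(doc, "//id_info/org_study_id"),
    keywords     = collapse_or_na(doc, "//keyword"),
    stringsAsFactors = FALSE
  )
}

# full.names = TRUE avoids having to change the working directory
files   <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)
studies <- do.call(rbind, lapply(files, parse_study))
studies
```

Collapsing repeated tags into one delimited string is just one possible design; keeping one-to-many tags such as location in a separate long table keyed by study ID would be another.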

Source

2016-03-03 18:44:29
