2015-11-02 45 views
0

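How do I access the values of child nodes with different names in an XML file?

I want to parse the xmlValue of certain child nodes from an NCBI XML file. However, for some PMIDs the root node <PubmedArticleSet> holds different kinds of records depending on the publication type: PubmedBookArticle or PubmedArticle. I want to apply a condition: if (xmlName(fetch.pubmed) == "PubmedBookArticle") extract certain values, else if (xmlName(fetch.pubmed) == "PubmedArticle") extract other values. Finally, I want to build a data frame in which both sets of values are matched to their PMIDs. This looks simple, but xmlName(fetch.pubmed) throws the error no applicable method for 'xmlName' applied to an object of class "c('XMLInternalDocument', 'XMLAbstractDocument')". Any help is appreciated, thanks.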

<?xml version="1.0"?> 
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2015//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_150101.dtd"> 
<PubmedArticleSet> 
    <PubmedBookArticle> 
    <BookDocument> 
     <PMID Version="1">25506969</PMID> 
     <ArticleIdList> 
     <ArticleId IdType="bookaccession">NBK259188</ArticleId> 
     </ArticleIdList> .... 

    ...... </BookDocument> 
    </PubmedBookArticle> 

    <PubmedArticle> 
    <MedlineCitation Status="Publisher" Owner="NLM"> 
     <PMID Version="1">25013473</PMID> 
     <DateCreated> 
     <Year>2014</Year> 
     <Month>7</Month> 
     <Day>11</Day> 
     </DateCreated>.... 

    ....</MedlineCitation> 
    </PubmedArticle> 
</PubmedArticleSet> 

My code is as follows:

library(XML) 
library(rentrez) 

PM.ID <- c("25506969", "25032371", "24983039", "24983034", "24983032", "24983031",
           "26386083", "26273372", "26066373", "25837167",
           "25466451", "25013473")
# rentrez function to retrieve the XML file for the PMIDs above
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = TRUE)
# If empty records, return NA 
FindNull <- function(x, x1child) {
    res <- xpathSApply(x, x1child, xmlValue)
    if (length(res) == 0) {
        out <- NA
    } else {
        out <- res
    }
    out
}

# extract contents from the XML file
xpathSApply(fetch.pubmed, "//PubmedArticle", FindNull, x1child = './/ArticleTitle')

xpathSApply(fetch.pubmed, "//PubmedBookArticle", FindNull, x1child = './/BookTitle')

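How can I put the code above in a loop so that, for each record, the values are retrieved from PubmedArticle or PubmedBookArticle depending on which condition is met?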

Answers

1

There are a few ways to do this, but I would probably get separate node sets for books and articles.

table(xpathSApply(fetch.pubmed, "/PubmedArticleSet/*", xmlName))

    PubmedArticle PubmedBookArticle
                6                 6

books <- getNodeSet(fetch.pubmed, "/PubmedArticleSet/PubmedBookArticle") 

data.frame(pmid = sapply(books, function(x) xpathSApply(x, ".//PMID", xmlValue)), 
      title = sapply(books, function(x) xpathSApply(x, ".//BookTitle", xmlValue)) 
) 

     pmid                          title 
1 25506969              Probe Reports from the NIH Molecular Libraries Program 
2 25032371              Understanding Climate’s Influence on Human Evolution 
3 24983039 Assessing the Effects of the Gulf of Mexico Oil Spill on Human Health: A Summary of the June 2010 Workshop 
4 24983034             In the Light of Evolution: Volume IV: The Human Condition 
5 24983032           The Role of Human Factors in Home Health Care: Workshop Summary 
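
The same node-set approach can be repeated for articles and the two results stacked into one data frame keyed by PMID, which is what the question ultimately asks for. The following is only a minimal sketch under stated assumptions: it runs on a made-up toy document (toy PMIDs and titles, not real records) instead of `fetch.pubmed`, and `first_or_na` and `records_df` are hypothetical helper names, not functions from the XML package.

```r
library(XML)

# Toy document with made-up PMIDs and titles, one record of each type.
toy.xml <- '<PubmedArticleSet>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">111</PMID>
      <Book><BookTitle>A toy book</BookTitle></Book>
    </BookDocument>
  </PubmedBookArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article><ArticleTitle>A toy article</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(toy.xml, asText = TRUE)

# Hypothetical helper: first match of a relative XPath under `node`, or NA.
first_or_na <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

# Hypothetical helper: one data frame for all records of a given type.
records_df <- function(doc, record.path, type, pmid.path, title.path) {
  nodes <- getNodeSet(doc, record.path)
  data.frame(
    pmid  = vapply(nodes, first_or_na, character(1), path = pmid.path),
    type  = rep(type, length(nodes)),
    title = vapply(nodes, first_or_na, character(1), path = title.path),
    stringsAsFactors = FALSE
  )
}

books <- records_df(doc, "/PubmedArticleSet/PubmedBookArticle", "book",
                    "./BookDocument/PMID", ".//BookTitle")
articles <- records_df(doc, "/PubmedArticleSet/PubmedArticle", "article",
                       "./MedlineCitation/PMID", ".//Article/ArticleTitle")

# Stack both types; each row keeps the PMID read from its own record.
all.records <- rbind(books, articles)
print(all.records)
```

Reading the PMID from `./BookDocument/PMID` or `./MedlineCitation/PMID` rather than `.//PMID` is a deliberate choice in this sketch: it returns exactly one value per record even if other PMID elements occur deeper in the record.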

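Thanks, Chris. This definitely helps. I think extracting books and articles separately, as you suggest, makes more sense. I tried a for loop, and it only slowed things down and complicated the process. – user5249203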


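Sometimes you can search for two different names with a vector of paths, as in `xpathSApply(fetch.pubmed, c("//BookTitle", "//ArticleTitle"), xmlValue)`, but here the first record has both a BookTitle and an ArticleTitle, so it is easier to work with the nodes.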


Or `xpathSApply(fetch.pubmed, c("//BookTitle", "//Article/ArticleTitle"), xmlValue)`
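
Working with the nodes also makes the per-record condition from the question possible. The error in the question arises because `xmlName()` is applied to the parsed document rather than to a node; applied to the root or to each record node it works. Below is a minimal sketch of that idea, assuming a made-up toy document (toy PMIDs and titles) in place of `fetch.pubmed`; `first_or_na` and `extract_record` are hypothetical helper names.

```r
library(XML)

# Toy document with made-up PMIDs and titles, one record of each type.
toy.xml <- '<PubmedArticleSet>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">111</PMID>
      <Book><BookTitle>A toy book</BookTitle></Book>
    </BookDocument>
  </PubmedBookArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article><ArticleTitle>A toy article</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(toy.xml, asText = TRUE)

# xmlName() needs a node: the root node works where the document does not.
root.name <- xmlName(xmlRoot(doc))

# Element children of the root, i.e. one node per record.
records <- getNodeSet(doc, "/PubmedArticleSet/*")

# Hypothetical helper: first match of a relative XPath under `node`, or NA.
first_or_na <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

# Hypothetical helper: choose the XPaths according to the record's name.
extract_record <- function(node) {
  kind <- xmlName(node)
  if (kind == "PubmedBookArticle") {
    pmid  <- first_or_na(node, "./BookDocument/PMID")
    title <- first_or_na(node, ".//BookTitle")
  } else if (kind == "PubmedArticle") {
    pmid  <- first_or_na(node, "./MedlineCitation/PMID")
    title <- first_or_na(node, ".//Article/ArticleTitle")
  } else {
    # Unknown record type: keep the row but leave the fields empty.
    pmid  <- NA_character_
    title <- NA_character_
  }
  data.frame(pmid = pmid, type = kind, title = title,
             stringsAsFactors = FALSE)
}

# One row per record, in document order.
result <- do.call(rbind, lapply(records, extract_record))
print(result)
```

For large result sets, building one data frame per record and binding them is slower than the vectorised node-set approach, so this form is mainly useful when the branches differ in more than a couple of paths.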

0
  • The NCBI XML paths below help extract abstracts from both PubmedArticle and PubmedBookArticle records, returning NA for records without an abstract.

    abstracts <- xpathSApply(fetch.pubmed, c('//PubmedArticle//Article',
                                             '//PubmedBookArticle//Abstract'),
                             function(x) {
                                 xmlValue(xmlChildren(x)$Abstract)
                             })
    abstracts <- data.frame(abstracts, stringsAsFactors = FALSE)
    dim(abstracts)
    rownames(abstracts) <- PM.ID
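
Assigning `PM.ID` as row names relies on the records coming back in the same order as the requested IDs. A variant that reads each PMID from its own record avoids that dependency. The sketch below is illustrative only: it uses a made-up toy document (toy PMIDs and abstracts) instead of `fetch.pubmed`, it assumes that `Abstract` sits directly under `Article` for articles and directly under `BookDocument` for books, and `first_or_na` is a hypothetical helper name.

```r
library(XML)

# Toy document with made-up values: an article with an abstract,
# an article without one, and a book with an abstract.
toy.xml <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
      <Article><Abstract><AbstractText>Toy article abstract.</AbstractText></Abstract></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">222</PMID>
      <Article><ArticleTitle>No abstract here</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedBookArticle>
    <BookDocument>
      <PMID Version="1">333</PMID>
      <Abstract><AbstractText>Toy book abstract.</AbstractText></Abstract>
    </BookDocument>
  </PubmedBookArticle>
</PubmedArticleSet>'
doc <- xmlParse(toy.xml, asText = TRUE)

# One node per record, regardless of record type.
records <- getNodeSet(doc, "/PubmedArticleSet/*")

# Hypothetical helper: first match of a relative XPath under `node`, or NA.
first_or_na <- function(node, path) {
  res <- xpathSApply(node, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

# The XPath union (|) covers both record layouts in a single expression.
abstracts <- data.frame(
  pmid = vapply(records, first_or_na, character(1),
                path = "./MedlineCitation/PMID | ./BookDocument/PMID"),
  abstract = vapply(records, first_or_na, character(1),
                    path = "./MedlineCitation/Article/Abstract | ./BookDocument/Abstract"),
  stringsAsFactors = FALSE
)
print(abstracts)
```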
    