2016-07-28

Parsing an XML file into a data frame in R

<HealthData locale="en_US"> 
<ExportDate value="2016-06-02 14:05:23 -0400"/> 
<Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/> 
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/> 
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/> 
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/> 
</HealthData> 

> library(XML) 
> doc="\\pathtoXMLfile" 
> list <-xpathApply(doc, "//HealthData/Record", xmlAttrs) 
> df <- do.call(rbind.data.frame, list) 
> str(df) 

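I'm trying to take the XML data sample shown above and load it into a data frame in R, with the attribute names of each Record (i.e. type, sourceName, unit, endDate, value) as the column headers and the attribute values of each record (i.e. count, 2014-09-24 15:07:11 -0400, 7) as the values of each row in the data frame.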

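df <- do.call(rbind.data.frame, list) gets close, but it also appears to bind all of the values into the column headers. If you run View(df) or str(df) you will see what I mean. How can I use the Record attribute names as the column header names?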

Thanks, Ryan

Answers


Consider xpathSApply() to retrieve the attributes, then transpose the result with t() and convert it to a data frame:

library(XML) 

xmlstr <- '<?xml version="1.0" encoding="UTF-8"?> 
      <HealthData locale="en_US"> 
       <ExportDate value="2016-06-02 14:05:23 -0400"/> 
       <Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/> 
       <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/> 
       <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/> 
       <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/> 
      </HealthData>' 

xml <- xmlParse(xmlstr) 

recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record", xmlAttrs) 
df <- data.frame(t(recordAttribs)) 
df 

#        type    sourceName unit 
# 1 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count 
# 2 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count 
# 3 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count 
#    creationDate     startDate     endDate 
# 1 2014-10-02 08:30:17 -0400 2014-09-24 15:07:06 -0400 2014-09-24 15:07:11 -0400 
# 2 2014-10-02 08:30:17 -0400 2014-09-24 15:12:13 -0400 2014-09-24 15:12:18 -0400 
# 3 2014-10-02 08:30:17 -0400 2014-09-24 15:17:16 -0400 2014-09-24 15:17:21 -0400 
# value 
# 1  7 
# 2 15 
# 3 20 
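
Every column extracted this way holds text, because XML attributes are strings. Below is a minimal sketch of converting them afterwards, assuming the column names and timestamp layout shown in the output above. The small `records` data frame is a hand-typed stand-in for `df` and is not part of the original answer:

```r
# Hand-built stand-in for the data frame produced above (illustrative only)
records <- data.frame(
  startDate = c("2014-09-24 15:07:06 -0400", "2014-09-24 15:12:13 -0400"),
  endDate   = c("2014-09-24 15:07:11 -0400", "2014-09-24 15:12:18 -0400"),
  value     = c("7", "15"),
  stringsAsFactors = FALSE
)

# as.character() guards against factor columns (the data.frame() default before R 4.0)
records$value <- as.numeric(as.character(records$value))

# %z parses the "-0400" UTC offset; tz = "UTC" only sets how the result is displayed
fmt <- "%Y-%m-%d %H:%M:%S %z"
records$startDate <- as.POSIXct(as.character(records$startDate), format = fmt, tz = "UTC")
records$endDate   <- as.POSIXct(as.character(records$endDate),   format = fmt, tz = "UTC")

str(records)
```

With real date-time columns, durations such as `records$endDate - records$startDate` can be computed directly.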

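If some attributes appear in some records but not in others, consider matching each record against a predetermined list of names and filling in NAs for the missing ones. Below are two versions using sapply(): one with a for loop and one that passes the names as a second argument: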

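recordnames <- c("type", "unit", "sourceName", "device", "sourceVersion",
                 "creationDate", "startDate", "endDate", "value")

# NOTE: run ONE of the two versions below, not both. Each expects recordAttribs
# to be the list of named vectors that xpathSApply() returns when records
# have differing numbers of attributes.

# FOR LOOP VERSION
recordAttribs <- sapply(recordAttribs, function(i) {
    for (r in recordnames) {
        # a missing name yields NA, never NULL, so test names(i) instead of is.null()
        if (!(r %in% names(i))) i[r] <- NA
    }
    i <- i[recordnames] # REORDER INNER VECTORS
    return(i)
})

# TWO ARGUMENT SAPPLY VERSION
recordAttribs <- sapply(recordAttribs, function(i, r) {
    i <- i[r]      # REORDER INNER VECTORS; MISSING NAMES BECOME NA
    names(i) <- r  # RESTORE NAMES LOST ON THE NA ENTRIES
    return(i)
}, recordnames)

df <- data.frame(t(recordAttribs))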
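
The padding idea can be tried without any XML at all. The following is a minimal sketch on two hand-made attribute vectors (invented for illustration: one with 7 attributes, one with 9). It relies on the fact that indexing a named character vector by a name it lacks returns NA:

```r
recordnames <- c("type", "unit", "sourceName", "device", "sourceVersion",
                 "creationDate", "startDate", "endDate", "value")

# Made-up stand-ins for what xmlAttrs() returns per <Record> node
attribs <- list(
  c(type = "HKQuantityTypeIdentifierStepCount", sourceName = "iPhone",
    unit = "count", creationDate = "2014-10-02 08:30:17 -0400",
    startDate = "2014-09-24 15:07:06 -0400",
    endDate = "2014-09-24 15:07:11 -0400", value = "7"),
  c(type = "HKQuantityTypeIdentifierStepCount", sourceName = "Watch",
    sourceVersion = "2.0", device = "hypothetical device string",
    unit = "count", creationDate = "2016-01-01 09:00:00 -0400",
    startDate = "2016-01-01 08:00:00 -0400",
    endDate = "2016-01-01 08:00:05 -0400", value = "12")
)

# Pad and reorder each vector to the full set of names
padded <- lapply(attribs, function(v) {
  out <- v[recordnames]       # names absent from v yield NA
  names(out) <- recordnames   # restore the names lost on the NA entries
  out
})

# One row per record, one column per attribute
padded_df <- data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
padded_df
```

Binding rows with do.call(rbind, ...) avoids the t() step, since each padded vector already becomes one row.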

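Thanks, it worked perfectly on the test data I provided. When I went back and tried to apply it to the full data set, I realized that some records have 9 columns rather than 7, so it doesn't work there. Any ideas? –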


Do you want to keep only the common attributes, or all of them? Do you know in advance which attributes you want to keep? – Parfait


Yes, I want to keep all 9 rows in the vectors that have them, and just have NAs in the vectors with only 7 rows. –


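Another option is xmlAttrsToDataFrame, which should handle missing attributes. You can also select only the tags that have a specific attribute, such as device: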

XML:::xmlAttrsToDataFrame(xml["//Record"]) 
XML:::xmlAttrsToDataFrame(xml["//Record[@device]"]) 
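
The `[@device]` predicate in the second call is plain XPath: it keeps only the `Record` nodes that carry a `device` attribute. Here is a small sketch of the same selection using exported functions of the XML package (`xmlParse()`, `getNodeSet()`, `xmlGetAttr()`). The two-record document and its `device` value are invented for illustration:

```r
library(XML)

# Tiny made-up document: only the second record has a device attribute
sample_xml <- '<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count" value="7"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count" value="12" device="hypothetical device"/>
</HealthData>'
sample_doc <- xmlParse(sample_xml, asText = TRUE)

# Records with and without the attribute
with_device    <- getNodeSet(sample_doc, "//Record[@device]")
without_device <- getNodeSet(sample_doc, "//Record[not(@device)]")

length(with_device)
length(without_device)

# Pull a single attribute from each matching node
sapply(with_device, xmlGetAttr, "value")
```

Counting both node sets is a quick way to see how many records in a full export would need NA padding.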

This works great too. Thanks! –