Parsing an XML file into a data frame in R

XML data:

<HealthData locale="en_US"> 
<ExportDate value="2016-06-02 14:05:23 -0400"/> 
<Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/> 
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/> 
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/> 
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/> 
</HealthData>

> library(XML) 
> doc <- xmlParse("\\pathtoXMLfile")  # parse the file first; xpathApply() needs a parsed document, not a path string 
> list <-xpathApply(doc, "//HealthData/Record", xmlAttrs) 
> df <- do.call(rbind.data.frame, list) 
> str(df)

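I am trying to take the XML data sample shown above and load it into a data frame using the R code shown, with the attribute names of each Record (type, sourceName, unit, endDate, value) as the column headers and the attribute values of each Record (for example count, 2014-09-24 15:07:11 -0400, 7) as the values of each row.

df <- do.call(rbind.data.frame, list) gets close, but it looks like it also binds all the values into the column headers. If you run View(df) or str(df) you will see what I mean. How can I use the Record attribute names as the column header names?

Thanks, Ryan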

Source

2016-07-28 Ryan Praskievicz

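Consider using xpathSApply() to retrieve the attributes. When every record has the same attributes it returns a matrix with one column per record, so transpose it with t() before converting it to a data frame: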

library(XML) 

xmlstr <- '<?xml version="1.0" encoding="UTF-8"?> 
      <HealthData locale="en_US"> 
       <ExportDate value="2016-06-02 14:05:23 -0400"/> 
       <Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/> 
       <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/> 
       <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/> 
       <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/> 
      </HealthData>' 

xml <- xmlParse(xmlstr) 

recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record", xmlAttrs) 
df <- data.frame(t(recordAttribs)) 
df 

#        type    sourceName unit 
# 1 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count 
# 2 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count 
# 3 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count 
#    creationDate     startDate     endDate 
# 1 2014-10-02 08:30:17 -0400 2014-09-24 15:07:06 -0400 2014-09-24 15:07:11 -0400 
# 2 2014-10-02 08:30:17 -0400 2014-09-24 15:12:13 -0400 2014-09-24 15:12:18 -0400 
# 3 2014-10-02 08:30:17 -0400 2014-09-24 15:17:16 -0400 2014-09-24 15:17:21 -0400 
# value 
# 1  7 
# 2 15 
# 3 20
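
For readers who want to stay close to the do.call() approach from the question, here is a minimal base-R sketch. It assumes the parsed attributes arrive as a list of named character vectors, which is what xmlAttrs returns per node. The `records` list is hand-written stand-in data with a shortened set of attributes, not output parsed from the file, and the numeric conversion at the end is an optional extra step that is not part of the answer above.

```r
# Hand-written stand-in for the per-record output of xmlAttrs():
# a list of named character vectors (assumed data, not read from a file)
records <- list(
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count", value = "7"),
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count", value = "15"),
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count", value = "20")
)

# rbind() on named vectors builds a character matrix; the column names
# are taken from the names of the first vector
m <- do.call(rbind, records)

# Convert to a data frame, keeping strings as character rather than factor
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Attribute values arrive as text, so convert numeric columns explicitly
df$value <- as.numeric(df$value)

str(df)
```

This only lines up correctly when every record carries the same attributes in the same order; records with differing attributes need to be padded first.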

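If some attributes appear in some records but not in others, consider matching each record against a predetermined list of names and filling the missing ones with NA. Below are two versions: one using sapply() with a for loop, and one that passes the names to sapply() as a second argument: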

recordnames <- c("type", "unit", "sourceName", "device", "sourceVersion", 
                 "creationDate", "startDate", "endDate", "value") 

# xpathApply() ALWAYS RETURNS A LIST, EVEN WHEN RECORDS DIFFER IN LENGTH 
recordList <- xpathApply(xml, "//HealthData/Record", xmlAttrs) 

# FOR LOOP VERSION 
recordAttribs <- sapply(recordList, function(i) { 
    for (r in recordnames) { 
        # A MISSING NAME INDEXES TO NA, NEVER NULL, SO TEST names() INSTEAD 
        if (!(r %in% names(i))) i[r] <- NA 
    } 
    i[recordnames]  # REORDER INNER VECTORS 
}) 

# TWO ARGUMENT SAPPLY VERSION (ALTERNATIVE TO THE LOOP ABOVE) 
recordAttribs <- sapply(recordList, function(i, r) { 
    i[setdiff(r, names(i))] <- NA  # ADD MISSING ATTRIBUTES AS NA 
    i[r]                           # REORDER INNER VECTORS 
}, recordnames) 

df <- data.frame(t(recordAttribs))
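
The NA-filling idea can also be tried without any XML input. The following is a minimal sketch, assuming a hand-written list of named character vectors in place of the xmlAttrs() results; `records`, `wanted` and `padded` are illustrative names that do not come from the answer. It relies on the base-R behaviour that indexing a named vector by an absent name yields NA.

```r
# Assumed stand-in data: the second record carries an extra "device" attribute
records <- list(
  c(type = "StepCount", unit = "count", value = "7"),
  c(type = "StepCount", unit = "count", device = "iPhone", value = "15")
)

# Full set of columns wanted in the result, in display order
wanted <- c("type", "unit", "device", "value")

# Indexing by name returns NA for absent attributes; unname() drops the
# per-vector names so the columns can be labelled consistently afterwards
padded <- lapply(records, function(a) unname(a[wanted]))

# Stack the equal-length vectors into a matrix and label the columns
m <- do.call(rbind, padded)
colnames(m) <- wanted

df <- as.data.frame(m, stringsAsFactors = FALSE)
print(df)
```

Because every vector is padded to the same length and order before binding, records with 7 attributes and records with 9 attributes end up in the same columns.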

Source

2016-07-28 22:26:24 Parfait

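Thanks, it works perfectly on the test data I provided. When I went back and tried to apply it to the full data set, I realized that some records have 9 attributes rather than 7, so it does not work there. Any ideas?

Do you want to keep only the common attributes, or all of them? Do you know in advance which attributes you want to keep? – Parfait

Yes, I want to keep all 9 attributes, with NAs filled in for the records that only have 7.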

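Another option is xmlAttrsToDataFrame, an unexported function in the XML package (hence the ::: below), which should handle missing attributes. You can also select only the tags that have a specific attribute, such as device: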

XML:::xmlAttrsToDataFrame(xml["//Record"]) 
XML:::xmlAttrsToDataFrame(xml["//Record[@device]"])

Source

2016-08-01 16:26:59

This works great too. Thanks!
