2017-02-25

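Parsing XML in R, struggling with the "differing number of rows" error

I suspect this question has been asked before, but I could not find anything after searching. I am new to parsing XML documents. I am trying to parse an XML page that looks like this: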

schedule = xmlParse("MYXML.XML") 

# here's what schedule looks like 
<all-games>
    <game-schedule>
        <team name="Knicks"/>
        <outcome winner="OtherTeam"/>
    </game-schedule>
    <game-schedule>
        <team name="Lakers"/>
        <outcome winner="HomeTeam"/>
    </game-schedule>
    <game-schedule>
        <team name="Celtics"/>
    </game-schedule>
</all-games>


# here's my code to parse the XML 
my_df = data.frame(
    team = sapply(schedule["//game-schedule/team/@name"], as, "character"), 
    winner = sapply(schedule["//game-schedule/outcome/@winner"], as, "character") 
) 

With this code I get the following error, which is expected because the third game has no outcome:

Error in data.frame(Visitor = sapply(schedule["//game-schedule/team/@name"], : 
arguments imply differing number of rows: 3, 2 
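
The mismatch can be confirmed by counting what each query returns. The snippet below is only a sketch: it assumes the XML package is installed, and `xml_text` is a hypothetical inline copy of the document (with the empty elements self-closed) standing in for MYXML.XML.

```r
library(XML)

# Hypothetical inline copy of the schedule, used instead of MYXML.XML
xml_text <- '<all-games>
  <game-schedule><team name="Knicks"/><outcome winner="OtherTeam"/></game-schedule>
  <game-schedule><team name="Lakers"/><outcome winner="HomeTeam"/></game-schedule>
  <game-schedule><team name="Celtics"/></game-schedule>
</all-games>'
schedule <- xmlParse(xml_text, asText = TRUE)

# Each query runs over the whole document, so a game without an <outcome>
# contributes nothing to the second result instead of contributing an NA.
n_teams   <- length(xpathSApply(schedule, "//game-schedule/team/@name"))
n_winners <- length(xpathSApply(schedule, "//game-schedule/outcome/@winner"))
c(teams = n_teams, winners = n_winners)
```

Because the two vectors are built independently, data.frame() has no way to know which game the missing winner belongs to.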

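I would like to parse this into a data frame in which the missing child is simply filled in as NA. That is, I am trying to get the following data frame: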

my_df
     team    winner
1  Knicks OtherTeam
2  Lakers  HomeTeam
3 Celtics        NA

The NA reflects that, according to the XML document, the game has not taken place yet.

Answer


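You need a wrapper around xpathSApply that returns NA when a tag is missing, something like xpath2 below. Then get the game-schedule nodes and apply xpath2 to each one, using a leading "." in the XPath so the search starts at the current node.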

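library(XML)

# Wrapper around xpathSApply: returns NA when the XPath matches nothing,
# otherwise the matched values collapsed into a single string
xpath2 <- function(x, ...) {
    y <- xpathSApply(x, ...)
    if (length(y) == 0) NA else paste(y, collapse = ", ")
}

# One node per game, then query each node relative to itself (".")
nd <- getNodeSet(schedule, "//game-schedule")
data.frame(
    team = sapply(nd, xpath2, ".//team", xmlGetAttr, "name"),
    winner = sapply(nd, xpath2, ".//outcome", xmlGetAttr, "winner")
)

     team    winner
1  Knicks OtherTeam
2  Lakers  HomeTeam
3 Celtics      <NA>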
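
To try the approach without the original file, here is a self-contained sketch. It assumes the XML package is installed. `xml_text` is a hypothetical inline copy of the document, and `attr_or_na` is an illustrative helper name, not part of the package. It differs from xpath2 in that it always returns a character value, so vapply can check the result type.

```r
library(XML)

# Hypothetical inline copy of the schedule, used instead of MYXML.XML
xml_text <- '<all-games>
  <game-schedule><team name="Knicks"/><outcome winner="OtherTeam"/></game-schedule>
  <game-schedule><team name="Lakers"/><outcome winner="HomeTeam"/></game-schedule>
  <game-schedule><team name="Celtics"/></game-schedule>
</all-games>'
schedule <- xmlParse(xml_text, asText = TRUE)

# Illustrative helper: attribute value(s) found under `node`, or NA if the
# path matches nothing. Multiple matches are collapsed into one string.
attr_or_na <- function(node, path, attr) {
  vals <- xpathSApply(node, path, xmlGetAttr, attr)
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = ", ")
}

# One node per game, so every game yields exactly one row
games <- getNodeSet(schedule, "//game-schedule")
my_df <- data.frame(
  team   = vapply(games, attr_or_na, character(1), path = "./team", attr = "name"),
  winner = vapply(games, attr_or_na, character(1), path = "./outcome", attr = "winner"),
  stringsAsFactors = FALSE
)
my_df
```

Iterating over the game-schedule nodes, rather than over the whole document, is what keeps the two columns the same length.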