2010-08-04
3

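Converting uneven hierarchical list to a data frame

I don't think this has been asked yet, but is there a way to combine the information in a list with multiple levels and an uneven structure into a data frame in "long" format?

Specifically: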

library(XML) 
library(plyr) 
xml.inning <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_chamlb_texmlb_1/inning/inning_5.xml" 
xml.parse <- xmlInternalTreeParse(xml.inning) 
xml.list <- xmlToList(xml.parse) 
## $top$atbat 
## $top$atbat$pitch 
##    des    id   type    x    y 
##   "Ball"   "310"    "B"   "70.39"  "125.20" 

The structure looks like this:

> llply(xml.list, function(x) llply(x, function(x) table(names(x)))) 
$top 
$top$atbat 
.attrs pitch 
    1  4 
$top$atbat 
.attrs pitch 
    1  4 
$top$atbat 
.attrs pitch 
    1  5 
$bottom 
$bottom$action 
    b des event  o pitch player  s 
    1  1  1  1  1  1  1 
$bottom$atbat 
.attrs pitch 
    1  5 
$bottom$atbat 
.attrs pitch 
    1  5 
$bottom$atbat 
.attrs pitch runner 
    1  5  1 
$bottom$atbat 
.attrs pitch runner 
    1  7  1 
$.attrs 
$.attrs$num 
character(0) 
$.attrs$away_team 
character(0) 
$.attrs$ 

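What I want is a data frame built from the named vectors in the pitch category, along with the hierarchy above them (top, atbat, bottom). Because the number of columns differs between levels, I would need to ignore the levels that don't fit into the data.frame. Something like this: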

  first  second third des    x
1 top    atbat  pitch Ball   70.29
2 top    atbat  pitch Strike 69.24
3 bottom atbat  pitch Out    67.22

Is there an elegant way of doing this? Thanks!

Related question: http://stackoverflow.com/questions/2067098/how-to-transform-xml-data-into-a-data-frame – apeescape 2010-08-05 19:00:31

Answers

5

I don't know about elegant, but this works. Those more familiar with plyr can probably provide a more general solution.

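cleanFun <- function(x) { 
    # Note: x[["atbat"]] returns only the first element named "atbat" 
    a <- x[["atbat"]] 
    # Stack that at-bat's pitch vectors into a character matrix 
    b <- do.call(rbind, a[names(a) == "pitch"]) 
    # Return the matrix as a data frame 
    as.data.frame(b) 
} 
# ldply() adds the .id column (top/bottom); keep the first five columns 
ldply(xml.list[c("top","bottom")], cleanFun)[,1:5] 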
     .id             des  id type      x
1    top            Ball 310    B  70.39
2    top   Called Strike 311    S 118.45
3    top   Called Strike 312    S  86.70
4    top In play, out(s) 313    X  79.83
5 bottom            Ball 335    B  15.45
6 bottom   Called Strike 336    S  77.25
7 bottom Swinging Strike 337    S  99.57
8 bottom            Ball 338    B 106.44
9 bottom In play, out(s) 339    X 134.76
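Note that `x[["atbat"]]` picks only the first element named `atbat`, which is why the output above has 4 rows for top and 5 for bottom (the first at-bat of each half). Below is a minimal base-R sketch of one way to collect the pitches from every at-bat. It is not the answerer's code: `toy`, `pitches_in_half` and the values inside `toy` are hypothetical stand-ins that only mimic the shape of `xml.list`, and the sketch assumes every pitch carries the same set of attributes.

```r
# Toy stand-in for xml.list (hypothetical values, same nesting as xmlToList() output)
toy <- list(
  top = list(
    atbat = list(pitch  = c(des = "Ball", id = "310", type = "B"),
                 pitch  = c(des = "Called Strike", id = "311", type = "S"),
                 .attrs = c(num = "1")),
    atbat = list(pitch  = c(des = "Ball", id = "314", type = "B"),
                 .attrs = c(num = "2"))
  ),
  bottom = list(
    action = c(b = "0", des = "Pitching change"),
    atbat  = list(pitch  = c(des = "Ball", id = "335", type = "B"),
                  runner = c(id = "000"),
                  .attrs = c(num = "3"))
  )
)

# Collect every pitch from every at-bat in one half-inning
pitches_in_half <- function(half) {
  atbats <- half[names(half) == "atbat"]   # all at-bats, not just the first
  # Keep only the "pitch" elements of each at-bat, then flatten one level
  pitches <- unlist(lapply(unname(atbats), function(ab) ab[names(ab) == "pitch"]),
                    recursive = FALSE)
  # Assumes all pitch vectors share the same names; rbind() would recycle otherwise
  as.data.frame(do.call(rbind, unname(pitches)), stringsAsFactors = FALSE)
}

# Prepend the hierarchy columns and stack the two halves
out <- do.call(rbind, lapply(c("top", "bottom"), function(h) {
  data.frame(first = h, second = "atbat", third = "pitch",
             pitches_in_half(toy[[h]]), stringsAsFactors = FALSE)
}))
print(out)
```

The `action` and `runner` elements are dropped simply because they are filtered out by name, which matches the question's request to ignore levels that do not fit the data frame.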
1

The .id feature of ldply() is nice, but the .id columns seem to collide once you apply another ldply().

Here is a fairly general approach using rbind.fill():

aho <- ldply(llply(xml.list[[1]], function(x) ldply(x, function(x) rbind.fill(data.frame(t(x)))))) 
> aho[1:5,1:4] 
    .id              des id type 
1 pitch              Ball 310 B 
2 pitch            Called Strike 311 S 
3 pitch            Called Strike 312 S 
4 pitch           In play, out(s) 313 X 
5 .attrs Alexei Ramirez lines out to second baseman Ian Kinsler. <NA> <NA> 

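The .id from the second ldply() is lost because we already have an .id column. We can work around this by giving the first .id a different name, but that doesn't look consistent.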

aho2 <- ldply(llply(xml.list[[1]], function(x) { 
    out <- ldply(x, function(x) rbind.fill(data.frame(t(x)))) 
    names(out)[1] <- ".id2" 
    out 
})) 
> aho2[1:5,1:4] 
    .id .id2              des id 
1 atbat pitch              Ball 310 
2 atbat pitch            Called Strike 311 
3 atbat pitch            Called Strike 312 
4 atbat pitch           In play, out(s) 313 
5 atbat .attrs Alexei Ramirez lines out to second baseman Ian Kinsler. <NA>
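One way to sidestep the .id collision altogether is to walk the list recursively and carry the path of names along, so that each level ends up in its own column. The following is only a sketch in base R under stated assumptions, not part of the answer above: `toy`, `flatten_paths` and the derived objects are hypothetical names, the toy list merely imitates the shape of `xml.list`, and the final `rbind()` assumes all pitch vectors have the same attribute names.

```r
# Toy stand-in for xml.list (hypothetical, shaped like the xmlToList() output)
toy <- list(
  top = list(
    atbat = list(pitch  = c(des = "Ball", x = "70.39"),
                 pitch  = c(des = "Called Strike", x = "118.45"),
                 .attrs = c(num = "1"))
  ),
  bottom = list(
    action = c(des = "Pitching change", event = "Pitching Substitution"),
    atbat  = list(pitch  = c(des = "Ball", x = "15.45"),
                  runner = c(id = "000"),
                  .attrs = c(num = "2"))
  ),
  .attrs = c(num = "5")
)

# Recursively collect every non-list leaf together with the names leading to it
flatten_paths <- function(x, path = character(0)) {
  if (!is.list(x)) {
    return(list(list(path = path, values = x)))
  }
  out <- list()
  nms <- names(x)
  for (i in seq_along(x)) {
    out <- c(out, flatten_paths(x[[i]], c(path, nms[i])))
  }
  out
}

leaves <- flatten_paths(toy)

# Keep only the leaves whose last path element is "pitch"
pitch_leaves <- Filter(function(l) identical(tail(l$path, 1), "pitch"), leaves)

# One single-row data frame per pitch, with the path spread over three columns
rows <- lapply(pitch_leaves, function(l) {
  data.frame(first = l$path[1], second = l$path[2], third = l$path[3],
             t(l$values), stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
print(result)
```

If the pitch vectors did not all share the same names, passing `rows` to plyr's `rbind.fill()` instead of `do.call(rbind, rows)` should pad the missing columns with NA, in the same spirit as the code above.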