2016-12-30 35 views
4

根据数据框的行与tidyverse中的list-columns数据结构的不同,适合不同模型公式的最佳方法是什么?为列表数据框的每一行安装不同的模型

在R for Data Science中,Hadley提供了一个很好的例子,说明如何使用列列数据结构并轻松地适应多个模型(http://r4ds.had.co.nz/many-models.html#gapminder)。我正试图找到一种方式来适应许多具有稍微不同公式的模型。在下面的例子中,从他原来的例子改编而来,在每个大陆适合不同模型的最佳方法是什么?

library(gapminder) 
library(dplyr) 
library(tidyr) 
library(purrr) 
library(broom) 

by_continent <- gapminder %>% 
    group_by(continent) %>% 
    nest() 

by_continent <- by_continent %>% 
    mutate(model = map(data, ~lm(lifeExp ~ year, data = .))) 

by_continent %>% 
    mutate(glance=map(model, glance)) %>% 
    unnest(glance, .drop=T) 

## A tibble: 5 × 12 
# continent r.squared adj.r.squared  sigma statistic  p.value df 
#  <fctr>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <int> 
#1  Asia 0.4356350  0.4342026 8.9244419 304.1298 6.922751e-51  2 
#2 Europe 0.4984659  0.4970649 3.8530964 355.8099 1.344184e-55  2 
#3 Africa 0.2987543  0.2976269 7.6685811 264.9929 6.780085e-50  2 
#4 Americas 0.4626467  0.4608435 6.8618439 256.5699 4.354220e-42  2 
#5 Oceania 0.9540678  0.9519800 0.8317499 456.9671 3.299327e-16  2 
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, 
## deviance <dbl>, df.residual <int> 

我知道,因为它估计各大洲每个模型我可以通过by_continent(效率不高,迭代做到这一点:

formulae <- list(
    Asia=~lm(lifeExp ~ year, data = .), 
    Europe=~lm(lifeExp ~ year + pop, data = .), 
    Africa=~lm(lifeExp ~ year + gdpPercap, data = .), 
    Americas=~lm(lifeExp ~ year - 1, data = .), 
    Oceania=~lm(lifeExp ~ year + pop + gdpPercap, data = .) 
) 

for (i in 1:nrow(by_continent)) { 
    by_continent$model[[i]] <- map(by_continent$data, formulae[[i]])[[i]] 
} 

by_continent %>% 
    mutate(glance=map(model, glance)) %>% 
    unnest(glance, .drop=T) 

## A tibble: 5 × 12 
# continent r.squared adj.r.squared  sigma statistic  p.value df 
#  <fctr>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl> <int> 
#1  Asia 0.4356350  0.4342026 8.9244419 304.1298 6.922751e-51  2 
#2 Europe 0.4984677  0.4956580 3.8584819 177.4093 3.186760e-54  3 
#3 Africa 0.4160797  0.4141991 7.0033542 221.2506 2.836552e-73  3 
#4 Americas 0.9812082  0.9811453 8.9703814 15612.1901 4.227928e-260  1 
#5 Oceania 0.9733268  0.9693258 0.6647653 243.2719 6.662577e-16  4 
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, 
## deviance <dbl>, df.residual <int> 

但有可能做到这一点,而不在基础跟随回圈R(和避免装修款,我不需要)

我想什么是这样的:?

by_continent <- by_continent %>% 
left_join(tibble::enframe(formulae, name="continent", value="formula")) 

by_continent %>% 
    mutate(model=map2(data, formula, est_model)) 

但我似乎无法想出一个可行的est_model函数。我想这个功能(H/T:https://gist.github.com/multidis/8138757)不起作用。

est_model <- function(data, formula, ...) { 
    mc <- match.call() 
    m <- match(c("formula","data"), names(mc), 0L) 
    mf <- mc[c(1L, m)] 
    mf[[1L]] <- as.name("model.frame") 
    mf <- eval(mf, parent.frame()) 
    data.st <- data.frame(mf) 

    return(data.st) 
} 

(诚然,这是一个人为的例子我的实际情况是,我有大量的观察遗漏在我的数据关键自变量,所以我想,以配合上完整的观察所有变量一个模型,另一个只在休息观测变量的一个子集。)

UPDATE

我想出了一个est_model功能的工作原理(尽管可能效率不高):

est_model <- function(data, formula, ...) { 
    map(list(data), formula, ...)[[1]] 
} 

by_continent <- by_continent %>% 
    mutate(model=map2(data, formula, est_model)) 

by_continent %>% 
    mutate(glance=map(model, glance)) %>% 
    unnest(glance, .drop=T) 

## A tibble: 5 × 12 
# continent r.squared adj.r.squared  sigma statistic  p.value df 
#  <chr>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl> <int> 
#1  Asia 0.4356350  0.4342026 8.9244419 304.1298 6.922751e-51  2 
#2 Europe 0.4984677  0.4956580 3.8584819 177.4093 3.186760e-54  3 
#3 Africa 0.4160797  0.4141991 7.0033542 221.2506 2.836552e-73  3 
#4 Americas 0.9812082  0.9811453 8.9703814 15612.1901 4.227928e-260  1 
#5 Oceania 0.9733268  0.9693258 0.6647653 243.2719 6.662577e-16  4 
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, 
## df.residual <int> 
+1

不知道为什么:(我觉得我已经做了我的研究,任何暗示,评论,指向正确的方向认识 –

回答

3

我觉得很容易制作一个模型公式列表。每个型号仅适用于相应的continent。我添加一个新的列formula到嵌套的数据,以确保formulacontinent是否在相同的顺序,如果他们不是。

formulae <- c(
    Asia= lifeExp ~ year, 
    Europe= lifeExp ~ year + pop, 
    Africa= lifeExp ~ year + gdpPercap, 
    Americas= lifeExp ~ year - 1, 
    Oceania= lifeExp ~ year + pop + gdpPercap 
) 

df <- gapminder %>% 
    group_by(continent) %>% 
    nest() %>% 
    mutate(formula = formulae[as.character(continent)]) %>% 
    mutate(model = map2(formula, data, ~ lm(.x, .y))) %>% 
    mutate(glance=map(model, glance)) %>% 
    unnest(glance, .drop=T) 

# # A tibble: 5 × 12 
# continent r.squared adj.r.squared  sigma statistic  p.value df  logLik  AIC  BIC 
#  <fctr>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl> <int>  <dbl>  <dbl>  <dbl> 
# 1  Asia 0.4356350  0.4342026 8.9244419 304.1298 6.922751e-51  2 -1427.65947 2861.31893 2873.26317 
# 2 Europe 0.4984677  0.4956580 3.8584819 177.4093 3.186760e-54  3 -995.41016 1998.82033 2014.36475 
# 3 Africa 0.4160797  0.4141991 7.0033542 221.2506 2.836552e-73  3 -2098.46089 4204.92179 4222.66639 
# 4 Americas 0.9812082  0.9811453 8.9703814 15612.1901 4.227928e-260  1 -1083.35918 2170.71836 2178.12593 
# 5 Oceania 0.9733268  0.9693258 0.6647653 243.2719 6.662577e-16  4 -22.06696 54.13392 60.02419 
# # ... with 2 more variables: deviance <dbl>, df.residual <int> 
+0

这可能是我也可以用modelr :: formula()创建公式,尽管我更喜欢用公式来保留建模函数(“lm”),所以它在summary()中显示出来,来电。Thx! –

1

我发现purrr::at_depth()是做什么,我想在原来的问题与est_model()做。这是解决我现在高兴:。这里的downvote

library(gapminder) 
library(tidyverse) 
library(purrr) 
library(broom) 

fmlas <- tibble::tribble(
    ~continent, ~formula, 
    "Asia", ~lm(lifeExp ~ year, data = .), 
    "Europe", ~lm(lifeExp ~ year + pop, data = .), 
    "Africa", ~lm(lifeExp ~ year + gdpPercap, data = .), 
    "Americas", ~lm(lifeExp ~ year - 1, data = .), 
    "Oceania", ~lm(lifeExp ~ year + pop + gdpPercap, data = .) 
) 

by_continent <- gapminder %>% 
    nest(-continent) %>% 
    left_join(fmlas) %>% 
    mutate(model=map2(data, formula, ~at_depth(.x, 0, .y))) 

by_continent %>% 
    mutate(glance=map(model, glance)) %>% 
    unnest(glance, .drop=T) 
相关问题