2012-02-24 254 views
0

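R: Bayesian logistic regression for hierarchical data

This is a repost from stats.stackexchange, where I did not get a satisfactory response. I have two datasets: the first describes schools, and the second lists the students in each school who *failed* a standardized test (emphasis intentional). A fake dataset can be generated as follows (thanks to Tharen):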

#random school data for 30 schools 
schools.num = 30 
schools.data = data.frame(school_id=seq(1,schools.num) 
         ,tot_white=sample(100:300,schools.num,TRUE) 
         ,tot_black=sample(100:300,schools.num,TRUE) 
         ,tot_asian=sample(100:300,schools.num,TRUE) 
         ,school_rev=sample(4e6:6e6,schools.num,TRUE) 
         ) 

#total students in each school 
schools.data$tot_students = schools.data$tot_white + schools.data$tot_black + schools.data$tot_asian 
#sum of all students all schools 
tot_students = sum(schools.data$tot_white, schools.data$tot_black, schools.data$tot_asian) 
#generate some random failing students 
fail.num = as.integer(tot_students * 0.05) 

#one row per failing student; ids are drawn without replacement so they stay unique 
students = data.frame(student_id=sample(seq_len(tot_students), fail.num, FALSE) 
         ,school_id=sample(1:schools.num, fail.num, TRUE) 
         ,race=sample(c('white', 'black', 'asian'), fail.num, TRUE) 
        ) 

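I want to estimate P(Fail = 1 | Student Race, School Revenue). If I run a multinomial discrete choice model on the student dataset, I will clearly be estimating P(Race | Fail = 1). I obviously have to estimate the inverse of this. Since all the pieces of information are available in the two datasets (P(Fail), P(Race), Revenue), I see no reason why this cannot be done. But I am stumped as to how to actually implement it in R. Any pointers would be much appreciated. Thanks.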
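To make the "inverse" concrete: by Bayes' rule, P(Fail = 1 | Race) = P(Race | Fail = 1) * P(Fail = 1) / P(Race). The following is only a minimal base-R sketch of that inversion using counts alone. It ignores school revenue, and the seed and the names `schools`, `fail_race`, and `p_fail_given_race` are assumptions made for illustration, not part of the original data.

```r
# Minimal sketch: invert P(Race | Fail = 1) with Bayes' rule, using counts only.
set.seed(1)  # assumption: fixed seed so the illustration is reproducible
n_schools <- 30
races <- c("white", "black", "asian")
schools <- data.frame(
  school_id = seq_len(n_schools),
  white = sample(100:300, n_schools, TRUE),
  black = sample(100:300, n_schools, TRUE),
  asian = sample(100:300, n_schools, TRUE)
)

tot_by_race <- colSums(schools[, races])   # enrolment per race, all schools
n_total <- sum(tot_by_race)
n_fail <- as.integer(n_total * 0.05)

# Race of each failing student (the only student-level data available)
fail_race <- factor(sample(races, n_fail, TRUE), levels = races)

p_race <- tot_by_race / n_total                              # P(Race)
p_fail <- n_fail / n_total                                   # P(Fail = 1)
p_race_given_fail <- as.numeric(table(fail_race)) / n_fail   # P(Race | Fail = 1)
names(p_race_given_fail) <- races

# Bayes' rule, element-wise over the three races
p_fail_given_race <- p_race_given_fail * p_fail / p_race
print(round(p_fail_given_race, 4))
```

The result is algebraically the same as dividing the number of failures in each race by the enrolment of that race, which is why combining both datasets into counts of failures and totals is enough to fit a regression for P(Fail = 1 | ...).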

Answers

1

It would be easier if you had a single data.frame.

library(reshape2) 
library(plyr) 
d1 <- ddply(
    students, 
    c("school_id", "race"), 
    summarize, 
    fail=length(student_id) 
) 
d2 <- with(schools.data, data.frame( 
    school_id = school_id, 
    white = tot_white, 
    black = tot_black, 
    asian = tot_asian, 
    school_rev = school_rev 
)) 
d2 <- melt(d2, 
    id.vars=c("school_id", "school_rev"), 
    variable.name="race", 
    value.name="total" 
) 
d <- merge(d1, d2, by=c("school_id", "race")) 
d$pass <- d$total - d$fail 

Then you can look at the data

library(lattice) 
xyplot(fail/total ~ school_rev | race, data=d) # variables are looked up in d 

or compute whatever you want.

r <- glm(
    cbind(fail,pass) ~ race + school_rev, 
    data=d, 
    family=binomial() # Logistic regression (not bayesian) 
) 
summary(r) 
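Once a grouped logistic regression like this is fitted, the quantity asked about, P(Fail = 1 | race, revenue), can be read off with `predict(..., type = "response")`. The sketch below is self-contained and uses simulated cells rather than the `d` built above; the names `cells`, `rev_m` (revenue in millions, an assumed rescaling), and `newdata` are illustrative only.

```r
# Minimal sketch: predicted failure probabilities from a grouped binomial glm.
set.seed(2)  # assumption: fixed seed for reproducibility
cells <- expand.grid(school_id = 1:30, race = c("white", "black", "asian"))
rev_by_school <- sample(4e6:6e6, 30, TRUE)
cells$school_rev <- rev_by_school[cells$school_id]
cells$total <- sample(100:300, nrow(cells), TRUE)
cells$fail <- rbinom(nrow(cells), size = cells$total, prob = 0.05)
cells$pass <- cells$total - cells$fail
cells$rev_m <- cells$school_rev / 1e6  # revenue in millions, for better conditioning

fit <- glm(cbind(fail, pass) ~ race + rev_m, data = cells, family = binomial())

# Grid of race x revenue values at which to evaluate P(Fail = 1 | race, revenue)
newdata <- expand.grid(race = levels(cells$race), rev_m = c(4, 5, 6))
newdata$p_fail <- predict(fit, newdata = newdata, type = "response")
print(newdata)
```

Because the simulated failures here do not depend on race or revenue, the predicted probabilities should all sit near 0.05; with real data the grid would show how the estimate moves with each covariate.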

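(Edit) If you have more information about the students who failed, but only aggregated data for those who passed, you can recreate a complete dataset as follows.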

# Unique student_id for the passed students 
d3 <- ddply(d, 
    c("school_id", "race"), 
    summarize, student_id=1:pass 
) 
d3$student_id <- - seq_len(nrow(d3)) 
# All students 
d3$result <- "pass" 
students$result <- "fail" 
d3 <- merge(# rather than rbind, in case there are more columns 
    d3, students, 
    by=c("student_id", "school_id", "race", "result"), 
    all=TRUE 
) 
# Students and schools in a single data.frame 
d3 <- merge(d3, schools.data, by="school_id", all=TRUE) 
# Check that the results did not change 
r <- glm(
    (result=="fail") ~ race + school_rev, 
    data=d3, 
    family=binomial() 
) 
summary(r) 
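The reason the results should not change is that a binomial regression on (fail, pass) counts and a Bernoulli regression on one row per student have the same likelihood up to a constant, as long as the covariates are identical within each cell. A small base-R check of that claim, on simulated data with assumed names (`cells`, `students_full`, `failed`):

```r
# Minimal sketch: grouped and student-level logistic regressions agree.
set.seed(3)  # assumption: fixed seed for reproducibility
cells <- data.frame(
  race = factor(rep(c("white", "black", "asian"), each = 4)),
  rev_m = rep(c(4, 4.5, 5, 6), times = 3),  # revenue in millions (illustrative)
  total = sample(100:300, 12, TRUE)
)
cells$fail <- rbinom(12, size = cells$total, prob = 0.05)
cells$pass <- cells$total - cells$fail

# Expand each cell to one row per student: 'fail' ones followed by 'pass' zeros
idx <- rep(seq_len(nrow(cells)), times = cells$total)
students_full <- cells[idx, c("race", "rev_m")]
students_full$failed <- unlist(
  Map(function(f, p) rep(c(1, 0), times = c(f, p)), cells$fail, cells$pass)
)

fit_grouped <- glm(cbind(fail, pass) ~ race + rev_m,
                   data = cells, family = binomial())
fit_student <- glm(failed ~ race + rev_m,
                   data = students_full, family = binomial())

print(cbind(grouped = coef(fit_grouped), student = coef(fit_student)))
```

The two coefficient columns should match up to numerical tolerance. The student-level form only becomes necessary once student-level covariates are added, since those vary within a cell.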

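Thank you for this, Vincent. The problem with rolling everything up to the school level is that I cannot include additional student-level characteristics, say, parents' income. That is why I want an explicitly hierarchical way of estimating the inverse probability. – user702432 2012-02-24 08:13:57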


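In that case, I would still suggest putting everything in the same data.frame (including school_id, student_id, race, result, school_rev, etc.), but you also need rows for the students who passed the test. – 2012-02-24 08:24:34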


That is the problem. I have a truncated sample at the student level, which is why I was thinking of something along the lines of mixed modeling. – user702432 2012-02-24 08:28:38

0

You need a dataset with information on all students, both those who failed and those who passed.

schools.num = 30 
schools.data = data.frame(school_id=seq(1,schools.num) 
          ,tot_white=sample(100:300,schools.num,TRUE) 
          ,tot_black=sample(100:300,schools.num,TRUE) 
          ,tot_asian=sample(100:300,schools.num,TRUE) 
          ,school_rev=sample(4e6:6e6,schools.num,TRUE) 
         ) 

library(plyr) 
fail_ratio <- 0.05 
# One row per student: simulate pass/fail for every student in each school 
dataset <- ddply(schools.data, .(school_id, school_rev), function(x){ 
    n <- sum(x$tot_white, x$tot_asian, x$tot_black) 
    data.frame( 
        Fail = rbinom(n, size = 1, prob = fail_ratio), 
        Race = c(rep("white", x$tot_white), rep("asian", x$tot_asian), rep("black", x$tot_black)) 
    ) 
}) 
dataset$Race <- factor(dataset$Race) 

Then you can use glmer() from the lme4 package for a frequentist approach.

library(lme4) 
glmer(Fail ~ school_rev + Race + (1|school_id), data = dataset, family = binomial) 
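One practical caveat: with school_rev in the millions, glmer() may warn about predictors being on very different scales or about convergence. A possible workaround, sketched here on freshly simulated data, is to standardize the revenue first. The column `rev_z` and the simulated `dataset` below are assumptions for illustration and are not the objects created above.

```r
# Minimal sketch: random-intercept logistic regression with standardized revenue.
library(lme4)

set.seed(4)  # assumption: fixed seed for reproducibility
n_schools <- 30
school_rev <- sample(4e6:6e6, n_schools, TRUE)
n_per_school <- sample(300:900, n_schools, TRUE)

# One row per student, with the school-level revenue repeated for each student
dataset <- data.frame(
  school_id = factor(rep(seq_len(n_schools), times = n_per_school)),
  school_rev = rep(school_rev, times = n_per_school)
)
dataset$Race <- factor(sample(c("white", "black", "asian"), nrow(dataset), TRUE))
dataset$Fail <- rbinom(nrow(dataset), size = 1, prob = 0.05)

# Standardize revenue (mean 0, sd 1) so the optimizer sees well-scaled inputs
dataset$rev_z <- as.numeric(scale(dataset$school_rev))

fit <- glmer(Fail ~ rev_z + Race + (1 | school_id),
             data = dataset, family = binomial)
print(fixef(fit))
```

Since the simulated failures carry no real school effect, lme4 may report a singular fit here (a school-level variance estimated at or near zero); that is expected for this toy data and is not an error. The coefficient on `rev_z` is then the change in log-odds per standard deviation of revenue rather than per dollar.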

If you need Bayesian estimates, have a look at the MCMCglmm package.