2016-04-14

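Dummy variables as slope shifters without intercept shifters

This is my first question here.

I want only slope dummy variables (no intercept dummies). However, if I multiply the dummy variable by the independent variable, as shown below, the result contains both slope dummies and intercept dummies.

I would like to include only the slope dummies and exclude the intercept dummies.

I would appreciate your help. Best regards, yjkim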

reg <- lm(year ~ as.factor(age)*log(v1269))
summary(reg)

Call:
lm(formula = year ~ as.factor(age) * log(v1269))

Residuals:
   Min     1Q Median     3Q    Max
-6.083 -1.177  1.268  1.546  3.768

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 5.18076    2.16089   2.398   0.0167 *
as.factor(age)2             1.93989    2.75892   0.703   0.4821
as.factor(age)3             2.46861    2.39393   1.031   0.3027
as.factor(age)4            -0.56274    2.3      -0.245   0.8069
log(v1269)                 -0.06788    0.23606  -0.288   0.7737
as.factor(age)2:log(v1269) -0.15628    0.29621  -0.528   0.5979
as.factor(age)3:log(v1269) -0.14961    0.25809  -0.580   0.5622
as.factor(age)4:log(v1269)  0.16534    0.24884   0.664   0.5065

Do you want to get rid of the '(Intercept)' term, or of the three 'as.factor(age)2', 'as.factor(age)3' and 'as.factor(age)4' terms?

Answers


You just need a -1 inside the formula:

reg <- lm(year ~ as.factor(age)*log(v1269) -1) 
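
To see which terms the -1 actually removes, it can help to compare the design-matrix column names with and without it. The sketch below is only an illustration: the data frame `d` and its values are made up (the question's data are not available), and only the variable names follow the question.

```r
# Simulated stand-in data; names mirror the question, values are invented
set.seed(1)
d <- data.frame(age   = sample(1:4, 100, replace = TRUE),
                v1269 = rlnorm(100),
                year  = rnorm(100))

# Column names of the design matrix with the global intercept ...
with_int <- colnames(model.matrix(year ~ as.factor(age) * log(v1269), data = d))

# ... and with the intercept removed via -1
without_int <- colnames(model.matrix(year ~ as.factor(age) * log(v1269) - 1, data = d))

with_int
without_int
```

With the intercept dropped, R codes the first factor with one indicator column per level, so the per-age intercept terms are still part of the model; only the '(Intercept)' column disappears.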

If you want to estimate a different slope for each level of age, you can use the %in% operator in the formula:

set.seed(1) 
df <- data.frame(age = factor(sample(1:4, 100, replace = TRUE)), 
       v1269 = rlnorm(100), 
       year = rnorm(100)) 

m <- lm(year ~ log(v1269) %in% age, data = df) 
summary(m) 

This gives (for this completely random, dummy, silly data set):

> summary(m) 

Call: 
lm(formula = year ~ log(v1269) %in% age, data = df) 

Residuals:
     Min       1Q   Median       3Q      Max
-2.93108 -0.66402 -0.05921  0.68040  2.25244

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.02692    0.10705   0.251    0.802
log(v1269):age1  0.20127    0.21178   0.950    0.344
log(v1269):age2 -0.01431    0.24116  -0.059    0.953
log(v1269):age3 -0.02588    0.24435  -0.106    0.916
log(v1269):age4  0.06019    0.21979   0.274    0.785

Residual standard error: 1.065 on 95 degrees of freedom 
Multiple R-squared: 0.01037, Adjusted R-squared: -0.0313 
F-statistic: 0.2489 on 4 and 95 DF, p-value: 0.9097 

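Note that this fits a single constant term and 4 different log(v1269) effects, one for each level of age. Visually, this is roughly what the model is doing: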

pred <- with(df, 
      expand.grid(age = factor(1:4), 
         v1269 = seq(min(v1269), max(v1269), length = 100))) 
pred <- transform(pred, fitted = predict(m, newdata = pred)) 

library("ggplot2") 
ggplot(df, aes(x = log(v1269), y = year, colour = age)) + 
    geom_point() + 
    geom_line(data = pred, mapping = aes(y = fitted)) + 
    theme_bw() + theme(legend.position = "top") 

Simulated data plus fitted slopes from the nested slope model described in the answer

Clearly, this would only be appropriate if there is no significant difference in the mean of year (the response) across the age categories.
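
One way to probe that assumption is to compare the common-intercept model with a model that also gives each age group its own intercept. The following is a minimal sketch using the same simulated data as above; the object names `m_slopes` and `m_full` are introduced here for illustration only.

```r
# Same simulated data as in the answer
set.seed(1)
df <- data.frame(age   = factor(sample(1:4, 100, replace = TRUE)),
                 v1269 = rlnorm(100),
                 year  = rnorm(100))

# Common intercept, one slope per age level
m_slopes <- lm(year ~ log(v1269) %in% age, data = df)

# Separate intercept and separate slope per age level
m_full <- lm(year ~ age * log(v1269), data = df)

# F-test for the three additional intercept terms
cmp <- anova(m_slopes, m_full)
cmp
```

A large p-value in this comparison would be consistent with using a single intercept; a small one suggests the intercept dummies should stay in the model.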

Note that a different parameterisation of the same model can be obtained with the / operator:

m2 <- lm(year ~ log(v1269)/age, data = df) 

> m2 

Call: 
lm(formula = year ~ log(v1269)/age, data = df) 

Coefficients: 
    (Intercept)  log(v1269) log(v1269):age2 log(v1269):age3 
     0.02692   0.20127   -0.21559   -0.22715 
log(v1269):age4 
     -0.14108 

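Note that now the first log(v1269) term is the slope for age == 1, whereas the other terms are the adjustments that need to be applied to the log(v1269) term to obtain the slope for the indicated group: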

coef(m)[-1] 
coef(m2)[2] + c(0, coef(m2)[-(1:2)]) 

> coef(m)[-1]
log(v1269):age1 log(v1269):age2 log(v1269):age3 log(v1269):age4
     0.20127109     -0.01431491     -0.02588106      0.06018802
> coef(m2)[2] + c(0, coef(m2)[-(1:2)])
                log(v1269):age2 log(v1269):age3 log(v1269):age4
     0.20127109     -0.01431491     -0.02588106      0.06018802

But they work out to the same estimated slopes.
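
The equivalence of the two parameterisations can also be checked programmatically rather than by eye. This is a small sketch that refits both models on the same simulated data; `slopes_m` and `slopes_m2` are names introduced here for illustration.

```r
# Same simulated data as in the answer
set.seed(1)
df <- data.frame(age   = factor(sample(1:4, 100, replace = TRUE)),
                 v1269 = rlnorm(100),
                 year  = rnorm(100))

m  <- lm(year ~ log(v1269) %in% age, data = df)  # one slope per age level
m2 <- lm(year ~ log(v1269) / age, data = df)     # baseline slope plus adjustments

# Per-group slopes recovered from each parameterisation, names dropped
slopes_m  <- unname(coef(m)[-1])
slopes_m2 <- unname(coef(m2)[2] + c(0, coef(m2)[-(1:2)]))

# Compare slopes and fitted values up to numerical tolerance
all.equal(slopes_m, slopes_m2)
all.equal(fitted(m), fitted(m2))
```

Both comparisons are expected to agree, since the two formulas describe the same model and differ only in how the slopes are expressed.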