2017-02-20 101 views

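How do I test a logistic regression model in R?

I am developing a CTR prediction model for a Kaggle competition (link). I read the first 100,000 lines of the training dataset and then split them 80/20 into training and test sets: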

library(caret)  # provides createDataPartition()
ad_data <- read.csv("train", header = TRUE, stringsAsFactors = FALSE, nrows = 100000)
trainIndex <- createDataPartition(ad_data$click, p=0.8, list=FALSE, times=1)
ad_train <- ad_data[trainIndex,]
ad_test <- ad_data[-trainIndex,]

I then use the ad_train data to fit a GLM model:

ad_glm_model <- glm(ad_train$clicks ~ ad_train$C1 + ad_train$site_category + ad_train$device_type, family = binomial(link = "logit"), data = ad_train) 

But whenever I use the predict function to see how well the model does on the ad_test set, I get the following warning:

test_model <- predict(ad_glm_model, newdata = ad_test, type = "response") 
Warning message: 
'newdata' had 20000 rows but variables found have 80000 rows 

What is going on? How can I test my GLM model on new data?

EDIT: It works perfectly. I just needed to make this call instead:

ad_glm_model <- glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train) 

Don't use 'ad_train$' in the glm call; just use 'data =' instead. – user20650

Answer


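This happens because you prefix every variable in the model formula with the name of the data frame. The model terms are then tied to ad_train itself, so predict ignores the columns in newdata. Instead, your formula should be: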

glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train) 
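
To show the corrected call end to end, here is a minimal sketch on synthetic data. The column values, the base-R split (used in place of createDataPartition so the snippet needs no packages), and the two hold-out metrics are assumptions for illustration, not part of the original question.

```r
set.seed(42)

# Synthetic stand-in for the ad data (hypothetical values, not the Kaggle file)
n <- 1000
ad_data <- data.frame(
  clicks      = rbinom(n, 1, 0.3),
  C1          = sample(c(1002, 1005, 1010), n, replace = TRUE),
  device_type = sample(0:1, n, replace = TRUE)
)

# Simple 80/20 split with base R
train_rows <- sample(n, size = floor(0.8 * n))
ad_train <- ad_data[train_rows, ]
ad_test  <- ad_data[-train_rows, ]

# Bare column names in the formula; glm() looks them up in `data`
ad_glm_model <- glm(clicks ~ C1 + device_type,
                    family = binomial(link = "logit"),
                    data = ad_train)

# predict() evaluates the same names in `newdata`, one value per test row
test_prob <- predict(ad_glm_model, newdata = ad_test, type = "response")

# Two simple hold-out metrics: accuracy at a 0.5 cutoff and log loss
test_pred <- as.integer(test_prob > 0.5)
accuracy  <- mean(test_pred == ad_test$clicks)
log_loss  <- -mean(ad_test$clicks * log(test_prob) +
                   (1 - ad_test$clicks) * log(1 - test_prob))
```

Here length(test_prob) equals nrow(ad_test), which is the check that fails with the 'ad_train$' version of the formula.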

As described in the second link in the duplicate notice:

This is a problem of using different names between your data and your newdata and not a problem between using vectors or dataframes.

When you fit a model with the lm function and then use predict to make predictions, predict tries to find the same names on your newdata. In your first case name x conflicts with mtcars$wt and hence you get the warning.
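
The name lookup described above can be reproduced with the built-in mtcars data. This is only an illustrative sketch of the same mechanism; new_cars and the fit names are hypothetical and do not come from the linked question.

```r
# Formula with hard-coded data frame columns: the term label is "mtcars$wt"
bad_fit <- lm(mtcars$mpg ~ mtcars$wt)

# Formula with bare names, resolved through `data`
good_fit <- lm(mpg ~ wt, data = mtcars)

new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# `newdata` cannot supply "mtcars$wt", so predict() evaluates it against the
# original mtcars object and warns that the row counts differ
bad_pred  <- suppressWarnings(predict(bad_fit, newdata = new_cars))
good_pred <- predict(good_fit, newdata = new_cars)

length(bad_pred)   # 32, one per row of mtcars
length(good_pred)  # 3, one per row of new_cars
```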