For the code below, my R-squared score comes out negative, but my accuracy score using k-fold cross-validation comes out to about 92%. How is this possible? I am using a random forest regression algorithm to predict some data. The dataset is available here: https://www.kaggle.com/ludobenistant/hr-analytics

import numpy as np 
import pandas as pd 
from sklearn.preprocessing import LabelEncoder,OneHotEncoder 

dataset = pd.read_csv("HR_comma_sep.csv") 
x = dataset.iloc[:,:-1].values ##Independent variable 
y = dataset.iloc[:,9].values  ##Dependent variable 

##Encoding the categorical variables 

le_x1 = LabelEncoder() 
x[:,7] = le_x1.fit_transform(x[:,7]) 
le_x2 = LabelEncoder() 
x[:,8] = le_x2.fit_transform(x[:,8]) 
ohe = OneHotEncoder(categorical_features = [7,8]) 
x = ohe.fit_transform(x).toarray() 


##splitting the dataset in training and testing data 

from sklearn.cross_validation import train_test_split 
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1) 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0) 

from sklearn.preprocessing import StandardScaler 
sc_x = StandardScaler() 
x_train = sc_x.fit_transform(x_train) 
x_test = sc_x.transform(x_test) 
sc_y = StandardScaler() 
y_train = sc_y.fit_transform(y_train) 

from sklearn.ensemble import RandomForestRegressor 
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0) 
regressor.fit(x_train, y_train) 

y_pred = regressor.predict(x_test) 
print(y_pred) 
from sklearn.metrics import r2_score 
r2_score(y_test , y_pred) 

from sklearn.model_selection import cross_val_score 
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10) 
accuracies.mean() 
accuracies.std() 

Answer


There are several issues with your question...

For starters, you are making a very basic mistake: you think you are using accuracy as a metric, while you are in a regression setting, where the metric actually used underneath is the mean squared error (MSE).

Accuracy is a metric used in classification, and it has to do with the percentage of correctly classified examples - check the Wikipedia entry for more details.
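To see the difference concretely, here is a minimal sketch (with made-up toy arrays, not your data) contrasting the two kinds of metrics: accuracy needs discrete class labels, while R-squared and MSE operate on continuous predictions:

from sklearn.metrics import accuracy_score, r2_score, mean_squared_error 

# classification-style labels: accuracy is the fraction of exact matches 
y_true_cls = [0, 1, 1, 0] 
y_pred_cls = [0, 1, 0, 0] 
print(accuracy_score(y_true_cls, y_pred_cls))      # 0.75, i.e. 75% correctly classified 

# regression-style targets: R-squared and MSE measure how close the values are 
y_true_reg = [3.0, -0.5, 2.0, 7.0] 
y_pred_reg = [2.5, 0.0, 2.0, 8.0] 
print(r2_score(y_true_reg, y_pred_reg))            # ~0.949 
print(mean_squared_error(y_true_reg, y_pred_reg))  # 0.375 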

The metric used internally by your chosen regressor (Random Forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, 
      max_features='auto', max_leaf_nodes=None, 
      min_impurity_split=1e-07, min_samples_leaf=1, 
      min_samples_split=2, min_weight_fraction_leaf=0.0, 
      n_estimators=10, n_jobs=1, oob_score=False, random_state=0, 
      verbose=0, warm_start=False) 

MSE is a positive continuous quantity, and it is not upper-bounded by 1; i.e., if you got a value of 0.92, this means... well, 0.92, not 92%.
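If that sounds surprising, a quick toy check (made-up numbers, unrelated to your dataset) shows that the MSE easily exceeds 1 when predictions are far from the targets, so it cannot be read as a percentage:

from sklearn.metrics import mean_squared_error 
print(mean_squared_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))  # ~5.67, well above 1 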

Knowing that, it is good practice to include the MSE explicitly as the scoring function of your cross-validation:

cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error') 
cv_mse.mean() 
# -2.433430574463703e-28 
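
Note that scikit-learn's neg_mean_squared_error scorer returns the negated MSE (so that higher is always better), which is why the value above carries a minus sign; flip the sign to read it as an ordinary MSE:

-cv_mse.mean() 
# 2.433430574463703e-28 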

For all practical purposes, this is zero - you fit your training set almost perfectly; for confirmation, here is the (again perfect) R-squared score of your model on the training set:

train_pred = regressor.predict(x_train) 
r2_score(y_train , train_pred) 
# 1.0 

But, as always, the moment of truth comes when you apply your model to the test set; and here your second mistake is that, since you train your regressor with a scaled y_train, you should also scale y_test before evaluating:

y_test = sc_y.fit_transform(y_test) 
r2_score(y_test , y_pred) 
# 0.9998476914664215 

and you get a very nice R-squared on the test set (close to 1).

What about the MSE?

from sklearn.metrics import mean_squared_error 
mse_test = mean_squared_error(y_test, y_pred) 
mse_test 
# 0.00015230853357849051 