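I am trying to run a logit regression from statsmodels in Python inside a for loop. On each iteration I append one row from the test data to my training DataFrame, re-run the regression, and store the results. Is a strange error preventing me from testing my logit regression classifier?
Interestingly, the test data is not being appended correctly (I think this is what causes the KeyError: 0 that I get, but I welcome your opinions here). I have tried importing two versions of the test data: one with the same labels as the training data, and another with no labels declared.
Here is my code: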
import pandas as pd
import numpy as np
import statsmodels.api as sm
import datetime
df_train = pd.read_csv('Adult-Incomes/train-labelled-final-variables-condensed-coded-countries-removed-unlabelled-income-to-the-left-relabelled-copy.csv')
print('Training set')
print(df_train.head(15))
train_cols = df_train.columns[1:]
logit = sm.Logit(df_train['Income'], df_train[train_cols])
result = logit.fit()
print("ODDS RATIO")
print(result.params)
print("RESULTS SUMMARY")
print(result.summary())
print("CONFIDENCE INTERVAL")
print(result.conf_int())
# append test data
print("PREDICTION PROCESS")
print("READING TEST DATA")
df_test = pd.read_csv('Adult-Incomes/test-final-variables-cleaned-coded-copy-relabelled.csv')
print("TEST DATA READ COMPLETE")
iteration_time = []
iteration_result = []
iteration_params = []
iteration_conf_int = []
df_train.to_pickle('train_iteration.pickle')
print(df_test.head())
print("Loop begins")
for row in range(0,len(df_test)):
    start_time = datetime.datetime.now()
    print("Loop iteration ", row, " in ", len(df_test), " rows")
    df_train = pd.read_pickle('train_iteration.pickle')
    print("pickle read")
    df_train.append(df_test[row])
    print("row ", row, " appended")
    train_cols = df_train.columns[1:]
    print("X variables extracted in new DataFrame")
    logit = sm.Logit(df_train['Income'], df_train[train_cols])
    print("Def logit reg eqn")
    result = logit.fit()
    print("fit logit reg eqn")
    iteration_result[row] = result.summary()
    print("logit result summary stored in array")
    iteration_params[row] = result.params
    print("logit params stored in array")
    iteration_conf_int[row] = result.conf_int()
    print("logit conf_int stored in array")
    df_train.to_pickle('train_iteration.pickle')
    print("exported to pickle")
    end_time = datetime.datetime.now()
    time_diff = start_time - end_time
    print("time for this iteration is ", time_diff)
    iteration_time[row] = time_diff
    print("ending iteration, starting next iteration of loop...")
print("Loop ends")
pd.DataFrame(iteration_result)
pd.DataFrame(iteration_time)
print (iteration_result.head())
print (iteration_time.head())
It prints up to this point:
Loop iteration 0 in 15060 rows
pickle read
But then it raises KeyError: 0
What am I doing wrong here? The version with labels:
   Income  Age  Workclass  Education  Marital_Status  Occupation  \
0       0    1          4          7               4           6
1       0    1          4          9               2           4
2       1    1          6         12               2          10
3       1    1          4         10               2           6
4       0    1          4          6               4           7

   Relationship  Race  Sex  Capital_gain  Capital_loss  Hours_per_week
0             3     2    0             0             0              40
1             0     4    0             0             0              50
2             0     4    0             0             0              40
3             0     2    0          7688             0              40
4             1     4    0             0             0              30
The unlabelled version of the test data:
   0  1  4   7  4.1   6  3  2  0.1   0.2  0.3  40
0  0  1  4   9    2   4  0  4    0     0    0  50
1  1  1  6  12    2  10  0  4    0     0    0  40
2  1  1  4  10    2   6  0  2    0  7688    0  40
3  0  1  4   6    4   7  1  4    0     0    0  30
4  1  2  2  15    2   9  0  4    0  3103    0  32
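In both cases, that is, whether I use the unlabelled version or the version of the test data whose labels match the training data, I still get the same error at the same point.
Can anyone guide me on how to proceed?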
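A side note on the bookkeeping in the loop: the result lists start out empty, so assigning to `iteration_time[row]` would raise an IndexError once the loop gets that far, and `start_time - end_time` gives a negative duration. Below is a minimal sketch of that part only, with a hypothetical placeholder dictionary standing in for the fitted model's `result.params`; it is not the full regression loop.

```python
import datetime

iteration_time = []
iteration_params = []

for row in range(3):  # 3 is an arbitrary stand-in for len(df_test)
    start_time = datetime.datetime.now()

    # Hypothetical placeholder for result.params from the fitted model.
    params = {"row": row}

    # Empty lists cannot be assigned by index; append grows them instead.
    iteration_params.append(params)

    end_time = datetime.datetime.now()
    # end minus start gives a non-negative elapsed time.
    iteration_time.append(end_time - start_time)
```

After the loop, each list holds one entry per iteration, and something like `pd.DataFrame(iteration_time)` would need to be assigned to a name before calling `.head()` on it.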
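Update: here is the full error message (the first three lines are output from my print statements; the error starts on the fourth line):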
Loop begins
Loop iteration 0 in 15060 rows
pickle read
Traceback (most recent call last):
  File "<ipython-input-10-1f56d5243e43>", line 1, in <module>
    runfile('/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py', wdir='/media/deepak/Laniakea/Projects/Training/SPYDER/classifier')
  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py", line 64, in <module>
    df_train.append(df_test[row])
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3541, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python3.5/dist-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)
KeyError: 0
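For what it is worth, the traceback points at `df_test[row]`. Indexing a DataFrame with square brackets looks up a column label, so `df_test[0]` searches for a column named 0 and fails when no such column exists. The following is a minimal sketch of one way to append a single row by position instead. The toy frames and their values are made up for illustration and are not the real CSV data. It also assumes `pd.concat` is used to combine the frames, since `DataFrame.append` returned a new frame rather than modifying the original in place and was removed in pandas 2.0.

```python
import pandas as pd

# Toy stand-ins for the real CSV data (values are made up for illustration).
df_train = pd.DataFrame({"Income": [0, 1], "Age": [1, 1], "Hours_per_week": [40, 50]})
df_test = pd.DataFrame({"Income": [1, 0], "Age": [2, 1], "Hours_per_week": [32, 30]})

# df_test[0] looks for a COLUMN labelled 0, which does not exist here.
try:
    df_test[0]
    raised = False
except KeyError:
    raised = True

# .iloc selects by row position; the double brackets keep a one-row DataFrame.
first_row = df_test.iloc[[0]]

# concat returns a new frame, so the result has to be assigned back.
df_train = pd.concat([df_train, first_row], ignore_index=True)
```

Inside the loop, the same idea would read `df_test.iloc[[row]]`, with the concatenated result assigned back to `df_train` before fitting.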
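UPDATE: I get this as the last line of the output of the print(df_train.std()) statement, after the standard deviations of all the columns: dtype: float64
So I guess my training DataFrame is being treated as float.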
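I would prefer to start with the labelled data, because in the unlabelled data the first row is being assigned as the header; take a look at your unlabelled test data version. Can you paste the error you get when you try the labelled test data?
Hi, yes, I have added the error message to the question. Take a look.
This error occurs because in the unlabelled test set the first row is being read as the column headers. Can you try the append with the labelled test set and let us know the error? Also, when you load the labelled test set, please check that 'header = True' is present.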
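To illustrate the header point, here is a small sketch using an in-memory CSV rather than the real files. The sample rows and the shortened column list are assumptions made for the example. As far as I know, the `header` argument of `pd.read_csv` takes a row number or `None` rather than a boolean, so the sketch uses `header=None` together with `names` for a file that has no header row.

```python
import io
import pandas as pd

# Made-up sample of a header-less file: three data rows, five columns.
csv_text = "0,1,4,7,4\n0,1,4,9,2\n1,1,6,12,2\n"
cols = ["Income", "Age", "Workclass", "Education", "Marital_Status"]

# With the default header=0, the first data row is consumed as column names,
# and the repeated value 4 is renamed to '4.1' to keep the names unique.
wrong = pd.read_csv(io.StringIO(csv_text))

# header=None tells pandas there is no header row; names supplies the labels.
right = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)
```

Here `wrong` ends up one row short with columns such as `4.1`, which resembles the unlabelled test data printed in the question, while `right` keeps all three rows under the training column names.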