2016-11-29 98 views
0

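I am trying to run a logit regression from statsmodels in Python inside a for loop. On each iteration I append one row from the test data to my training DataFrame, re-run the regression, and store the results. A strange error is preventing me from testing my logit regression classifier.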

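Interestingly, the test data is not being appended correctly (I think this is what causes the KeyError: 0 that I get, but I invite your opinions here). I have tried importing two versions of the test data: one with the same labels as the training data, and another with no labels declared.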

Here is my code:

import pandas as pd 
import numpy as np 
import statsmodels.api as sm 
import datetime 

df_train = pd.read_csv('Adult-Incomes/train-labelled-final-variables-condensed-coded-countries-removed-unlabelled-income-to-the-left-relabelled-copy.csv') 
print('Training set') 
print(df_train.head(15)) 

train_cols = df_train.columns[1:] 
logit = sm.Logit(df_train['Income'], df_train[train_cols]) 
result = logit.fit() 

print("ODDS RATIO") 
print(result.params) 
print("RESULTS SUMMARY") 
print(result.summary()) 
print("CONFIDENCE INTERVAL") 
print(result.conf_int()) 

#appnd test data 

print("PREDICTION PROCESS") 
print("READING TEST DATA") 
df_test = pd.read_csv('Adult-Incomes/test-final-variables-cleaned-coded-copy-relabelled.csv') 
print("TEST DATA READ COMPLETE") 

iteration_time = [] 
iteration_result = [] 
iteration_params = [] 
iteration_conf_int = [] 

df_train.to_pickle('train_iteration.pickle') 
print(df_test.head()) 

print("Loop begins") 

for row in range(0,len(df_test)): 
    start_time = datetime.datetime.now() 
    print("Loop iteration ", row, " in ", len(df_test), " rows") 

    df_train = pd.read_pickle('train_iteration.pickle') 
    print("pickle read") 
    df_train.append(df_test[row]) 
    print("row ", row, " appended") 
    train_cols = df_train.columns[1:] 
    print("X variables extracted in new DataFrame") 
    logit = sm.Logit(df_train['Income'], df_train[train_cols]) 
    print("Def logit reg eqn") 
    result = logit.fit() 
    print("fit logit reg eqn") 
    iteration_result[row] = result.summary() 
    print("logit result summary stored in array") 
    iteration_params[row] = result.params 
    print("logit params stored in array") 
    iteration_conf_int[row] = result.conf_int() 
    print("logit conf_int stored in array") 

    df_train.to_pickle('train_iteration.pickle') 
    print("exported to pickle") 

    end_time = datetime.datetime.now() 
    time_diff = start_time - end_time 
    print("time for this iteration is ", time_diff) 
    iteration_time[row] = time_diff 
    print("ending iteration, starting next iteration of loop...") 

print("Loop ends") 

pd.DataFrame(iteration_result) 
pd.DataFrame(iteration_time) 
print (iteration_result.head()) 
print (iteration_time.head()) 

It prints up to this point:

Loop iteration 0 in 15060 rows 
pickle read 

But then it raises KeyError: 0

What am I doing wrong here? The version of the data with labels:

Income Age Workclass Education Marital_Status Occupation \ 
0  0 1   4   7    4   6 
1  0 1   4   9    2   4 
2  1 1   6   12    2   10 
3  1 1   4   10    2   6 
4  0 1   4   6    4   7 

    Relationship Race Sex Capital_gain Capital_loss Hours_per_week 
0    3  2 0    0    0    40 
1    0  4 0    0    0    50 
2    0  4 0    0    0    40 
3    0  2 0   7688    0    40 
4    1  4 0    0    0    30 

The version of the test data without labels:

0 1 4 7 4.1 6 3 2 0.1 0.2 0.3 40 
0 0 1 4 9 2 4 0 4 0  0 0 50 
1 1 1 6 12 2 10 0 4 0  0 0 40 
2 1 1 4 10 2 6 0 2 0 7688 0 40 
3 0 1 4 6 4 7 1 4 0  0 0 30 
4 1 2 2 15 2 9 0 4 0 3103 0 32 

In both cases, whether I use the unlabelled test data or the version whose labels match the training data, I still get the same error at the same point.

Can anyone guide me on how to proceed?

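Update: here is the full error message (the first three lines are output from print statements; the error starts on the fourth line):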

Loop begins 
Loop iteration 0 in 15060 rows 
pickle read 
Traceback (most recent call last): 

    File "<ipython-input-10-1f56d5243e43>", line 1, in <module> 
    runfile('/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py', wdir='/media/deepak/Laniakea/Projects/Training/SPYDER/classifier') 

    File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile 
    execfile(filename, namespace) 

    File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile 
    exec(compile(f.read(), filename, 'exec'), namespace) 

    File "/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py", line 64, in <module> 
    df_train.append(df_test[row]) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2059, in __getitem__ 
    return self._getitem_column(key) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column 
    return self._get_item_cache(key) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache 
    values = self._data.get(item) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3541, in get 
    loc = self.items.get_loc(item) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/indexes/base.py", line 2136, in get_loc 
    return self._engine.get_loc(self._maybe_cast_indexer(key)) 

    File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443) 

    File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289) 

    File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733) 

    File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687) 

KeyError: 0 

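UPDATE: I get this as the last line of the output of the print(df_train.std()) statement, after the standard deviations of all the columns: dtype: float64. So I guess my training data frame is being treated as float.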
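A side note on that `dtype: float64` line, as a minimal sketch with made-up data (the frame below is hypothetical, not the poster's file): `DataFrame.std()` returns a Series of standard deviations, and the dtype printed at the end describes that Series rather than the columns of the frame. Checking `df.dtypes` is the more direct way to see how the columns themselves are stored.

```python
import pandas as pd

# Hypothetical integer-coded frame standing in for df_train.
df = pd.DataFrame({"Income": [0, 1, 1], "Age": [1, 2, 4]}, dtype="int64")

# dtypes of the columns themselves
column_dtypes = df.dtypes
print(column_dtypes)

# std() yields one float per column, so the resulting Series is float64
stds = df.std()
print(stds.dtype)
```

So a `float64` at the end of the `std()` output does not by itself show that the training columns were read as floats.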

+0

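I would prefer to start with the labelled data... because in the unlabelled data the first row is being assigned as the header; look at your version of the test data without labels. Can you paste the error you get when you try the labelled test data?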

+0

Hi, yes, I have added the error message to the question. Take a look.

+0

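This error occurs because in the unlabelled test set the first row is being read as the column header... Can you try the append with the labelled test set and let us know the error? Also, when you load the labelled test set, please check that 'header = True' is present.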
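The header behaviour described in these comments can be reproduced with a small sketch. The CSV text and the column list below are made up for illustration; only `pd.read_csv` and its `header` and `names` parameters are real pandas API. As far as I know, `header` expects a row number or `None` rather than a boolean, so `header=0` (or simply the default) would be the spelling for a labelled file, and `header=None` for an unlabelled one.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with no header line, mimicking the unlabelled test file.
raw = "0,1,4,7,4\n0,1,4,9,2\n"

# Default behaviour: the first data row is consumed as the column names,
# and the repeated value 4 is renamed to "4.1" to keep the names unique.
with_default = pd.read_csv(io.StringIO(raw))
print(list(with_default.columns))

# header=None keeps every row as data; names= supplies the labels explicitly.
cols = ["Income", "Age", "Workclass", "Education", "Marital_Status"]
with_names = pd.read_csv(io.StringIO(raw), header=None, names=cols)
print(len(with_default), len(with_names))
```

This would explain the odd `4.1`, `0.1`, `0.2` column names in the unlabelled test data shown in the question: they are the first row's values, de-duplicated by pandas.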

Answers

1

I think I've got it... Instead of the code below:

df_train.append(df_test[row]) 
print("row ", row, " appended") 

rewrite it as:

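# .iloc[row] selects a row by position; df_test[row] looks up a column named `row`
# append returns a new DataFrame, so the result must be assigned back
df_train = df_train.append(df_test.iloc[row])
# drop=True discards the old index instead of inserting it as a new first column
df_train = df_train.reset_index(drop=True)
print("row ", row, " appended")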

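Let me know if this serves the purpose... resetting the index each time is fairly important... Just one thing: if your test set is fairly large, this will be a computational disaster, because you retrain for every data point seen in the test set...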
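To illustrate why `df_test[row]` raises `KeyError: 0`, here is a minimal sketch with small made-up frames (the data is hypothetical). It uses `pd.concat` rather than `DataFrame.append`, since `append` was removed in pandas 2.0; on the older pandas version shown in the traceback either approach should work.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test frames in the question.
df_train = pd.DataFrame({"Income": [0, 1], "Age": [1, 1], "Sex": [0, 0]})
df_test = pd.DataFrame({"Income": [1, 0], "Age": [2, 1], "Sex": [1, 0]})

# df_test[0] looks for a COLUMN labelled 0, which does not exist here.
try:
    df_test[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# df_test.iloc[[0]] selects the first ROW by position and keeps it a DataFrame.
first_row = df_test.iloc[[0]]

# pd.concat returns a new frame, so the result has to be assigned;
# ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat([df_train, first_row], ignore_index=True)

print(lookup_failed)   # True
print(combined.shape)  # (3, 3)
```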

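One piece of advice beyond the immediate question: if you really do want to train it in near real time, try using batches or chunks of the test set...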
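The chunking idea might look roughly like the sketch below. The helper name `refit_in_chunks`, its parameters, and the toy data are all assumptions made for illustration; the `fit` argument is a placeholder for whatever model-fitting call is used. Results are collected with `list.append`, because assigning to `results[i]` on an empty list raises an `IndexError`.

```python
import pandas as pd

def refit_in_chunks(df_train, df_test, fit, chunk_size=1000):
    """Grow the training frame one chunk at a time and refit after each chunk.

    `fit` is any callable that takes a DataFrame and returns a fitted result.
    """
    results = []
    current = df_train
    for start in range(0, len(df_test), chunk_size):
        # Take the next block of test rows by position.
        chunk = df_test.iloc[start:start + chunk_size]
        current = pd.concat([current, chunk], ignore_index=True)
        results.append(fit(current))
    return current, results

# Toy demonstration: the "model" here is just the row count of the frame.
train = pd.DataFrame({"Income": [0, 1, 0], "Age": [1, 2, 1]})
test = pd.DataFrame({"Income": [1, 0, 1, 0, 1], "Age": [2, 1, 2, 1, 2]})
final, sizes = refit_in_chunks(train, test, fit=len, chunk_size=2)
print(sizes)  # [5, 7, 8]
```

With statsmodels, `fit` could be something like `lambda df: sm.Logit(df['Income'], df[df.columns[1:]]).fit()`, mirroring the call in the question, so that 15060 test rows in chunks of 1000 would mean 16 refits instead of 15060.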