2016-10-10 55 views

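Python: convert string to categorical - numpy

I am desperately trying to convert my string variables day and car2 in the following dataset to categorical.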

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 23653 entries, 0 to 23652 
Data columns (total 7 columns): 
day    23653 non-null object 
clustDep   23653 non-null int64 
clustArr   23653 non-null int64 
car2    23653 non-null object 
clustRoute  23653 non-null int64 
scheduled_seg 23653 non-null int64 
delayed   23653 non-null int64 
dtypes: int64(5), object(2) 
memory usage: 1.4+ MB 
None 

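I have tried everything I could find on SO, as you can see in the code sample below. I am running Python 2.7 and numpy 1.11.1. I also tried scikits.tools.categorical, but to no avail: the namespace will not load. Here is my code: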

import numpy as np 
#from scikits.statsmodels import sm 

trainId = np.random.choice(range(df.shape[0]), size=int(df.shape[0]*0.8), replace=False) 
train = df[['day', 'clustDep', 'clustArr', 'car2', 'clustRoute', 'scheduled_seg', 'delayed']] 

#for col in ['day', 'car2', 'scheduled_seg']: 
# train[col] = train.loc[:, col].astype('category') 

train['day'] = train['day'].astype('category') 
#train['day'] = sm.tools.categorical(train, cols='day', drop=True) 
#train['car2C'] = train['car2'].astype('category') 
#train['scheduled_segC'] = train['scheduled_seg'].astype('category') 


train = df.loc[trainId, train.columns] 
testId = np.in1d(df.index.values, trainId, invert=True) 
test = df.loc[testId, train.columns] 


#from sklearn import tree 
#clf = tree.DecisionTreeClassifier() 
#clf = clf.fit(train.drop(['delayed'], axis=1), train['delayed']) 

This produces the following error:

/Users/air/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame. 
Try using .loc[row_indexer,col_indexer] = value instead 

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy 

Any help would be greatly appreciated. Thanks a lot!

--- UPDATE --- Sample data:

   day clustDep clustArr car2 clustRoute scheduled_seg delayed 
0 Saturday  12  15 AA   1    5  1 
1 Tuesday  12  15 AA   1    1  1 
2 Tuesday  12  15 AA   1    5  1 
3 Saturday  12  13 AA   4    3  1 
4 Saturday   2  13 AB   6    3  1 
5 Wednesday   2  13 IB   6    3  1 
6  Monday   2  13 EY   6    3  0 
7  Friday   2  13 EY   6    3  1 
8 Saturday  11  13 AC   6    5  1 
9  Friday  11  13 DL   6    5  1 

Can you provide some sample data? For example: 'print(train.sample(n = 10))' – MaxU

Sure, please see the updated question. Thanks! –

Answer

It works fine for me (pandas 0.19.0):

In [155]: train 
Out[155]: 
     day clustDep clustArr car2 clustRoute scheduled_seg delayed 
0 Saturday  12  15 AA   1    5  1 
1 Tuesday  12  15 AA   1    1  1 
2 Tuesday  12  15 AA   1    5  1 
3 Saturday  12  13 AA   4    3  1 
4 Saturday   2  13 AB   6    3  1 
5 Wednesday   2  13 IB   6    3  1 
6  Monday   2  13 EY   6    3  0 
7  Friday   2  13 EY   6    3  1 
8 Saturday  11  13 AC   6    5  1 
9  Friday  11  13 DL   6    5  1 

In [156]: train.info() 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 10 entries, 0 to 9 
Data columns (total 7 columns): 
day    10 non-null object 
clustDep   10 non-null int64 
clustArr   10 non-null int64 
car2    10 non-null object 
clustRoute  10 non-null int64 
scheduled_seg 10 non-null int64 
delayed   10 non-null int64 
dtypes: int64(5), object(2) 
memory usage: 640.0+ bytes 

In [157]: train.day = train.day.astype('category') 

In [158]: train.car2 = train.car2.astype('category') 

In [159]: train.info() 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 10 entries, 0 to 9 
Data columns (total 7 columns): 
day    10 non-null category 
clustDep   10 non-null int64 
clustArr   10 non-null int64 
car2    10 non-null category 
clustRoute  10 non-null int64 
scheduled_seg 10 non-null int64 
delayed   10 non-null int64 
dtypes: category(2), int64(5) 
memory usage: 588.0 bytes 
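
The warning in the question most likely comes from how train is built rather than from astype itself: df[[...]] returns a new frame that pandas tracks as derived from df, and assigning a column to it is what raises SettingWithCopyWarning. A minimal sketch of one way around it, assuming a small hypothetical df built from the first five sample rows (and only a subset of the columns), is to take an explicit copy before converting:

```python
import pandas as pd

# Hypothetical stand-in for the asker's df: first five sample rows,
# with only some of the columns, for illustration.
df = pd.DataFrame({
    'day': ['Saturday', 'Tuesday', 'Tuesday', 'Saturday', 'Saturday'],
    'car2': ['AA', 'AA', 'AA', 'AA', 'AB'],
    'clustDep': [12, 12, 12, 12, 2],
    'delayed': [1, 1, 1, 1, 1],
})

# .copy() makes train an independent frame, so the assignments below
# modify train itself and do not raise SettingWithCopyWarning.
train = df[['day', 'car2', 'clustDep', 'delayed']].copy()

for col in ['day', 'car2']:
    train[col] = train[col].astype('category')

print(train.dtypes)
```

Only train is converted here; df keeps its original string columns.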

That's interesting. I have pandas 0.18.1; I will update once I am back online. I am running OS X, could that be the reason? –

Which line of code produces this error message? – MaxU

Line 11, where I try to coerce train['''] to categorical –
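
Note that the question's code reassigns train from df after the conversion, so the categorical dtype would be lost there even without the warning. A rough sketch of an alternative, assuming it is acceptable to convert the columns on df itself before splitting (the small df and the names train_id and rng are hypothetical, for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, using the ten sample rows
# and a subset of the columns.
df = pd.DataFrame({
    'day': ['Saturday', 'Tuesday', 'Tuesday', 'Saturday', 'Saturday',
            'Wednesday', 'Monday', 'Friday', 'Saturday', 'Friday'],
    'car2': ['AA', 'AA', 'AA', 'AA', 'AB', 'IB', 'EY', 'EY', 'AC', 'DL'],
    'delayed': [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
})

# Convert on the original frame, so every later selection keeps the dtype.
for col in ['day', 'car2']:
    df[col] = df[col].astype('category')

# 80/20 split by index label; the seed is fixed only for repeatability.
rng = np.random.RandomState(0)
train_id = rng.choice(df.index.values, size=int(len(df) * 0.8), replace=False)

train = df.loc[train_id]
test = df.loc[~df.index.isin(train_id)]

print(train.dtypes)
print(test.dtypes)
```

Because the conversion happens before the split, train and test share the same set of categories, which tends to matter when both are later encoded for a model.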