Python: converting strings to categorical - numpy

I am desperately trying to change my string variables day and car2 in the dataset below.

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 23653 entries, 0 to 23652 
Data columns (total 7 columns): 
day    23653 non-null object 
clustDep   23653 non-null int64 
clustArr   23653 non-null int64 
car2    23653 non-null object 
clustRoute  23653 non-null int64 
scheduled_seg 23653 non-null int64 
delayed   23653 non-null int64 
dtypes: int64(5), object(2) 
memory usage: 1.4+ MB 
None

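I have tried everything I could find on SO, as you can see in the code sample below. I am running Python 2.7 and numpy 1.11.1. I tried scikits.tools.categorical, but to no avail; it won't even load the namespace. Here is my code: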

import numpy as np 
#from scikits.statsmodels import sm 

trainId = np.random.choice(range(df.shape[0]), size=int(df.shape[0]*0.8), replace=False) 
train = df[['day', 'clustDep', 'clustArr', 'car2', 'clustRoute', 'scheduled_seg', 'delayed']] 

#for col in ['day', 'car2', 'scheduled_seg']: 
# train[col] = train.loc[:, col].astype('category') 

train['day'] = train['day'].astype('category') 
#train['day'] = sm.tools.categorical(train, cols='day', drop=True) 
#train['car2C'] = train['car2'].astype('category') 
#train['scheduled_segC'] = train['scheduled_seg'].astype('category') 


train = df.loc[trainId, train.columns] 
testId = np.in1d(df.index.values, trainId, invert=True) 
test = df.loc[testId, train.columns] 


#from sklearn import tree 
#clf = tree.DecisionTreeClassifier() 
#clf = clf.fit(train.drop(['delayed'], axis=1), train['delayed'])

This produces the following error:

/Users/air/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame. 
Try using .loc[row_indexer,col_indexer] = value instead 

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Any help would be greatly appreciated. Thanks a lot!

--- UPDATE --- Sample data:

         day  clustDep  clustArr  car2  clustRoute  scheduled_seg  delayed
0   Saturday        12        15    AA           1              5        1
1    Tuesday        12        15    AA           1              1        1
2    Tuesday        12        15    AA           1              5        1
3   Saturday        12        13    AA           4              3        1
4   Saturday         2        13    AB           6              3        1
5  Wednesday         2        13    IB           6              3        1
6     Monday         2        13    EY           6              3        0
7     Friday         2        13    EY           6              3        1
8   Saturday        11        13    AC           6              5        1
9     Friday        11        13    DL           6              5        1

Source

2016-10-10 Jan Sila

Can you provide some sample data? For example: 'print(train.sample(n=10))' – MaxU

Sure, please see the updated question. Thanks! –

It works fine for me (pandas 0.19.0):

In [155]: train 
Out[155]: 
         day  clustDep  clustArr  car2  clustRoute  scheduled_seg  delayed
0   Saturday        12        15    AA           1              5        1
1    Tuesday        12        15    AA           1              1        1
2    Tuesday        12        15    AA           1              5        1
3   Saturday        12        13    AA           4              3        1
4   Saturday         2        13    AB           6              3        1
5  Wednesday         2        13    IB           6              3        1
6     Monday         2        13    EY           6              3        0
7     Friday         2        13    EY           6              3        1
8   Saturday        11        13    AC           6              5        1
9     Friday        11        13    DL           6              5        1

In [156]: train.info() 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 10 entries, 0 to 9 
Data columns (total 7 columns): 
day    10 non-null object 
clustDep   10 non-null int64 
clustArr   10 non-null int64 
car2    10 non-null object 
clustRoute  10 non-null int64 
scheduled_seg 10 non-null int64 
delayed   10 non-null int64 
dtypes: int64(5), object(2) 
memory usage: 640.0+ bytes 

In [157]: train.day = train.day.astype('category') 

In [158]: train.car2 = train.car2.astype('category') 

In [159]: train.info() 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 10 entries, 0 to 9 
Data columns (total 7 columns): 
day    10 non-null category 
clustDep   10 non-null int64 
clustArr   10 non-null int64 
car2    10 non-null category 
clustRoute  10 non-null int64 
scheduled_seg 10 non-null int64 
delayed   10 non-null int64 
dtypes: category(2), int64(5) 
memory usage: 588.0 bytes

Source

2016-10-10 18:36:23 MaxU

That's interesting. I have pandas 0.18.1; I'll update once I'm back online. I'm running OS X, could that be the reason? –

Which line of code produces this error message? – MaxU

Line 11, where I try to coerce train['day'] to categorical –
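
One possible way around the warning, sketched under two assumptions: that the SettingWithCopyWarning comes from assigning into `train` while it is still a column selection of `df`, and that the conversion is meant to survive the train/test split. In the question's code, `train` is rebuilt from `df` after the `astype('category')` line, so the converted column would be discarded either way. The small frame below is made-up stand-in data modelled on the sample rows, and the names `data`, `rng` and `train_idx` are illustrative, not taken from the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's df (a subset of the sample columns).
df = pd.DataFrame({
    "day": ["Saturday", "Tuesday", "Tuesday", "Saturday", "Saturday",
            "Wednesday", "Monday", "Friday", "Saturday", "Friday"],
    "clustDep": [12, 12, 12, 12, 2, 2, 2, 2, 11, 11],
    "car2": ["AA", "AA", "AA", "AA", "AB", "IB", "EY", "EY", "AC", "DL"],
    "delayed": [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
})

# Work on an explicit copy so the column assignment is unambiguous.
data = df.copy()
for col in ["day", "car2"]:
    data[col] = data[col].astype("category")

# Split only after the conversion, so both parts keep the category dtype.
rng = np.random.RandomState(0)
train_idx = rng.choice(data.index.values, size=int(len(data) * 0.8), replace=False)
train = data.loc[train_idx]
test = data.drop(train_idx)

print(train.dtypes)
```

Converting before the split also means `train` and `test` share the same set of categories, which may matter if the columns are later encoded for a model.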
