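Scikit-learn pipeline with the same data and steps fails to classify

I have float vectors created by the doc2vec algorithm, along with their labels. When I feed them to a simple classifier, it works fine and gives the expected accuracy. The working code is below: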

from sklearn.svm import LinearSVC 
import pandas as pd 
import numpy as np 

# Assumed to already exist:
# train_vecs: ndarray of shape (20418, 100),
#   e.g. [[0.3244, 0.3232, -0.5454, 1.4543, ...], ...]
# y_train: training labels
# test_vecs: ndarray of shape (6885, 100)
# y_test: test labels

classifier = LinearSVC() 
classifier.fit(train_vecs, y_train) 
print('Test Accuracy: %.2f'%classifier.score(test_vecs, y_test)) 

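Now I want to move this into a pipeline, because in the future I plan to build a FeatureUnion with different features. What I did was move the vectors into a dataframe, then use 2 custom transformers to i) select the column and ii) change the array type. Strangely, the exact same data, with exactly the same shape, dtype and type, gives an accuracy of 0.0005. It makes no sense to me at all; it should give almost the same accuracy. After the ArrayCaster transformer, the shape and type of the input are exactly the same as before. The whole thing has been really frustrating.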

from sklearn.svm import LinearSVC 
import pandas as pd 
import numpy as np 
from sklearn.pipeline import Pipeline 
from sklearn.base import BaseEstimator, TransformerMixin 

# transformer that picks a column from the dataframe 
class ItemSelector(BaseEstimator, TransformerMixin):

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        selected = X[self.column]
        print('item selector type', type(selected))
        print('item selector shape', len(selected))
        print('item selector dtype', selected.dtype)
        return selected

# transformer that converts the series into an ndarray
class ArrayCaster(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, data):
        arr = np.array(data.tolist())
        print('array caster type', type(arr))
        print('array caster shape', arr.shape)
        print('array caster dtype', arr.dtype)
        return arr


# Assumed to already exist:
# train_vecs: ndarray of shape (20418, 100)
# y_train: training labels
# test_vecs: ndarray of shape (6885, 100)
# y_test: test labels

train['vecs'] = pd.Series(train_vecs.tolist()) 
val['vecs'] = pd.Series(test_vecs.tolist()) 


classifier = Pipeline([ 
      ('selector', ItemSelector(column='vecs')), 
      ('array', ArrayCaster()), 
      ('clf',LinearSVC())]) 

classifier.fit(train, y_train) 
print('Test Accuracy: %.2f'%classifier.score(val, y_test))

Answer

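Sorry about that, I figured it out. The error was quite annoying to notice. All I had to do was cast them to a list and place that in the dataframe, instead of converting them to a Series. Change this: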

train['vecs'] = pd.Series(train_vecs.tolist()) 
val['vecs'] = pd.Series(test_vecs.tolist()) 

to:

train['vecs'] = list(train_vecs) 
val['vecs'] = list(test_vecs)
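
A likely explanation for why this helps, assuming the `train` and `val` dataframes do not have a default 0..n-1 index (for example after a shuffled split): when a Series is assigned to a column, pandas aligns it on index labels, while a plain list is assigned by position. The shape and dtype come out identical either way, but the vectors end up in different rows than their labels. The small sketch below uses made-up data (`vecs` and `frame` are hypothetical stand-ins, not the original variables) to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data: 4 vectors of length 3.
vecs = np.arange(12, dtype=float).reshape(4, 3)

# A frame whose index is not 0..n-1, as can happen after a shuffled split.
frame = pd.DataFrame({"label": list("abcd")}, index=[2, 0, 3, 1])

# Assigning a Series aligns on index labels, so the rows get reordered.
frame["as_series"] = pd.Series(vecs.tolist())
# Assigning a list is positional, so row i receives vecs[i].
frame["as_list"] = list(vecs)

series_matrix = np.array(frame["as_series"].tolist())
list_matrix = np.array(frame["as_list"].tolist())

print(series_matrix.shape == list_matrix.shape)  # True: same shape either way
print(np.array_equal(list_matrix, vecs))         # True: order preserved
print(np.array_equal(series_matrix, vecs))       # False: rows were shuffled
```

If the dataframe index had labels missing from the Series, those rows would be filled with NaN instead. Comparing the values with `np.array_equal`, not just shape and dtype, is a quick way to check whether the pipeline really sees the same data.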