3

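Python scikit-learn: multilabel classification ValueError: could not convert string to float

I am trying to do multilabel classification with scikit-learn 0.17. My data looks like this: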

Training

Col1     Col2 
asd dfgfg    [1,2,3] 
poioi oiopiop   [4] 

Test

Col1      
asdas gwergwger  
rgrgh hrhrh 

My code so far:

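import pickle  # needed for pickle.load below

import numpy as np
from sklearn import svm, datasets
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
# sklearn.cross_validation is the 0.17 module; later releases moved
# train_test_split to sklearn.model_selection
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier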

def getLabels(): 
    traindf = pickle.load(open("train.pkl","rb")) 
    X = traindf['Col1'] 
    y = traindf['Col2'] 

    # Binarize the output 
    from sklearn.preprocessing import MultiLabelBinarizer 
    y=MultiLabelBinarizer().fit_transform(y)  

    random_state = np.random.RandomState(0) 


    # Split into training and test 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, 
                 random_state=random_state) 

    # Run classifier 
    from sklearn import svm, datasets 
    classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, 
            random_state=random_state)) 
    y_score = classifier.fit(X_train, y_train).decision_function(X_test) 

But now I get:

ValueError: could not convert string to float: <value of Col1 here> 

y_score = classifier.fit(X_train, y_train).decision_function(X_test) 

Do I have to binarize X as well? Why do I need to convert the X dimension to float?


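"You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead." Did you see that? –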


How do I convert my labels to a binary array? – AbtPst


Isn't [this](http://stackoverflow.com/questions/34213199/python-scikit-learn-multilabe-classification-valueerror-you-appear-to-be-usin) the same problem? – erip

Answer

4

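Yes, you have to convert X to a numeric representation (not necessarily binary), and y as well. This is because machine learning methods operate on matrices of numbers, so raw strings cannot be passed to the classifier directly.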
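On the y side, the `MultiLabelBinarizer` call in the question already produces such a numeric matrix. As a minimal sketch, assuming the two label lists from the sample training table above (the variable name `col2` is only illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Label lists copied from the Col2 column of the sample training data.
col2 = [[1, 2, 3], [4]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(col2)  # one column per distinct label

print(mlb.classes_)
# [1 2 3 4]
print(y)
# [[1 1 1 0]
#  [0 0 0 1]]
```

Each row is one sample and each column marks whether that label is present, which is the binary array form the classifier expects.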

How exactly do you convert X? If each sample in Col1 can contain different words (i.e. it represents some text), you can transform that column with CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer 

col1 = ["cherry banana", "apple appricote", "cherry apple", "banana apple appricote cherry apple"] 

cv = CountVectorizer() 
cv.fit_transform(col1) 
#<4x4 sparse matrix of type '<class 'numpy.int64'>' 
# with 10 stored elements in Compressed Sparse Row format> 

cv.fit_transform(col1).toarray() 
#array([[0, 0, 1, 1], 
#  [1, 1, 0, 0], 
#  [1, 0, 0, 1], 
#  [2, 1, 1, 1]], dtype=int64) 
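Putting the two conversions together, the vectorizer can be chained in front of the classifier so that raw strings are turned into counts before they reach the SVM. The following is only a sketch under assumptions: the label lists are invented toy data, and the use of `make_pipeline` is one possible arrangement rather than the approach from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Toy stand-ins for Col1 (text) and Col2 (label lists); the labels are made up.
texts = ["cherry banana", "apple appricote", "cherry apple",
         "banana apple appricote cherry apple"]
labels = [[1, 2], [3], [1, 3], [1, 2, 3]]

# y: convert the label lists to a binary indicator matrix.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# X: CountVectorizer converts each string to word counts inside the pipeline,
# so the classifier only ever sees numbers.
model = make_pipeline(CountVectorizer(),
                      OneVsRestClassifier(SVC(kernel="linear")))
model.fit(texts, y)

new_texts = ["cherry apple"]
scores = model.decision_function(new_texts)  # one score per label
predicted = mlb.inverse_transform(model.predict(new_texts))
print(scores.shape, predicted)
```

Fitting the vectorizer inside the pipeline also means the same vocabulary learned from the training text is reused when transforming the test text.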

Thanks! Exactly right – AbtPst
