Python scikit-learn: multilabel classification ValueError: could not convert string to float:

I am trying to do multilabel classification with scikit-learn 0.17. My data looks like this:

Training

Col1     Col2 
asd dfgfg    [1,2,3] 
poioi oiopiop   [4]

Test

Col1      
asdas gwergwger  
rgrgh hrhrh

My code so far:

import pickle 
import numpy as np 
from sklearn import svm, datasets 
from sklearn.metrics import precision_recall_curve 
from sklearn.metrics import average_precision_score 
from sklearn.cross_validation import train_test_split 
from sklearn.preprocessing import label_binarize 
from sklearn.multiclass import OneVsRestClassifier 

def getLabels(): 
    traindf = pickle.load(open("train.pkl","rb")) 
    X = traindf['Col1'] 
    y = traindf['Col2'] 

    # Binarize the output 
    from sklearn.preprocessing import MultiLabelBinarizer 
    y=MultiLabelBinarizer().fit_transform(y)  

    random_state = np.random.RandomState(0) 


    # Split into training and test 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, 
                 random_state=random_state) 

    # Run classifier 
    from sklearn import svm, datasets 
    classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, 
            random_state=random_state)) 
    y_score = classifier.fit(X_train, y_train).decision_function(X_test)

But now I get

ValueError: could not convert string to float: <value of Col1 here>

on the line

y_score = classifier.fit(X_train, y_train).decision_function(X_test)

Do I have to binarize X as well? Why do I need to convert the X dimension to float?

Source

2015-12-14 AbtPst

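"You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead." - did you see this? –

How do I convert my labels to a binary array? – AbtPst

Isn't [this](http://stackoverflow.com/questions/34213199/python-scikit-learn-multilabe-classification-valueerror-you-appear-to-be-usin) the same problem? – erip

Yes, you have to convert X to a numeric representation (not necessarily binary), as well as y. This is because all machine learning methods operate on numeric matrices.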
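To see where the error comes from, here is a minimal sketch that reproduces the same kind of failure outside scikit-learn. It assumes the estimator tries to cast its input to a float array internally; the sample strings are copied from the question's training data, and the names `raw_text` and `conversion_error` are only illustrative.

```python
import numpy as np

# Raw strings shaped like the values in Col1.
raw_text = np.array(["asd dfgfg", "poioi oiopiop"])

try:
    # Roughly what happens when text is handed to a numeric estimator:
    # the strings cannot be parsed as floating-point numbers.
    raw_text.astype(float)
    conversion_error = None
except ValueError as exc:
    conversion_error = str(exc)

print(conversion_error)
```

The cast fails on the first string it cannot parse, which is why the message quotes a value from Col1.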

How exactly do you do this? If each sample in Col1 can contain different words (i.e. it represents some text), you can transform that column with CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer 

col1 = ["cherry banana", "apple appricote", "cherry apple", "banana apple appricote cherry apple"] 

cv = CountVectorizer() 
cv.fit_transform(col1) 
#<4x4 sparse matrix of type '<class 'numpy.int64'>' 
# with 10 stored elements in Compressed Sparse Row format> 

cv.fit_transform(col1).toarray() 
#array([[0, 0, 1, 1], 
#  [1, 1, 0, 0], 
#  [1, 0, 0, 1], 
#  [2, 1, 1, 1]], dtype=int64)
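Putting the pieces together, the following is a rough sketch of how the vectorizer could be combined with the classifier from the question. The toy `texts` and `labels` are made up for illustration and stand in for Col1 and Col2; wrapping the steps with `make_pipeline` is one possible arrangement, not the only one, and `probability=True` is left out because it is not needed for `decision_function`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Hypothetical toy data standing in for Col1 (text) and Col2 (label lists).
texts = [
    "cherry banana",
    "apple apricot",
    "cherry apple",
    "banana apple apricot cherry apple",
]
labels = [[1, 2], [3], [1, 3], [2, 3]]

# y: one binary indicator column per distinct label.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# X: the pipeline turns raw text into word counts before the SVMs see it.
model = make_pipeline(
    CountVectorizer(),
    OneVsRestClassifier(SVC(kernel="linear", random_state=0)),
)
model.fit(texts, y)

# New text goes through the same fitted vocabulary automatically.
scores = model.decision_function(["cherry apple banana"])
print(mlb.classes_, scores.shape)
```

Keeping the vectorizer inside the pipeline means it is fitted only on the training text, and test text is transformed with the same vocabulary.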

Source

2015-12-16 06:46:08

Thanks! That is exactly right – AbtPst
