How do I vectorize labeled bigrams with scikit-learn?

I am teaching myself scikit-learn, and I decided to start a second task using my own corpus. I extracted some bigrams by hand, let's say:

training_data = [[('this', 'is'), ('is', 'a'),('a', 'text'), 'POS'], 
[('and', 'one'), ('one', 'more'), 'NEG'], 
[('and', 'other'), ('one', 'more'), 'NEU']]

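I would like to vectorize them into a format that can be fed to one of the classification algorithms scikit-learn provides (SVC, multinomial naive Bayes, etc.). This is what I tried: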

from sklearn.feature_extraction.text import CountVectorizer 

count_vect = CountVectorizer(analyzer='word') 

X = count_vect.transform(((' '.join(x) for x in sample) 
        for sample in training_data)) 

print X.toarray()

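The problem is that I don't know how to deal with the labels (i.e. 'POS', 'NEG', 'NEU'). Do I need to "vectorize" the labels as well in order to pass training_data to a classification algorithm, or can I leave them as 'POS' or any other kind of string? The other problem is that I get this: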

raise ValueError("Vocabulary wasn't fitted or is empty!") 
ValueError: Vocabulary wasn't fitted or is empty!

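So, how can I vectorize bigrams like the ones in training_data? I have also read about DictVectorizer and sklearn-pandas; do you think using them would be a better way to approach this task?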

2014-12-13 tumbleweed

It should look like this:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> training_data = [[('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
...                  [('and', 'one'), ('one', 'more'), 'NEG'],
...                  [('and', 'other'), ('one', 'more'), 'NEU']]
>>> # Identity preprocessor and tokenizer: the bigram tuples are used as tokens as-is
>>> count_vect = CountVectorizer(preprocessor=lambda x: x,
...                              tokenizer=lambda x: x)
>>> # doc[:-1] drops the trailing label from each sample
>>> X = count_vect.fit_transform(doc[:-1] for doc in training_data)

>>> print(count_vect.vocabulary_)
{('and', 'one'): 1, ('a', 'text'): 0, ('is', 'a'): 3, ('and', 'other'): 2, ('this', 'is'): 5, ('one', 'more'): 4}
>>> print(X.toarray())
[[1 0 0 1 0 1]
 [0 1 0 0 1 0]
 [0 0 1 0 1 0]]
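
As a variation on the call above, `CountVectorizer` also accepts a callable `analyzer`, which receives each raw document and returns its features. The sketch below assumes that setup (the helper name `identity_analyzer` is made up for illustration). It also shows the likely cause of the `ValueError` in the question: `transform` was called on a vectorizer that had never been fitted.

```python
from sklearn.feature_extraction.text import CountVectorizer

training_data = [
    [('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
    [('and', 'one'), ('one', 'more'), 'NEG'],
    [('and', 'other'), ('one', 'more'), 'NEU'],
]

# Each document is the list of bigram tuples, without the trailing label.
docs = [sample[:-1] for sample in training_data]


def identity_analyzer(doc):
    """Hypothetical helper: return the bigram tuples unchanged as features."""
    return doc


count_vect = CountVectorizer(analyzer=identity_analyzer)

# Calling transform() before fit() fails: no vocabulary has been learned yet.
try:
    count_vect.transform(docs)
    failed_before_fit = False
except ValueError:
    failed_before_fit = True

# fit_transform() learns the vocabulary and builds the count matrix in one step.
X = count_vect.fit_transform(docs)
dense = X.toarray().tolist()

# After fitting, transform() works on new documents; unseen bigrams are ignored.
new_X = count_vect.transform([[('one', 'more'), ('never', 'seen')]])
new_dense = new_X.toarray().tolist()

print(failed_before_fit)
print(dense)
print(new_dense)
```

Columns follow the sorted vocabulary, so the matrix should match the output shown above.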

Then put your labels in a target variable:

y = [doc[-1] for doc in training_data] # ['POS', 'NEG', 'NEU']
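
On the question of whether the labels need to be "vectorized": as far as I know, scikit-learn classifiers accept string class labels directly, so `y` can stay as it is. If integer codes are wanted anyway, `LabelEncoder` is one option. A minimal sketch, assuming the same `training_data` as above:

```python
from sklearn.preprocessing import LabelEncoder

training_data = [
    [('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
    [('and', 'one'), ('one', 'more'), 'NEG'],
    [('and', 'other'), ('one', 'more'), 'NEU'],
]

# The label is the last element of each sample.
y = [sample[-1] for sample in training_data]

# Optional step: map each distinct string label to an integer code.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# inverse_transform() maps the integer codes back to the original strings.
decoded = encoder.inverse_transform(y_encoded)

print(list(encoder.classes_))
print(list(y_encoded))
print(list(decoded))
```

The classes are stored in sorted order, so here 'NEG', 'NEU' and 'POS' map to 0, 1 and 2.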

Now you can train a model:

from sklearn.svm import SVC

model = SVC() 
model.fit(X, y)
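
To see how the labels are used for learning and prediction, here is a sketch that chains the vectorizer and the classifier in a `Pipeline` and predicts labels for new bigram lists. The step names, the `identity_analyzer` helper and the `new_docs` examples are illustrative assumptions, and with only three training samples the predictions themselves are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

training_data = [
    [('this', 'is'), ('is', 'a'), ('a', 'text'), 'POS'],
    [('and', 'one'), ('one', 'more'), 'NEG'],
    [('and', 'other'), ('one', 'more'), 'NEU'],
]

# Split every sample into its bigrams (features) and its label (target).
docs = [sample[:-1] for sample in training_data]
y = [sample[-1] for sample in training_data]


def identity_analyzer(doc):
    """Hypothetical helper: return the bigram tuples unchanged as features."""
    return doc


# The pipeline vectorizes the bigrams, then passes the counts to the classifier.
model = Pipeline([
    ("vectorizer", CountVectorizer(analyzer=identity_analyzer)),
    ("classifier", SVC()),
])
model.fit(docs, y)

# New documents must have the same shape: a list of bigram tuples, no label.
new_docs = [[('this', 'is'), ('is', 'a')], [('one', 'more')]]
predicted = model.predict(new_docs)

print(list(model.named_steps["classifier"].classes_))
print(list(predicted))
```

The string labels are passed to `fit` unchanged, and `predict` returns labels of the same kind.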

2014-12-13 03:02:05 elyase

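That is actually how I have been assigning the labels. The problem is that I have a much larger list of bigrams, and it is not clear to me how scikit-learn uses the labels to learn and predict a result. Is there another Pythonic way to set the labels instead of doing it line by line? Thanks! – tumbleweed 2014-12-13 03:07:13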

Yes, I updated my answer and also fixed the 'CountVectorizer' call so that it does not preprocess or tokenize your bigrams. – elyase 2014-12-13 03:12:58

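Your code has a few small errors. I suggest you open a new question about the error you are getting now and the ones you are about to get (hint: compare your code with mine for the labels 'y'). – elyase 2014-12-13 14:36:33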