2017-04-15
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 

Hi, I have a list of comments as shown below. Why doesn't partial_fit work properly in the code below?

comments = ['I am very agry','this is not interesting','I am very happy'] 

And these are the corresponding labels:

sents = ['angry','indiferent','happy'] 

I vectorize these comments with TF-IDF as follows:

tfidf_vectorizer = TfidfVectorizer(analyzer='word') 
tfidf = tfidf_vectorizer.fit_transform(comments) 
from sklearn import preprocessing 

I use a LabelEncoder to vectorize the labels:

le = preprocessing.LabelEncoder() 
le.fit(sents) 
labels = le.transform(sents) 
print(labels.shape) 
import pickle
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
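As a side note (not part of the original question): LabelEncoder assigns integer codes in sorted order of the class strings, which explains the numeric labels that appear later on. A minimal sketch using the labels from this question:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['angry', 'indiferent', 'happy'])

# classes_ is stored sorted alphabetically, so the codes are:
# angry=0, happy=1, indiferent=2
print(list(le.classes_))
# ['angry', 'happy', 'indiferent']
print(list(le.transform(['angry', 'indiferent', 'happy'])))
# [0, 2, 1]
```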

Here I use the PassiveAggressiveClassifier to fit the model:

clf2 = PassiveAggressiveClassifier() 


with open('passive.pickle','wb') as idxf: 
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL) 

with open('passive.pickle', 'rb') as infile: 
    clf2 = pickle.load(infile) 

with open('tfidf_vectorizer.pickle', 'rb') as infile: 
    tfidf_vectorizer = pickle.load(infile) 
with open('tfidf.pickle', 'rb') as infile: 
    tfidf = pickle.load(infile) 

Here I want to test partial_fit using three new comments and their corresponding labels, as follows:

new_comments = ['I love the life','I hate you','this is not important'] 
new_labels = [1,0,2] 
vec_new_comments = tfidf_vectorizer.transform(new_comments) 

print(clf2.predict(vec_new_comments)) 
clf2.partial_fit(vec_new_comments, new_labels) 

The problem is that I do not get the correct results after the partial fit, as follows:

print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??') 
print(clf2.predict(vec_new_comments)) 

But I get this output instead:

[2 2 2] 

So I would really appreciate help in finding out why the model is not being updated when I test it with the very same examples it was just trained on; the expected output should be:

[1,0,2] 

I would also appreciate suggestions, perhaps on hyperparameters, to get the desired output.

This is the complete code to demonstrate the partial fit:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pickle
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random


comments = ['I am very agry','this is not interesting','I am very happy'] 
sents = ['angry','indiferent','happy'] 
tfidf_vectorizer = TfidfVectorizer(analyzer='word') 
tfidf = tfidf_vectorizer.fit_transform(comments) 
#print(tfidf.shape) 
from sklearn import preprocessing 
le = preprocessing.LabelEncoder() 
le.fit(sents) 
labels = le.transform(sents) 

from sklearn.linear_model import PassiveAggressiveClassifier 
from sklearn.model_selection import train_test_split 
with open('tfidf.pickle','wb') as idxf: 
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL) 
with open('tfidf_vectorizer.pickle','wb') as idxf: 
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL) 

clf2 = PassiveAggressiveClassifier() 

clf2.fit(tfidf, labels) 


with open('passive.pickle','wb') as idxf: 
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL) 

with open('passive.pickle', 'rb') as infile: 
    clf2 = pickle.load(infile) 



with open('tfidf_vectorizer.pickle', 'rb') as infile: 
    tfidf_vectorizer = pickle.load(infile) 
with open('tfidf.pickle', 'rb') as infile: 
    tfidf = pickle.load(infile) 

new_comments = ['I love the life','I hate you','this is not important'] 
new_labels = [1,0,2] 

vec_new_comments = tfidf_vectorizer.transform(new_comments) 

clf2.partial_fit(vec_new_comments, new_labels) 



print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??') 
print(clf2.predict(vec_new_comments)) 

However, I get:

AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2?? 
[2 2 2] 
How are you fitting 'clf2'? Please post the entire code as a single snippet. Right now it is very annoying to copy and paste it again and again. –

@VivekKumar I have updated the question and added the complete code to reproduce my problem, thanks for the support – neo33

Answer

Well, there are multiple issues with your code. I will start by stating the more obvious ones:

  1. You are pickling clf2 before it has learnt anything (i.e. you pickle it as soon as it is defined, at which point it serves no purpose). If you are only testing, that is fine. Otherwise the model should be pickled after the fit() call or an equivalent one.
  2. You are calling clf2.fit() before calling clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn about. In your case it is acceptable, because on your subsequent call to partial_fit() you give the same labels. But it is still not a good practice.

    See this for more details

    In a partial_fit() scenario, never call fit(). Always call partial_fit(), both with your starting data and with the new incoming data. But make sure you supply all the labels you want the model to learn to the classes parameter in the first call to partial_fit().

  3. Now for the last part, about your tfidf_vectorizer. You call fit_transform() (which is essentially fit() followed by transform()) on tfidf_vectorizer with the comments array. That means that on subsequent calls to transform() (as you do with transform(new_comments)), it will not learn new words from new_comments, but will only use the words it saw during the call to fit() (the words present in comments).

    The same goes for the LabelEncoder and sents.

    Again, this is not desirable in an online learning scenario. You should fit all the available data at once. But since you are trying to use partial_fit(), we can assume that you have a very large dataset that may not fit in memory all at once. So you would want to apply some sort of partial_fit to the TfidfVectorizer as well. But TfidfVectorizer does not support partial_fit(); in fact it is not designed for big data, so you need to change your approach. See the following questions for more details:
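To make point 3 concrete, here is a small sketch (using the strings from the question) showing that transform() simply ignores every word that was not seen during fit():

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(analyzer='word')
# The vocabulary is frozen after this call
vec.fit(['I am very agry', 'this is not interesting', 'I am very happy'])

# No word of this sentence is in the learnt vocabulary
# (the default tokenizer also drops single-character tokens like 'I'),
# so the resulting TF-IDF row is entirely zero
row = vec.transform(['I love the life']).toarray()[0]
print(row.sum())
# 0.0
```

An all-zero vector carries no information, which is one reason the classifier cannot separate the new comments well.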

All that aside, if you change just the tfidf part to fit on the whole data at once (comments plus new_comments), you will get your desired results.

See the changed code below (I may have organized it a little and renamed vec_new_comments to new_tfidf; please go through it carefully):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import preprocessing

comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']

new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()
clf2 = PassiveAggressiveClassifier()

# The below lines are important

# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)

# same for `sents`; since the labels don't change, it doesn't matter which you use, because the result will be the same
# le.fit(sents)
le.fit(sents + new_sents)

Below is the not-so-preferred code (the one you used, which I talked about in point 2), but the results will be fine as long as you make the changes above.

tfidf = tfidf_vectorizer.transform(comments) 
labels = le.transform(sents) 

clf2.fit(tfidf, labels) 
print(clf2.predict(tfidf)) 
# [0 2 1] 

new_tfidf = tfidf_vectorizer.transform(new_comments) 
new_labels = le.transform(new_sents) 

clf2.partial_fit(new_tfidf, new_labels) 
print(clf2.predict(new_tfidf)) 
# [1 0 2]  As you wanted 

The correct way, i.e. the way partial_fit() is meant to be used:

# Declare all labels that you want the model to learn 
# Using classes learnt by labelEncoder for this 
# In any calls to `partial_fit()`, all labels should be from this array only 

all_classes = le.transform(le.classes_) 

# Notice the parameter classes here 
# It needs to be present the first time
clf2.partial_fit(tfidf, labels, classes=all_classes) 
print(clf2.predict(tfidf)) 
# [0 2 1] 

# classes is not present here 
clf2.partial_fit(new_tfidf, new_labels) 
print(clf2.predict(new_tfidf)) 
# [1 0 2] 
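If a truly online text pipeline is needed, one common alternative (my suggestion, not something the answer above prescribes) is HashingVectorizer, which is stateless: it hashes tokens to a fixed-size feature space, so it never needs a fit() step and new batches can be vectorized as they arrive:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Stateless vectorizer: no vocabulary to fit, so no frozen-vocabulary problem
vec = HashingVectorizer(analyzer='word', n_features=2**18)
clf = PassiveAggressiveClassifier()

# First batch: classes must list every label the model will ever see
X1 = vec.transform(['I am very agry', 'this is not interesting', 'I am very happy'])
clf.partial_fit(X1, [0, 2, 1], classes=[0, 1, 2])

# Later batches need no re-fitting of the vectorizer
X2 = vec.transform(['I love the life', 'I hate you', 'this is not important'])
clf.partial_fit(X2, [1, 0, 2])
print(clf.predict(X2))
```

The trade-off is that hashing gives raw term counts rather than TF-IDF weights (a TfidfTransformer can be chained on top if IDF weighting matters).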
Thank you very much for the support, I finally overcame this situation – neo33
