2017-04-15
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 

Hi, I have a list of comments as shown below. Why doesn't partial_fit work properly in the code below?

comments = ['I am very agry','this is not interesting','I am very happy'] 

And these are the corresponding labels:

sents = ['angry','indiferent','happy'] 

I vectorize these comments with TF-IDF as follows:

tfidf_vectorizer = TfidfVectorizer(analyzer='word') 
tfidf = tfidf_vectorizer.fit_transform(comments) 
from sklearn import preprocessing 

I use a LabelEncoder to vectorize the labels:

le = preprocessing.LabelEncoder() 
le.fit(sents) 
labels = le.transform(sents) 
print(labels.shape) 
import pickle
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
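As a side note (not part of the original question): LabelEncoder assigns integer codes in sorted order of the class strings, which explains the numeric labels that appear later on. A minimal sketch using the labels from this question:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['angry', 'indiferent', 'happy'])

# classes_ is stored sorted alphabetically, so the codes are:
# angry=0, happy=1, indiferent=2
print(list(le.classes_))
# ['angry', 'happy', 'indiferent']
print(list(le.transform(['angry', 'indiferent', 'happy'])))
# [0, 2, 1]
```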

Here I use the PassiveAggressiveClassifier to fit the model:

clf2 = PassiveAggressiveClassifier() 


with open('passive.pickle','wb') as idxf: 
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL) 

with open('passive.pickle', 'rb') as infile: 
    clf2 = pickle.load(infile) 

with open('tfidf_vectorizer.pickle', 'rb') as infile: 
    tfidf_vectorizer = pickle.load(infile) 
with open('tfidf.pickle', 'rb') as infile: 
    tfidf = pickle.load(infile) 

Here I want to test partial_fit using three new comments and their corresponding labels, as follows:

new_comments = ['I love the life','I hate you','this is not important'] 
new_labels = [1,0,2] 
vec_new_comments = tfidf_vectorizer.transform(new_comments) 

print(clf2.predict(vec_new_comments)) 
clf2.partial_fit(vec_new_comments, new_labels) 

The problem is that I do not get the correct results after the partial fit, as follows:

print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??') 
print(clf2.predict(vec_new_comments)) 

But I get this output instead:

[2 2 2] 

So I would really appreciate help in finding out why the model is not being updated when I test it with the very same examples it was just trained on; the expected output should be:

[1,0,2] 

I would also appreciate suggestions, perhaps on hyperparameters, to get the desired output.

This is the complete code to demonstrate the partial fit:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pickle
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random


comments = ['I am very agry','this is not interesting','I am very happy'] 
sents = ['angry','indiferent','happy'] 
tfidf_vectorizer = TfidfVectorizer(analyzer='word') 
tfidf = tfidf_vectorizer.fit_transform(comments) 
#print(tfidf.shape) 
from sklearn import preprocessing 
le = preprocessing.LabelEncoder() 
le.fit(sents) 
labels = le.transform(sents) 

from sklearn.linear_model import PassiveAggressiveClassifier 
from sklearn.model_selection import train_test_split 
with open('tfidf.pickle','wb') as idxf: 
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL) 
with open('tfidf_vectorizer.pickle','wb') as idxf: 
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL) 

clf2 = PassiveAggressiveClassifier() 

clf2.fit(tfidf, labels) 


with open('passive.pickle','wb') as idxf: 
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL) 

with open('passive.pickle', 'rb') as infile: 
    clf2 = pickle.load(infile) 



with open('tfidf_vectorizer.pickle', 'rb') as infile: 
    tfidf_vectorizer = pickle.load(infile) 
with open('tfidf.pickle', 'rb') as infile: 
    tfidf = pickle.load(infile) 

new_comments = ['I love the life','I hate you','this is not important'] 
new_labels = [1,0,2] 

vec_new_comments = tfidf_vectorizer.transform(new_comments) 

clf2.partial_fit(vec_new_comments, new_labels) 



print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??') 
print(clf2.predict(vec_new_comments)) 

However, I get:

AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2?? 
[2 2 2] 
How are you fitting 'clf2'? Please post the entire code as a single snippet. Right now it is very annoying to copy and paste it again and again. –

@VivekKumar I have updated the question and added the complete code to reproduce my problem, thanks for the support – neo33

Answer

Well, there are multiple issues with your code. I will start by stating the more obvious ones:

  1. You are pickling clf2 before it has learnt anything (i.e. you pickle it as soon as it is defined, at which point it serves no purpose). If you are only testing, that is fine. Otherwise the model should be pickled after the fit() call or an equivalent one.
  2. You are calling clf2.fit() before calling clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn about. In your case it is acceptable, because on your subsequent call to partial_fit() you give the same labels. But it is still not a good practice.

    See this for more details

    In a partial_fit() scenario, never call fit(). Always call partial_fit(), both with your starting data and with the new incoming data. But make sure you supply all the labels you want the model to learn to the classes parameter in the first call to partial_fit().

  3. Now for the last part, about your tfidf_vectorizer. You call fit_transform() (which is essentially fit() followed by transform()) on tfidf_vectorizer with the comments array. That means that on subsequent calls to transform() (as you do with transform(new_comments)), it will not learn new words from new_comments, but will only use the words it saw during the call to fit() (the words present in comments).

    The same goes for the LabelEncoder and sents.

    Again, this is not desirable in an online learning scenario. You should fit all the available data at once. But since you are trying to use partial_fit(), we can assume that you have a very large dataset that may not fit in memory all at once. So you would want to apply some sort of partial_fit to the TfidfVectorizer as well. But TfidfVectorizer does not support partial_fit(); in fact it is not designed for big data, so you need to change your approach. See the following questions for more details:
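To make point 3 concrete, here is a small sketch (using the strings from the question) showing that transform() simply ignores every word that was not seen during fit():

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(analyzer='word')
# The vocabulary is frozen after this call
vec.fit(['I am very agry', 'this is not interesting', 'I am very happy'])

# No word of this sentence is in the learnt vocabulary
# (the default tokenizer also drops single-character tokens like 'I'),
# so the resulting TF-IDF row is entirely zero
row = vec.transform(['I love the life']).toarray()[0]
print(row.sum())
# 0.0
```

An all-zero vector carries no information, which is one reason the classifier cannot separate the new comments well.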

All that aside, if you change just the tfidf part to fit on the whole data at once (comments plus new_comments), you will get your desired results.

See the changed code below (I may have organized it a little and renamed vec_new_comments to new_tfidf; please go through it carefully):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import preprocessing

comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']

new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()
clf2 = PassiveAggressiveClassifier()

# The below lines are important

# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)

# same for `sents`; since the labels don't change, it doesn't matter which you use, because the result will be the same
# le.fit(sents)
le.fit(sents + new_sents)

Below is the not-so-preferred code (the one you used, which I talked about in point 2), but the results will be fine as long as you make the changes above.

tfidf = tfidf_vectorizer.transform(comments) 
labels = le.transform(sents) 

clf2.fit(tfidf, labels) 
print(clf2.predict(tfidf)) 
# [0 2 1] 

new_tfidf = tfidf_vectorizer.transform(new_comments) 
new_labels = le.transform(new_sents) 

clf2.partial_fit(new_tfidf, new_labels) 
print(clf2.predict(new_tfidf)) 
# [1 0 2]  As you wanted 

The correct way, i.e. the way partial_fit() is meant to be used:

# Declare all labels that you want the model to learn 
# Using classes learnt by labelEncoder for this 
# In any calls to `partial_fit()`, all labels should be from this array only 

all_classes = le.transform(le.classes_) 

# Notice the parameter classes here 
# It needs to be present the first time
clf2.partial_fit(tfidf, labels, classes=all_classes) 
print(clf2.predict(tfidf)) 
# [0 2 1] 

# classes is not present here 
clf2.partial_fit(new_tfidf, new_labels) 
print(clf2.predict(new_tfidf)) 
# [1 0 2] 
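If a truly online text pipeline is needed, one common alternative (my suggestion, not something the answer above prescribes) is HashingVectorizer, which is stateless: it hashes tokens to a fixed-size feature space, so it never needs a fit() step and new batches can be vectorized as they arrive:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Stateless vectorizer: no vocabulary to fit, so no frozen-vocabulary problem
vec = HashingVectorizer(analyzer='word', n_features=2**18)
clf = PassiveAggressiveClassifier()

# First batch: classes must list every label the model will ever see
X1 = vec.transform(['I am very agry', 'this is not interesting', 'I am very happy'])
clf.partial_fit(X1, [0, 2, 1], classes=[0, 1, 2])

# Later batches need no re-fitting of the vectorizer
X2 = vec.transform(['I love the life', 'I hate you', 'this is not important'])
clf.partial_fit(X2, [1, 0, 2])
print(clf.predict(X2))
```

The trade-off is that hashing gives raw term counts rather than TF-IDF weights (a TfidfTransformer can be chained on top if IDF weighting matters).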
Thank you very much for the support, I finally overcame this situation – neo33
