Classification with n-grams

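I want to use a sklearn classifier with n-gram features. In addition, I want to use cross-validation to find the best order of the n-grams. However, I am a bit stuck on how to fit all the pieces together.

Right now, I have the following code: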

import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import KFold 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 

text = ... # This is the input text. A list of strings 
labels = ... # These are the labels of each sentence 
# Find the optimal order of the ngrams by cross-validation 
scores = pd.Series(index=range(1,6), dtype=float) 
folds = KFold(n_splits=3) 

for n in range(1,6): 
    count_vect = CountVectorizer(ngram_range=(n,n), stop_words='english') 
    X = count_vect.fit_transform(text) 
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42) 
    clf = MultinomialNB() 
    score = cross_val_score(clf, X_train, y_train, cv=folds, n_jobs=-1) 
    scores.loc[n] = np.mean(score) 

# Evaluate the classifier using the best order found 
order = scores.idxmax() 
count_vect = CountVectorizer(ngram_range=(order,order), stop_words='english') 
X = count_vect.fit_transform(text) 
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42) 
clf = MultinomialNB() 
clf = clf.fit(X_train, y_train) 
acc = clf.score(X_test, y_test) 
print('Accuracy is {}'.format(acc))

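However, I feel this is the wrong way to do it, since I create a train-test split in every iteration of the loop.

If I do the train-test split beforehand and apply CountVectorizer to the two parts separately, the parts end up with different shapes, which causes problems when calling clf.fit and clf.score.

How can I solve this?

EDIT: If I try to build a vocabulary first, I still need to build several vocabularies, since the vocabulary for unigrams differs from the one for bigrams, and so on.

To give an example: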

import nltk

# Assumes each sentence in text is already tokenized into a list of words.

# unigram vocab
vocab = set()
for sentence in text:
    for word in sentence:
        vocab.add(word)  # set.add ignores duplicates, so no membership check is needed
len(vocab) # 47291

# bigram vocab
vocab = set()
for sentence in text:
    for bigram in nltk.ngrams(sentence, 2):
        vocab.add(bigram)
len(vocab) # 326044

This again leads to the same problem: I need to apply CountVectorizer separately for each n-gram size.

Source

2017-06-02 JNevens

Build the vocabulary first, from the training set. Nothing stops you from putting both unigrams and bigrams (and more) in the same dictionary. – alexis
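
As a rough illustration of the comment above, CountVectorizer can collect unigrams and bigrams into one vocabulary through its ngram_range parameter. This is only a sketch; the two-sentence corpus is made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
train_docs = ["the cat sat", "the dog sat"]

# ngram_range=(1, 2) extracts unigrams and bigrams into a single vocabulary.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_docs)

# vocabulary_ maps each term (unigram or bigram) to its column index.
print(sorted(vectorizer.vocabulary_))
print(X_train.shape)
```

Both "cat" and "the cat" end up as columns of the same feature matrix, so a single vectorizer covers several n-gram orders at once.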

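You need to set the vocabulary parameter first. One way or another you have to provide the entire vocabulary, otherwise the dimensions will never match (obviously). If you do the train/test split first, there may be words that occur in one set but not in the other, and you end up with mismatched dimensions.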
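
A related way to keep the dimensions aligned, offered here as a sketch rather than as what this answer prescribes, is to split the raw text first, fit the vectorizer on the training part only, and reuse it via transform on the test part. The toy documents and labels below are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only.
docs = ["good movie", "great film", "bad movie", "awful film",
        "good plot", "bad plot", "great acting", "awful acting"]
y = [1, 1, 0, 0, 1, 0, 1, 0]

# Split the raw text first, before any vectorization.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, y, test_size=0.25, random_state=42)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs_train)  # learns the vocabulary from the training text
X_test = vectorizer.transform(docs_test)        # reuses it; words unseen in training are ignored

clf = MultinomialNB().fit(X_train, y_train)
print(X_train.shape[1] == X_test.shape[1])  # same number of columns in both matrices
print(clf.score(X_test, y_test))
```

Because transform maps the test documents onto the columns learned during fit_transform, clf.fit and clf.score see matrices of the same width.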

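The documentation says:

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.

Further down, you will find the description of vocabulary.

vocabulary : Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.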
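
To make the parameter concrete, here is a minimal sketch of passing a fixed vocabulary to CountVectorizer. The three-word vocabulary and the sample sentences are hypothetical and chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed vocabulary, chosen for illustration only.
vocab = ["cat", "dog", "sat"]

train_docs = ["the cat sat", "the dog sat"]
test_docs = ["a bird sat"]

# With vocabulary= given, the columns are fixed up front and not learned from the data.
vectorizer = CountVectorizer(vocabulary=vocab)
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

print(X_train.toarray())  # one column each for cat, dog, sat
print(X_test.toarray())   # words outside the vocabulary are not counted
```

Both matrices have exactly three columns, in the order given by the vocabulary, regardless of which words actually occur in each set.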

Source

2017-06-02 17:21:22 displayname

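Okay, so I would do the following. I get a list of all the words in 'text'; this is 'vocab'. Then I can do the train-test split on 'text' and 'labels'. After that, I can run 'CountVectorizer' on the separate parts, with the 'vocabulary' parameter set to 'vocab'. Correct? – JNevens

@JNevens Yes, this should work. In the end, your feature vectors have *n* dimensions, where *n* is the number of words in the entire corpus. Your model will be trained on *n*-dimensional vectors, which means you cannot change the number of dimensions afterwards - how should your model classify an *m*-dimensional input? – displayname

As stated in my question, I want to try the classifier with different n-gram orders. So if I use 'CountVectorizer' with 1-grams or with 2-grams, the vocabulary sizes differ again, since in the former case every vocabulary entry is a single word, whereas in the latter every entry is a bigram. – JNevens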
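
One possible way to handle a different vocabulary per n-gram order, sketched here under assumptions and not taken from the thread, is to wrap the vectorizer and the classifier in a sklearn Pipeline and let GridSearchCV search over ngram_range. The vocabulary is then rebuilt from the training folds for every candidate order, and the train-test split happens only once. The toy data and the step names "vect" and "clf" are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
docs = ["good movie", "great film", "bad movie", "awful film",
        "good plot", "bad plot", "great acting", "awful acting",
        "good film", "bad film", "great movie", "awful movie"]
y = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text once, before any vectorization.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, y, test_size=0.33, random_state=42)

# The pipeline refits the vectorizer inside every cross-validation fold.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Orders 1 and 2 only, because the toy documents are two words long;
# on real data the range(1, 6) from the question could be used instead.
param_grid = {"vect__ngram_range": [(n, n) for n in range(1, 3)]}
search = GridSearchCV(pipe, param_grid, cv=KFold(n_splits=3))
search.fit(docs_train, y_train)

print(search.best_params_)
print(search.score(docs_test, y_test))  # accuracy of the refitted best pipeline
```

With the default refit=True, GridSearchCV retrains the best pipeline on the whole training set, so search.score evaluates the selected n-gram order on the held-out test documents.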
