SciKit-Learn text classification from ODBC
I am trying to adapt this example to some social media data that I have in a SQL Server database.
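I deliberately restricted both the training and test sets to social media posts that contain the word "bunches". So when I run all of the algorithms, I would expect a very high F-score for that word. Instead I get an F-score of about 2-4%. I have a feeling that I am not feeding the data to the algorithms correctly.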
from __future__ import print_function
import numpy as np
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
import pyodbc
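import pprint
# pp is used below to print the fetched rows, so create the printer explicitly
pp = pprint.PrettyPrinter(indent=4)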
#local windows connection
train = []
db = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};SERVER=SERVER_IP;DATABASE=DB_NAME;Trusted_Connection=Yes;')
cursor = db.cursor()
training_query = "SELECT top 2 percent postTitle FROM dbo.All_CH_Posts where monitorID ='1168136050' and postTitle like '%bunches%' ORDER BY NEWID()"
trainquery = cursor.execute(training_query)
traindata = cursor.fetchall()
for row in traindata:
    # each row holds a single column (postTitle), so extend() appends that one string
    train.extend(row)
test = []
test_query = "SELECT top 1 percent postTitle FROM dbo.All_CH_Posts where monitorID ='1168136050' and postTitle like '%bunches%' ORDER BY NEWID()"
testquery = cursor.execute(test_query)
testdata = cursor.fetchall()
for row in testdata:
    # each row holds a single column (postTitle), so extend() appends that one string
    test.extend(row)
print('traindata')
pp.pprint(traindata)
print('testdata')
pp.pprint(testdata)
print('data loaded')
# split a training set and a test set
y_train = train
y_test = test
print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
vectorizer = TfidfVectorizer(decode_error='ignore', sublinear_tf=True,
                             stop_words='english', lowercase=True, min_df=20)
X_train = vectorizer.fit_transform(train)
duration = time() - t0
print("Extracting features from the test dataset using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(test)
duration = time() - t0
print("n_samples: %d, n_features: %d" % X_test.shape)
feature_names = np.asarray(vectorizer.get_feature_names())
print(feature_names)
I deliberately set min_df high so that I could see which words end up in my sparse matrix:
n_samples: 237, n_features: 26
['almonds' 'amp' 'best' 'bowl' 'box' 'bunches' 'cereal' 'cheerios' 'crunch'
'day' 'don' 'eat' 'eating' 'good' 'gt' 'honey' 'http' 'just' 'like' 'lol'
'love' 'miss' 'morning' 'oats' 'rt' 'want']
So am I doing something wrong? Or am I thinking about this problem the wrong way, or misunderstanding text classification?
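So what are your labels? In the code you set 'y_train = train', which looks like you are using the texts as the labels, and I find that confusing. Which F-score are you computing? In fact, if all of the texts have a feature in common, that feature is not informative and should have an F-score of zero (iirc). – 2014-11-05 00:13:31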
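The point about a shared feature can be illustrated with a minimal sketch. It assumes the score in question is a per-feature chi-squared statistic, as computed by scikit-learn's `chi2`; the comment does not say which score is meant. The toy matrix and labels are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical toy data: 4 documents, 2 binary term-presence features.
# Column 0 is a term present in every document (like "bunches" in the question);
# column 1 is a term that appears only in documents of class 0.
X = np.array([[1, 1],
              [1, 0],
              [1, 1],
              [1, 0]])
y = np.array([0, 1, 0, 1])

# chi2 returns one score and one p-value per feature (column)
scores, p_values = chi2(X, y)
print(scores)
```

Under these assumptions the term that occurs in every document scores zero, because its counts are spread across the classes exactly in proportion to the class sizes, while the term tied to one class scores above zero.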
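Adding to @AndreasMueller's comment: the example code has a section, **load some categories from the training set**, which means you need to specify the categories/labels for the classification. Can you check the value of 'data_train.target_names'? That should be the list of classes you are trying to classify. – 2014-11-05 08:04:16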
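@Guru OK, I think I understand. If I were to add labels/categories, they would be the monitorIDs. What is not clear to me is how to attach the labels to the data that goes through the vectorizer: if my data comes from "select monitorID, postTitle from Table", it will be a list of tuples, and that will crash the vectorizer. – dreyco676 2014-11-05 15:41:01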
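One possible way to handle the list of tuples, as a minimal sketch rather than a definitive answer: split the rows into two parallel sequences, pass only the texts to the vectorizer, and pass the labels to the classifier's `fit`. The rows below are hypothetical stand-ins for what `cursor.fetchall()` might return for a two-column query. The second monitorID is invented, and `MultinomialNB` is just one classifier chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical rows shaped like the result of
# "SELECT monitorID, postTitle FROM ..." (label first, text second).
rows = [
    ("1168136050", "honey bunches of oats for breakfast"),
    ("1168136050", "i love honey bunches of oats with almonds"),
    ("0000000000", "cheerios every morning"),          # invented monitorID
    ("0000000000", "eating a bowl of cheerios today"),  # invented monitorID
]

# Unzip the (label, text) tuples into two parallel lists.
labels, texts = zip(*rows)
y_train = list(labels)
texts = list(texts)

# The vectorizer only ever sees the text column.
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
X_train = vectorizer.fit_transform(texts)

# The labels go to the classifier, aligned row by row with X_train.
clf = MultinomialNB()
clf.fit(X_train, y_train)

# New text must go through the same fitted vectorizer before predicting.
pred = clf.predict(vectorizer.transform(["a bowl of cheerios in the morning"]))
print(pred[0])
```

The same unzip step would apply separately to the training and test queries, giving `y_train` and `y_test` as lists of monitorIDs instead of the texts themselves.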