2014-11-04 117 views

scikit-learn text classification from ODBC

I am trying to adapt this example to some social media data that I have in a SQL Server database.

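I deliberately restricted both the training and test sets to social media posts that contain the word "bunches". So when I run all the algorithms, I would expect a very high f-score for that word. Instead, I get f-scores of about 2-4%. I have a feeling that I am not feeding the data to the algorithms correctly.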

from __future__ import print_function 
import numpy as np 
from time import time 
from sklearn.feature_extraction.text import TfidfVectorizer 
import pyodbc 
import pprint 

#local windows connection 
train = [] 
db = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};SERVER=SERVER_IP;DATABASE=DB_NAME;Trusted_Connection=Yes;') 
cursor = db.cursor() 
training_query = "SELECT top 2 percent postTitle FROM dbo.All_CH_Posts where monitorID ='1168136050' and postTitle like '%bunches%' ORDER BY NEWID()" 
trainquery = cursor.execute(training_query) 
traindata = cursor.fetchall() 
for row in traindata: 
    train.extend(row) 

test = [] 
test_query = "SELECT top 1 percent postTitle FROM dbo.All_CH_Posts where monitorID ='1168136050' and postTitle like '%bunches%' ORDER BY NEWID()" 
testquery = cursor.execute(test_query) 
testdata = cursor.fetchall() 
for row in testdata: 
    test.extend(row) 
print('traindata')
pprint.pprint(traindata)
print('testdata')
pprint.pprint(testdata)
print('data loaded') 

# split a training set and a test set 
y_train = train 
y_test = test


print("Extracting features from the training dataset using a sparse vectorizer") 
t0 = time() 
vectorizer = TfidfVectorizer(decode_error='ignore',sublinear_tf=True, 
          stop_words='english', lowercase=True, min_df=20) 
X_train = vectorizer.fit_transform(train) 
duration = time() - t0 

print("Extracting features from the test dataset using the same vectorizer") 
t0 = time() 
X_test = vectorizer.transform(test) 
duration = time() - t0 
print("n_samples: %d, n_features: %d" % X_test.shape) 

feature_names = np.asarray(vectorizer.get_feature_names()) 
print(feature_names) 

I deliberately set min_df high so I could see which words end up in my sparse matrix:

n_samples: 237, n_features: 26 
['almonds' 'amp' 'best' 'bowl' 'box' 'bunches' 'cereal' 'cheerios' 'crunch' 
'day' 'don' 'eat' 'eating' 'good' 'gt' 'honey' 'http' 'just' 'like' 'lol' 
'love' 'miss' 'morning' 'oats' 'rt' 'want'] 
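
To make the effect of min_df concrete, here is a minimal sketch on a few invented posts (not the real data). It assumes the same TfidfVectorizer options as above apart from a much smaller threshold, and it reads the kept terms from the fitted `vocabulary_` attribute. The helper `vocabulary_for` is a hypothetical name introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy posts invented for illustration; they stand in for the postTitle values.
posts = [
    "honey bunches of oats for breakfast",
    "love my bunches of oats",
    "bunches of honey in my cereal bowl",
    "cheerios this morning",
]


def vocabulary_for(min_df):
    """Return the sorted terms that survive a given min_df threshold."""
    vectorizer = TfidfVectorizer(stop_words="english", lowercase=True,
                                 min_df=min_df)
    vectorizer.fit(posts)
    return sorted(vectorizer.vocabulary_)


vocab_all = vocabulary_for(1)       # keep every non-stop-word term
vocab_frequent = vocabulary_for(2)  # keep terms found in at least 2 posts

print(vocab_all)
print(vocab_frequent)
```

With an integer min_df, a term is kept only if it occurs in at least that many documents, which is why a threshold of 20 leaves just 26 features in the output above.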

So what am I doing wrong? Or am I thinking about this problem the wrong way, or misunderstanding text classification?

Here is my training set.

Here is my test set.


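So what are your labels? In the code you set 'y_train = train', which looks like you are using the texts as labels, and I find that confusing. Which f-score are you computing? In fact, if all texts have a feature in common, that feature is not informative and should have an f-score of zero (iirc). – 2014-11-05 00:13:31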
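
The point about a shared feature being uninformative can be checked on toy data. The sketch below is only an illustration: it swaps in the chi-squared statistic from `sklearn.feature_selection.chi2` rather than the f-score the comment mentions, uses binary word counts instead of tf-idf, and the posts and labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Invented posts: every one contains "bunches", but the two classes differ
# in their other words.
posts = [
    "honey bunches cereal",
    "bunches oats cereal",
    "bunches concert tickets",
    "bunches tickets tonight",
]
labels = [0, 0, 1, 1]  # hypothetical class labels

# binary=True records presence/absence, so "bunches" is a constant column.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(posts)

# chi2 returns one score per feature; higher means more class-dependent.
scores, _ = chi2(X, labels)
score_by_term = {term: scores[idx]
                 for term, idx in vectorizer.vocabulary_.items()}

for term in sorted(score_by_term):
    print(term, round(float(score_by_term[term]), 3))
```

A term that appears in every post is distributed across the classes exactly as the class sizes predict, so its score is zero, while words such as "cereal" or "tickets" that occur in only one class score higher.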


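Adding to @AndreasMueller's comment: the example code has a **Load some categories from the training set** section, which means you need to specify the categories/labels for classification. Can you check the value of 'data_train.target_names'? That should be the list of classes you are trying to classify. – 2014-11-05 08:04:16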


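@Guru OK, I think I understand. If I were to add labels/categories, they would be the monitorIDs. What I am unclear on is how to attach the labels to the data that goes through the vectorizer: if my data came from "select monitorID, postTitle from Table", it would be a list of tuples, and that would crash the vectorizer. – dreyco676 2014-11-05 15:41:01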
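
One way to handle the list-of-tuples concern is to split the rows into two parallel lists before vectorizing, so the vectorizer only ever sees strings. This is a minimal sketch that assumes the fetched rows behave like (monitorID, postTitle) tuples. The `rows` list is a hand-written stand-in for `cursor.fetchall()`, and the post texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hand-written stand-in for cursor.fetchall(); each row is (monitorID, postTitle).
rows = [
    ("1168136050", "honey bunches of oats this morning"),
    ("469407080", "got tickets for the show"),
    ("1168136050", "love a bowl of cereal"),
]

# zip(*rows) transposes the rows into one tuple per column.
labels, texts = zip(*rows)
labels = list(labels)
texts = list(texts)

# Only the texts are vectorized; the labels stay aligned by position.
X = TfidfVectorizer().fit_transform(texts)

print(X.shape[0], len(labels))
```

Row i of `X` and element i of `labels` describe the same post, which is the pairing a classifier's `fit(X, y)` call expects.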

Answer


Thanks to @AndreasMueller and @Guru. The problem was with my labels.

The solution was to create a label for each row.

# select the label (monitorID) together with the text (postTitle)
train_data = []
train_target = []
training_query = "SELECT top 2 percent monitorID, postTitle FROM dbo.All_CH_Posts where monitorID in ('1168136050','469407080') and postTitle <>'' ORDER BY NEWID()"
trainquery = cursor.execute(training_query)
traindata = cursor.fetchall()
for row in traindata:
    train_data.append(row.postTitle)
    train_target.append(row.monitorID)

test_data = [] 
test_target = [] 
test_query = "SELECT top 2 percent monitorID, postTitle FROM dbo.All_CH_Posts where monitorID in ('1168136050','469407080') and postTitle <>'' ORDER BY NEWID()" 
testquery = cursor.execute(test_query) 
testdata = cursor.fetchall() 
for row in testdata: 
    test_data.append(row.postTitle) 
    test_target.append(row.monitorID) 

print("data loaded") 


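# assigning labels: fit one encoder on the training labels and reuse it for
# the test labels so both sets share the same label-to-integer mapping
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
y_train = label_encoder.fit_transform(train_target)
y_test = label_encoder.transform(test_target)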
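
For completeness, here is a hedged end-to-end sketch of how labelled rows can flow through vectorizing, training, and scoring. It is not the original script: the posts are invented, the database is replaced by in-memory lists, and the classifier (MultinomialNB) and macro-averaged F1 are choices made for this illustration. It also splits a single sample with `train_test_split`, because two independent `ORDER BY NEWID()` samples can return some of the same rows in both the training and the test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Invented posts and monitorID labels standing in for the query results.
texts = [
    "honey bunches of oats for breakfast",
    "love a bowl of cereal in the morning",
    "cheerios and almonds are the best",
    "eating oats with honey every day",
    "got tickets for the concert tonight",
    "the band played a great show",
    "new album drops this friday",
    "live music downtown this weekend",
]
targets = ["1168136050"] * 4 + ["469407080"] * 4

# Split once so that no post appears in both the training and the test set.
train_texts, test_texts, train_targets, test_targets = train_test_split(
    texts, targets, test_size=0.25, stratify=targets, random_state=0)

# Fit the encoder on the training labels, then reuse it for the test labels.
encoder = LabelEncoder()
y_train = encoder.fit_transform(train_targets)
y_test = encoder.transform(test_targets)

# Fit the vectorizer on training text only; the test text is just transformed.
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = MultinomialNB().fit(X_train, y_train)
predictions = clf.predict(X_test)
score = f1_score(y_test, predictions, average="macro")

print("classes:", list(encoder.classes_))
print("macro F1: %.3f" % score)
```

The F1 value on a toy set this small says nothing about the real data. The sketch is only meant to show where the labels enter the pipeline.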