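How to do Naive Bayes text classification using the sklearn library?

I am trying to do text classification with a Naive Bayes text classifier. My data is in the format below, and based on the question and excerpt I have to decide the topic of the question. The training data has more than 20K records. I know an SVM would be a better option here, but I want to use Naive Bayes from the sklearn library.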

{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png). \n\nWhat is the effective capacitance of this circuit and will the ...\r\n  "}, 
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745). The new outlet has 3 wires coming out of it--a black, a white, and a green. Each one needs to be attached with a wire nut to ...\r\n  "}]} 

This is what I have tried so far:

import numpy as np 
import json 
from sklearn.naive_bayes import * 

topic = [] 
question = [] 
excerpt = [] 

with open('training.json') as f: 
    for line in f: 
     data = json.loads(line) 
     topic.append(data["topic"]) 
     question.append(data["question"]) 
     excerpt.append(data["excerpt"]) 

unique_topics = list(set(topic)) 
new_topic = [x.encode('UTF8') for x in topic] 
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic] 
numeric_topics = [float(i) for i in numeric_topics] 

x1 = np.array(question) 
x2 = np.array(excerpt) 
X = zip(*[x1,x2]) 
Y = np.array(numeric_topics) 
print X[0] 
clf = BernoulliNB() 
clf.fit(X, Y) 
print "Prediction:", clf.predict(['hello']) 

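But, as expected, I get a ValueError: could not convert string to float. My question is: how do I create a simple classifier that assigns a question and its excerpt to the related topic?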

Answers

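All classifiers in sklearn require the input to be represented as vectors of some fixed dimensionality. For text there are CountVectorizer, HashingVectorizer and TfidfVectorizer, which can transform your strings into vectors of floating-point numbers.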

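from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
X = vect.fit_transform(X)  # X must be an iterable of strings, one per document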
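
To make the "fixed dimensionality" point concrete, here is a small sketch (Python 3) of what the vectorizer returns. The three documents are made up for illustration and are not from the asker's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for this illustration only.
docs = [
    "effective capacitance of this circuit",
    "replacing a wall outlet with a USB outlet",
    "capacitance of a wall outlet circuit",
]

vect = TfidfVectorizer()
X = vect.fit_transform(docs)  # sparse matrix: one row per document

n_docs, n_features = X.shape
vocab_size = len(vect.vocabulary_)  # one column per distinct token seen in fit

print(n_docs, n_features, vocab_size)
```

Every document becomes a row of the same width (the vocabulary size learned during fitting), which is the shape a classifier's fit method expects.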

Obviously, you need to vectorize your test set in the same way:

clf.predict(vect.transform(['hello'])) 
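
Putting the pieces together, a minimal end-to-end sketch (Python 3) could look like the following. The training sentences are invented for illustration, and MultinomialNB is my choice here rather than something the question requires; BernoulliNB exposes the same fit/predict interface. Note that sklearn classifiers accept string labels directly, so the topics do not need to be mapped to numbers by hand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration; real data would be
# loaded from training.json as in the question.
train_texts = [
    "effective capacitance of this circuit with two capacitors",
    "wall outlet wiring with black white and green wires",
    "resistor and capacitor values in an amplifier circuit",
    "install custom rom on android phone",
    "android app crashes after system update",
    "android phone battery drains quickly",
]
train_topics = ["electronics", "electronics", "electronics",
                "android", "android", "android"]

# Learn the vocabulary on the training texts and vectorize them.
vect = TfidfVectorizer()
X_train = vect.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_topics)  # string labels are accepted as-is

# New text must go through the SAME fitted vectorizer (transform, not fit).
X_new = vect.transform(["how do I wire a capacitor into this circuit"])
prediction = clf.predict(X_new)
print(prediction[0])
```

The key point is that fit_transform is called once on the training text, and only transform is called on anything you want to predict, so both end up in the same feature space.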

See a tutorial on using sklearn with textual data.

I am getting the error AttributeError: 'tuple' object has no attribute 'lower' when using X = vect.fit_transform(X), where X is an iterable list.

It was a numpy array problem. I fixed it. Thanks a lot for your help.
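
The comment does not say how the problem was fixed. One possible fix, assuming the cause was that zip() produced (question, excerpt) tuples instead of strings, is to join each pair into a single string before vectorizing. A sketch in Python 3, with made-up records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy records, invented for illustration.
question = ["effective capacitance of this circuit", "android phone will not boot"]
excerpt = ["two capacitors in series", "stuck on the boot logo after an update"]

# zip() yields (question, excerpt) tuples. TfidfVectorizer lowercases each
# item by calling .lower() on it, which a tuple does not have. Joining each
# pair gives one plain string per record.
texts = [q + " " + e for q, e in zip(question, excerpt)]

vect = TfidfVectorizer()
X = vect.fit_transform(texts)
print(X.shape[0])
```

Concatenating the two fields treats question and excerpt as one bag of words, which is the simplest option; it is an assumption on my part that this is acceptable for the task.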