2017-04-04 116 views

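NLTK Naive Bayes classifier training problem

I am trying to train a classifier on tweets. However, it reports 100% accuracy, and the list of most informative features shows nothing. Does anyone know what I am doing wrong? I believe all of my inputs to the classifier are correct, so I can't tell where it goes wrong.

This is the dataset I am using: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

Here is my code: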

import nltk 
import random 

file = open('Train/train.txt', 'r') 


documents = [] 
all_words = []   #TODO remove punctuation? 
INPUT_TWEETS = 3000 

print("Preprocessing...") 
for line in (file):

    # Tokenize Tweet content
    tweet_words = nltk.word_tokenize(line[2:])

    sentiment = ""
    if line[0] == 0:
        sentiment = "negative"
    else:
        sentiment = "positive"
    documents.append((tweet_words, sentiment))

    for word in tweet_words:
        all_words.append(word.lower())

    INPUT_TWEETS = INPUT_TWEETS - 1
    if INPUT_TWEETS == 0:
        break

random.shuffle(documents) 


all_words = nltk.FreqDist(all_words) 

word_features = list(all_words.keys())[:3000] #top 3000 words 

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

#Categorize as positive or Negative 
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents] 


training_set = feature_set[:1000] 
testing_set = feature_set[1000:] 

print("Training...") 
classifier = nltk.NaiveBayesClassifier.train(training_set) 

print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100) 
classifier.show_most_informative_features(15) 
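
A side note on the `#top 3000 words` line in the code above: in NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so `list(all_words.keys())[:3000]` returns 3000 keys in whatever order they are stored (first-seen order on Python 3.7+), not the 3000 most frequent words. A minimal sketch of the difference, using `Counter` as a stand-in for `FreqDist` and a made-up token list:

```python
from collections import Counter

# Made-up tokens: the frequent words appear after some rare ones.
tokens = ["a", "rare", "word", "then", "the", "the", "the", "cat", "cat"]
freq = Counter(tokens)

# Slicing the keys ignores the counts entirely.
first_keys = list(freq.keys())[:2]

# most_common(n) sorts by count, highest first.
top_words = [word for word, _count in freq.most_common(2)]

print(first_keys)  # ['a', 'rare']
print(top_words)   # ['the', 'cat']
```

In the question's code this would probably correspond to `word_features = [w for w, _ in all_words.most_common(3000)]`.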

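It looks like the problem is in `line[0] == 0`, which compares against the `int` 0. I doubt your input actually uses null bytes to represent negative sentiment. – alexis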
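
To illustrate this comment: in Python 3 every character read from a text file is a `str`, and a `str` never compares equal to the `int` 0, so the `if` branch is never taken and every tweet is labelled positive. A minimal sketch, assuming each line of train.txt starts with the digit 0 or 1 followed by a separator (the sample lines and the helper names `label_buggy` and `label_fixed` are made up):

```python
# Hypothetical sample lines in the assumed "label<separator>text" layout.
lines = ["0 I feel awful today", "1 what a lovely morning"]

def label_buggy(line):
    # Compares a one-character str with the int 0: always False.
    return "negative" if line[0] == 0 else "positive"

def label_fixed(line):
    # Compares str with str, so the label digit is actually checked.
    return "negative" if line[0] == "0" else "positive"

buggy_labels = [label_buggy(line) for line in lines]
fixed_labels = [label_fixed(line) for line in lines]

print(buggy_labels)  # ['positive', 'positive']
print(fixed_labels)  # ['negative', 'positive']
```

If every example carries the same label, a reported accuracy of 100% and an empty list of informative features would be consistent with the symptoms in the question.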

Answer

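There is a typo in your code:

feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]

Because the loop variable is spelled `sentment`, `sentiment` always has the same value (namely the value for the last tweet from your preprocessing step), so training is pointless and all features are irrelevant.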
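
The effect can be reproduced without NLTK. A minimal sketch, with a made-up `documents` list and a module-level `sentiment` standing in for the variable left over from the preprocessing loop:

```python
# Hypothetical stand-in for the (tokens, label) pairs built during preprocessing.
documents = [(["bad", "day"], "negative"), (["great", "day"], "positive")]

# Simulates the variable left over from the preprocessing loop (last tweet's label).
sentiment = "positive"

# Misspelled loop variable: `sentiment` resolves to the leftover outer variable.
buggy = [(words, sentiment) for (words, sentment) in documents]

# Correct spelling: each document keeps its own label.
fixed = [(words, sentiment) for (words, sentiment) in documents]

print([label for _, label in buggy])  # ['positive', 'positive']
print([label for _, label in fixed])  # ['negative', 'positive']
```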

Fix it and you will get:

('Naive Bayes Accuracy:', 66.75) 
Most Informative Features 
        -- = True   positi : negati =  6.9 : 1.0 
       these = True   positi : negati =  5.6 : 1.0 
       face = True   positi : negati =  5.6 : 1.0 
       saw = True   positi : negati =  5.6 : 1.0 
        ] = True   positi : negati =  4.4 : 1.0 
       later = True   positi : negati =  4.4 : 1.0 
       love = True   positi : negati =  4.1 : 1.0 
        ta = True   positi : negati =  4.0 : 1.0 
       quite = True   positi : negati =  4.0 : 1.0 
       trying = True   positi : negati =  4.0 : 1.0 
       small = True   positi : negati =  4.0 : 1.0 
       thx = True   positi : negati =  4.0 : 1.0 
       music = True   positi : negati =  4.0 : 1.0 
        p = True   positi : negati =  4.0 : 1.0 
      husband = True   positi : negati =  4.0 : 1.0 
I fixed the typo, but my output did not change: it is still 100% and no features are shown.

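Then maybe your train.txt is corrupted or incomplete? I read the original data into a DataFrame using `df = pd.read_csv('Sentiment Analysis Dataset.csv', error_bad_lines=False, encoding='utf-8')` and iterated over the rows with `df.iterrows()`. That gave the output pasted above. – acidtobi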

Could you show me the whole code for reading the .csv?
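
A minimal sketch of the reading step described in the comment above, not the code that was actually used in the answer. It swaps `pandas` for the standard-library `csv` module so it runs without extra dependencies, and it assumes the CSV has a header row with `Sentiment` (0 or 1) and `SentimentText` columns. The in-memory sample is made up to stand in for `Sentiment Analysis Dataset.csv`:

```python
import csv
import io

# Made-up sample in the assumed layout of the dataset's CSV file.
sample = io.StringIO(
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,0,Sentiment140,is so sad for my APL friend\n"
    "2,1,Sentiment140,\"omg, its already 7:30 :O\"\n"
)

documents = []
for row in csv.DictReader(sample):
    # The label is read as a str, so compare against "0", not the int 0.
    sentiment = "negative" if row["Sentiment"] == "0" else "positive"
    # str.split() is a simple stand-in for nltk.word_tokenize().
    documents.append((row["SentimentText"].split(), sentiment))

print([label for _, label in documents])  # ['negative', 'positive']
```

To read the real file, `sample` could be replaced with `open('Sentiment Analysis Dataset.csv', encoding='utf-8', newline='')`. If you stay with pandas, note that `error_bad_lines` was deprecated in pandas 1.3 in favour of `on_bad_lines='skip'`.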