NLTK Naive Bayes classifier training issue

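I am trying to train a classifier on tweets. The problem is that it reports 100% accuracy, and the list of most informative features shows nothing. Does anyone know what I am doing wrong? I believe all of my inputs to the classifier are correct, so I cannot tell where it goes wrong.

This is the dataset I am using: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

This is my code: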

import nltk 
import random 

file = open('Train/train.txt', 'r') 


documents = [] 
all_words = []   #TODO remove punctuation? 
INPUT_TWEETS = 3000 

print("Preprocessing...") 
for line in (file): 

    # Tokenize Tweet content 
    tweet_words = nltk.word_tokenize(line[2:]) 

    sentiment = "" 
    if line[0] == 0: 
     sentiment = "negative" 
    else: 
     sentiment = "positive" 
    documents.append((tweet_words, sentiment)) 

    for word in tweet_words: 
     all_words.append(word.lower()) 

    INPUT_TWEETS = INPUT_TWEETS - 1 
    if INPUT_TWEETS == 0: 
     break 

random.shuffle(documents) 


all_words = nltk.FreqDist(all_words) 

word_features = list(all_words.keys())[:3000] #top 3000 words 

def find_features(document): 
    words = set(document) 
    features = {} 
    for w in word_features: 
     features[w] = (w in words) 

    return features 

#Categorize as positive or Negative 
feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents] 


training_set = feature_set[:1000] 
testing_set = feature_set[1000:] 

print("Training...") 
classifier = nltk.NaiveBayesClassifier.train(training_set) 

print("Naive Bayes Accuracy:", (nltk.classify.accuracy(classifier,testing_set))*100) 
classifier.show_most_informative_features(15)

Source

2017-04-04 Daniel Medina

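The problem appears to be in `line[0] == 0`, which compares against the `int` 0. I doubt your input actually uses a null byte to represent negative sentiment. – alexis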
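To illustrate the comment above: a line read from a text file is a `str`, so `line[0]` is a one-character string and can never equal the integer `0`. The sketch below assumes each line of train.txt starts with the character `0` or `1` followed by a separator (inferred from the `line[2:]` slice in the question); the sample lines and the `label_for` helper are hypothetical.

```python
# Hypothetical sample line in the assumed "<label> <tweet text>" layout.
line = "0 is upset that he cannot update his Facebook"

# A str never compares equal to an int, so this branch is never taken.
print(line[0] == 0)    # False
# Comparing against the character "0" behaves as intended.
print(line[0] == "0")  # True


def label_for(line):
    """Map the leading character of a line to a sentiment label (assumed format)."""
    return "negative" if line[0] == "0" else "positive"


labels = [label_for(l) for l in ["0 sad tweet", "1 happy tweet"]]
print(labels)
```

With the original comparison, the `else` branch always runs, so every tweet is labelled "positive" regardless of the data.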

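There is a typo in your code:

feature_set = [(find_features(all_words), sentiment) for (all_words, sentment) in documents]

This causes `sentiment` to always have the same value (namely the value from the last tweet in the preprocessing step), so the training is pointless and all features are irrelevant.

Fix it and you will get: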

('Naive Bayes Accuracy:', 66.75) 
Most Informative Features 
        -- = True   positi : negati =  6.9 : 1.0 
       these = True   positi : negati =  5.6 : 1.0 
       face = True   positi : negati =  5.6 : 1.0 
       saw = True   positi : negati =  5.6 : 1.0 
        ] = True   positi : negati =  4.4 : 1.0 
       later = True   positi : negati =  4.4 : 1.0 
       love = True   positi : negati =  4.1 : 1.0 
        ta = True   positi : negati =  4.0 : 1.0 
       quite = True   positi : negati =  4.0 : 1.0 
       trying = True   positi : negati =  4.0 : 1.0 
       small = True   positi : negati =  4.0 : 1.0 
       thx = True   positi : negati =  4.0 : 1.0 
       music = True   positi : negati =  4.0 : 1.0 
        p = True   positi : negati =  4.0 : 1.0 
      husband = True   positi : negati =  4.0 : 1.0
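The effect of the typo can be reproduced in isolation. This is a minimal sketch with a hypothetical two-item `documents` list, not the question's data: because the loop variable is misspelled, the name `sentiment` inside the comprehension refers to the stale module-level variable left over from the preprocessing loop.

```python
# Hypothetical miniature version of the question's `documents` list.
documents = [(["good", "day"], "positive"), (["bad", "day"], "negative")]

# Stale value left behind by an earlier loop, as in the question's preprocessing step.
sentiment = "positive"

# Typo: the loop variable is `sentment`, so `sentiment` is the stale outer value.
buggy = [(words, sentiment) for (words, sentment) in documents]

# Fixed: the loop variable name matches the name used in the result tuple.
fixed = [(words, sentiment) for (words, sentiment) in documents]

print([label for _, label in buggy])  # every item carries the same label
print([label for _, label in fixed])  # labels follow the data
```

When every example carries the same label, the classifier can only ever predict that label, which trivially matches the equally mislabelled test set. That accounts for both the 100% accuracy and the empty feature list.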

Source

2017-04-04 20:30:29 acidtobi

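I fixed the typo, but my output did not change: it is still 100% and no features are shown. –

Then maybe your train.txt is corrupted or incomplete? I read the original data into a DataFrame using `df = pd.read_csv('Sentiment Analysis Dataset.csv', error_bad_lines=False, encoding='utf-8')` and iterated over the rows with `df.iterrows()`. That produced the output pasted above. – acidtobi

Can you show me the whole code for reading the .csv? –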
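The thread does not include acidtobi's full code, so the following is only a sketch of one way to read the CSV into the `documents` structure the question uses. It relies on the standard `csv` module rather than pandas, and it assumes the file has the header `ItemID,Sentiment,SentimentSource,SentimentText`; that layout, the in-memory sample rows, and the `read_documents` helper are assumptions, not taken from the thread. Tokenization uses `str.split` to keep the example self-contained; `nltk.word_tokenize` could be substituted.

```python
import csv
import io

# Hypothetical two-row sample in the assumed column layout of the dataset.
sample = io.StringIO(
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,0,Sentiment140,is so sad for my APL friend\n"
    "2,1,Sentiment140,omg its already 7:30 :O\n"
)


def read_documents(handle, limit=3000):
    """Return up to `limit` (tokens, sentiment) pairs from a CSV file handle."""
    documents = []
    for row in csv.DictReader(handle):
        if len(documents) >= limit:
            break
        text = row.get("SentimentText")
        label = row.get("Sentiment")
        # Skip malformed rows instead of guessing a label for them.
        if text is None or label not in ("0", "1"):
            continue
        sentiment = "negative" if label == "0" else "positive"
        documents.append((text.split(), sentiment))
    return documents


documents = read_documents(sample)
print(documents)
```

For the real file, the handle would come from something like `open('Sentiment Analysis Dataset.csv', encoding='utf-8', newline='')`. If you prefer the pandas route from the comment above, note that newer pandas releases replaced `error_bad_lines=False` with `on_bad_lines='skip'`.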
