2016-02-19

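Using scikit-learn to distinguish similar categories

I want to classify the text of documents into categories. Each document can belong to exactly one of the following categories: PR, AR, KID, SAR.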

I found an example that uses scikit-learn and was able to adapt it:

import numpy 
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.multiclass import OneVsRestClassifier 
from pandas import DataFrame 

def build_data_frame(path, classification): 
    rows = [] 
    index = [] 

    # Use a context manager so the file handle is always closed.
    with open(path, mode='r', encoding='utf8') as f:
        txt = f.read()

    rows.append({'text': txt, 'class': classification}) 
    index.append(path) 

    data_frame = DataFrame(rows, index=index) 
    return data_frame 

# Categories 
PR = 'PR' 
AR = 'AR' 
KID = 'KID' 
SAR = 'SAR' 

# Training documents 
SOURCES = [ 
    (r'C:/temp_training/PR/PR1.txt', PR), 
    (r'C:/temp_training/PR/PR2.txt', PR), 
    (r'C:/temp_training/PR/PR3.txt', PR), 
    (r'C:/temp_training/PR/PR4.txt', PR), 
    (r'C:/temp_training/PR/PR5.txt', PR), 
    (r'C:/temp_training/AR/AR1.txt', AR), 
    (r'C:/temp_training/AR/AR2.txt', AR), 
    (r'C:/temp_training/AR/AR3.txt', AR), 
    (r'C:/temp_training/AR/AR4.txt', AR), 
    (r'C:/temp_training/AR/AR5.txt', AR), 
    (r'C:/temp_training/KID/KID1.txt', KID), 
    (r'C:/temp_training/KID/KID2.txt', KID), 
    (r'C:/temp_training/KID/KID3.txt', KID), 
    (r'C:/temp_training/KID/KID4.txt', KID), 
    (r'C:/temp_training/KID/KID5.txt', KID), 
    (r'C:/temp_training/SAR/SAR1.txt', SAR), 
    (r'C:/temp_training/SAR/SAR2.txt', SAR), 
    (r'C:/temp_training/SAR/SAR3.txt', SAR), 
    (r'C:/temp_training/SAR/SAR4.txt', SAR), 
    (r'C:/temp_training/SAR/SAR5.txt', SAR) 
] 

# Real documents 
TESTS = [ 
    (r'C:/temp_testing/PR/PR1.txt'), 
    (r'C:/temp_testing/PR/PR2.txt'), 
    (r'C:/temp_testing/PR/PR3.txt'), 
    (r'C:/temp_testing/PR/PR4.txt'), 
    (r'C:/temp_testing/PR/PR5.txt'), 
    (r'C:/temp_testing/AR/AR1.txt'), 
    (r'C:/temp_testing/AR/AR2.txt'), 
    (r'C:/temp_testing/AR/AR3.txt'), 
    (r'C:/temp_testing/AR/AR4.txt'), 
    (r'C:/temp_testing/AR/AR5.txt'), 
    (r'C:/temp_testing/KID/KID1.txt'), 
    (r'C:/temp_testing/KID/KID2.txt'), 
    (r'C:/temp_testing/KID/KID3.txt'), 
    (r'C:/temp_testing/KID/KID4.txt'), 
    (r'C:/temp_testing/KID/KID5.txt'), 
    (r'C:/temp_testing/SAR/SAR1.txt'), 
    (r'C:/temp_testing/SAR/SAR2.txt'), 
    (r'C:/temp_testing/SAR/SAR3.txt'), 
    (r'C:/temp_testing/SAR/SAR4.txt'), 
    (r'C:/temp_testing/SAR/SAR5.txt') 
] 

# DataFrame.append was removed in pandas 2.0, so build the frame with concat.
from pandas import concat

data_train = concat([build_data_frame(path, classification)
                     for path, classification in SOURCES])

data_train = data_train.reindex(numpy.random.permutation(data_train.index)) 

examples = [] 

for path in TESTS: 
    # Use a context manager so the file handle is always closed.
    with open(path, mode='r', encoding='utf8') as f:
        txt = f.read()

    examples.append(txt) 

target_names = [PR, AR, KID, SAR] 

classifier = Pipeline([ 
    ('vectorizer', CountVectorizer(ngram_range=(1, 2), analyzer='word', strip_accents='unicode', stop_words='english')), 
    ('tfidf', TfidfTransformer()), 
    ('clf', OneVsRestClassifier(LinearSVC()))]) 
classifier.fit(data_train['text'], data_train['class']) 
predicted = classifier.predict(examples) 

print(predicted) 

Output:

['PR' 'PR' 'PR' 'PR' 'PR' 'AR' 'AR' 'AR' 'AR' 'AR' 'KID' 'KID' 'KID' 'KID' 
'KID' 'AR' 'AR' 'AR' 'SAR' 'AR'] 

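PR, AR and KID are recognized perfectly.

However, the SAR documents (the last five) are misclassified, except for one. SAR and AR are very similar, which could explain why the algorithm confuses them.

I tried playing with the n-gram range, but 1 (min) and 2 (max) seem to give the best results.

  • Any ideas on how to improve the accuracy when distinguishing the AR and SAR categories?

  • Is there a way to show the recognition percentage for a specific document? For example PR (70%), meaning the algorithm is 70% confident in its prediction.

If you need the documents, here is the set: http://1drv.ms/21dnL6j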

Answer


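This is not strictly a programming question, so I suggest you try posting it on a more data-science-oriented Stack Exchange site.

Anyway, here are some things you can try:

  • Try some other classifiers.
  • Tune the classifier's hyperparameters with a grid search.
  • Use OneVsOne instead of OneVsAll (one-vs-rest) as the strategy. This may help you separate SAR from AR.
  • To show the recognition percentage for a specific document, you can use the probability output that some models provide: call classifier.predict_proba instead of classifier.predict. Note that LinearSVC does not implement predict_proba, so you would need a model that does, such as LogisticRegression, or a calibration wrapper such as CalibratedClassifierCV.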
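
Here is a rough sketch of how the last three suggestions could look. It is only an illustration under a few assumptions: the tiny corpus is invented (I do not know what your files contain), the parameter grid values are arbitrary starting points, TfidfVectorizer stands in for your CountVectorizer + TfidfTransformer pair, and LogisticRegression is swapped in for the probability part because it supports predict_proba.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: a hypothetical stand-in for the real training files.
texts = [
    "press release about the new product launch",
    "press release announcing quarterly earnings",
    "press release on the company merger",
    "annual report with audited financial statements",
    "annual report covering yearly revenue and costs",
    "annual report including the auditor opinion",
    "key investor document describing fund risks",
    "key investor document listing fund charges",
    "key investor document on past fund performance",
    "semi annual report with unaudited financial statements",
    "semi annual report covering half year revenue",
    "semi annual report for the first six months",
]
labels = ["PR"] * 3 + ["AR"] * 3 + ["KID"] * 3 + ["SAR"] * 3

# One-vs-one strategy: one LinearSVC is trained per pair of classes.
ovo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsOneClassifier(LinearSVC())),
])

# Arbitrary example grid: n-gram range and the SVM regularization strength C.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation; with 3 documents per class each fold holds one per class.
search = GridSearchCV(ovo_pipeline, param_grid, cv=3)
search.fit(texts, labels)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)

# Probability output: LogisticRegression is used here because LinearSVC
# has no predict_proba method.
proba_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
proba_pipeline.fit(texts, labels)

new_doc = ["semi annual report with half year figures"]
probabilities = proba_pipeline.predict_proba(new_doc)[0]
class_names = proba_pipeline.named_steps["clf"].classes_
for name, p in zip(class_names, probabilities):
    print(f"{name}: {p:.0%}")
```

With only five documents per class, the cross-validated scores will be noisy, so treat the "best" parameters as a hint rather than a firm result. If you want to keep LinearSVC and still get probabilities, wrapping it in CalibratedClassifierCV is another option, though calibration also needs more data than you currently have.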

Good luck!