2015-04-06

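Using an SVM to assign a single label to an entire document

I would like to know how to train an SVM that takes a whole document as input and assigns a single label to that document. Until now I have only labeled single words. For example, an input document might contain 6 to 10 sentences, and the whole document would be labeled with a single class for training.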

Answers

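The basic approach is as follows:

  1. Create a list of training documents and a list of their labels/classes.
  2. Tokenize your training documents.
  3. Remove stop words from the documents.
  4. Compute TF-IDF values for your documents.
  5. Restrict the TF-IDF features to the N most frequent terms, e.g. N = 1000.
  6. Train an SVM on the restricted TF-IDF data and your labels.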

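You then have a classifier that maps documents in TF-IDF format to class labels. To classify a test document, first convert it to the same TF-IDF format (same vocabulary, same IDF weights), then pass it to the classifier.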
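As a minimal sketch of what "the same TF-IDF format" means, assuming scikit-learn's TfidfVectorizer: the vectorizer is fitted once on the training documents, and a new document is only transformed, so it ends up with the same columns. The toy documents and the value max_features=5 are illustrative assumptions for this sketch, not part of the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training documents (illustrative only)
train_docs = [
    "The fox jumped over the fence",
    "The fox sleeps under the tree",
    "New York City is full of skyscrapers",
]

# Remove English stop words and keep only the 5 most frequent terms
# (N = 1000 is suggested above for real data; 5 suits this toy corpus)
vectorizer = TfidfVectorizer(stop_words="english", max_features=5)

# fit_transform learns the vocabulary and IDF weights from the training documents
X_train = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; unseen words such as "garden" are ignored
X_new = vectorizer.transform(["There is a fox in the garden"])

print(X_train.shape)  # (number of training documents, number of kept terms)
print(X_new.shape)    # one row with the same number of columns
```

Because the new document has the same number of columns as the training matrix, a classifier trained on X_train can be applied to X_new directly.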

Here is an example in Python with scikit-learn: an SVM that classifies documents as being about either foxes or cities:

from sklearn import svm 
from sklearn.feature_extraction.text import TfidfVectorizer 

# Training examples (already tokenized, 6x fox and 6x city) 
docs_train = [ 
    "The fox jumped over the fence .", 
    "The fox sleeps under the tree .", 
    "A fox walks through the high grass .", 
    "Didn 't see a single fox today .", 
    "I saw a fox yesterday near the lake .", 
    "You might encounter foxes at the lake .", 

    "New York City is full of skyscrapers .", 
    "Los Angeles is a city on the west coast .", 
    "I 've been to Los Angeles before .", 
    "Let 's travel to Mexico City .", 
    "There are no skyscrapers in Washington .", 
    "Washington is a beautiful city ." 
] 

# Test examples (already tokenized, 2x fox and 2x city) 
docs_test = [ 
    "There 's a fox in the garden .", 
    "Did you see the fox next to the tree ?", 
    "What 's the shortest way to Los Alamos ?", 
    "Traffic in New York is a pain" 
] 

# Labels of training examples (6x fox and 6x city) 
y_train = ["fox", "fox", "fox", "fox", "fox", "fox", 
      "city", "city", "city", "city", "city", "city"] 

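# Convert training and test examples to TFIDF 
# The vectorizer also removes stopwords and converts the texts to lowercase. 
# min_df=1 keeps every term; recent scikit-learn versions reject the integer 0. 
vectorizer = TfidfVectorizer(max_df=1.0, max_features=10000, 
                             min_df=1, stop_words='english') 

# Learn the vocabulary and IDF weights (labels are not used in this step) 
vectorizer.fit(docs_train + docs_test) 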

X_train = vectorizer.transform(docs_train) 
X_test = vectorizer.transform(docs_test) 

# Train an SVM on TFIDF data of the training documents 
clf = svm.SVC() 
clf.fit(X_train, y_train) 

# Test the SVM on TFIDF data of the test documents 
print(clf.predict(X_test)) 

The output is as expected (2x fox and 2x city):

['fox' 'fox' 'city' 'city']
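
A possible variant, sketched on the assumption that you want the vectorizer fitted on the training documents only: scikit-learn's make_pipeline chains the TfidfVectorizer and the classifier, so the same transformation is applied at training and at prediction time. LinearSVC is swapped in here for svm.SVC as a common choice for sparse text features; that swap and the shortened document lists are assumptions of this sketch, not part of the example above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A shortened version of the training data (3x fox, 3x city)
docs_train = [
    "The fox jumped over the fence .",
    "The fox sleeps under the tree .",
    "A fox walks through the high grass .",
    "New York City is full of skyscrapers .",
    "Los Angeles is a city on the west coast .",
    "Washington is a beautiful city .",
]
y_train = ["fox", "fox", "fox", "city", "city", "city"]

docs_test = [
    "Did you see the fox next to the tree ?",
    "Traffic in New York is a pain",
]

# The pipeline fits the vectorizer on the training documents only,
# then trains a linear SVM on the resulting TF-IDF matrix.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=1000),
    LinearSVC(),
)
model.fit(docs_train, y_train)

# predict() vectorizes the raw test documents with the fitted vocabulary
predictions = model.predict(docs_test)
print(predictions)
```

With this layout the raw text goes straight into model.predict, and there is no separate vectorizer object to keep in sync with the classifier.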