
Is there any way to implement skip-grams with the scikit-learn library? I have manually generated a list of k-skip-n-grams and passed it as the vocabulary to CountVectorizer(). Is there a way to have scikit-learn produce the skip-grams itself?

Unfortunately, its predictive performance is very poor: only 63% accuracy. However, I get 77-80% accuracy with the default n-grams from ngram_range(min, max).

Is there a better way to implement skip-grams in scikit?

Here is part of my code:

corpus = GetCorpus()  # reads the texts from a file into a list

vocabulary = list(GetVocabulary(corpus, k, n))
# returns the k-skip-n-grams of the corpus

vec = CountVectorizer(
    tokenizer=lambda x: x.split(),
    ngram_range=(2, 2),
    stop_words=stopWords,
    vocabulary=vocabulary)
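
For reference, a k-skip-n-gram vocabulary like the one GetVocabulary returns could be built along these lines (a minimal sketch with a hypothetical helper; the actual implementation is not shown above):

from itertools import combinations

def skip_ngrams(tokens, k, n):
    # hypothetical stand-in for GetVocabulary: fix a head token, then
    # choose the remaining n-1 tokens from the next n-1+k positions
    grams = set()
    for i in range(len(tokens) - n + 1):
        for tail in combinations(tokens[i + 1:i + n + k], n - 1):
            grams.add(' '.join([tokens[i]] + list(tail)))
    return grams

# 1-skip-2-grams of a toy sentence
print(sorted(skip_ngrams('the rain in spain'.split(), k=1, n=2)))
# ['in spain', 'rain in', 'rain spain', 'the in', 'the rain']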

Answers


To vectorize text with skip-grams in scikit-learn, simply passing the skip-gram tokens as the vocabulary to CountVectorizer will not work. You need to modify the way tokens are processed, which can be done with a custom analyzer. Below is an example vectorizer that produces 1-skip-2-grams,

from toolz import compose
from toolz.curried import map as cmap, sliding_window, pluck
from sklearn.feature_extraction.text import CountVectorizer

class SkipGramVectorizer(CountVectorizer):
    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stop_words = self.get_stop_words()
        tokenize = self.build_tokenizer()
        return lambda doc: self._word_skip_grams(
            compose(tokenize, preprocess, self.decode)(doc),
            stop_words)

    def _word_skip_grams(self, tokens, stop_words=None):
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        # 1-skip-2-grams: slide a window of 3 tokens over the text and
        # keep only the first and third token of each window
        return compose(cmap(' '.join), pluck([0, 2]), sliding_window(3))(tokens)

For example, given the text from this Wikipedia example,

text = ['the rain in Spain falls mainly on the plain'] 

vect = SkipGramVectorizer() 
vect.fit(text) 
vect.get_feature_names() 

this vectorizer will produce the following tokens,

['falls on', 'in falls', 'mainly the', 'on plain', 
'rain spain', 'spain mainly', 'the in'] 
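
Since vect has already been fitted above, transforming the same text gives an ordinary scikit-learn count matrix with one column per skip-gram; here each pair occurs exactly once:

X = vect.transform(text)
print(X.toarray())
# [[1 1 1 1 1 1 1]]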

Thanks for your reply, bro. I'll try it soon and let you know how it goes.


I came up with my own implementation of a skip-gram vectorizer. It is inspired by this post. To limit the feature space, I also restrict skip-grams from crossing sentence boundaries (using nltk.sent_tokenize). Here is my code:

import nltk
from itertools import combinations
from toolz import compose
from sklearn.feature_extraction.text import CountVectorizer

class SkipGramVectorizer(CountVectorizer):

    def __init__(self, k=1, **kwds):
        super(SkipGramVectorizer, self).__init__(**kwds)
        self.k = k

    def build_sent_analyzer(self, preprocess, stop_words, tokenize):
        return lambda sent: self._word_skip_grams(
            compose(tokenize, preprocess, self.decode)(sent),
            stop_words)

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stop_words = self.get_stop_words()
        tokenize = self.build_tokenizer()
        sent_analyze = self.build_sent_analyzer(preprocess, stop_words, tokenize)

        return lambda doc: self._sent_skip_grams(doc, sent_analyze)

    def _sent_skip_grams(self, doc, sent_analyze):
        # split the document into sentences first so that skip-grams
        # never cross sentence boundaries
        skip_grams = []
        for sent in nltk.sent_tokenize(doc):
            skip_grams.extend(sent_analyze(sent))
        return skip_grams

    def _word_skip_grams(self, tokens, stop_words=None):
        """Turn tokens into a sequence of k-skip-n-grams after stop words filtering"""
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        # handle token n-grams
        min_n, max_n = self.ngram_range
        k = self.k
        if max_n != 1:
            original_tokens = tokens
            if min_n == 1:
                # no need to do any slicing for unigrams
                # just iterate through the original tokens
                tokens = list(original_tokens)
                min_n += 1
            else:
                tokens = []

            n_original_tokens = len(original_tokens)

            # bind method outside of loop to reduce overhead
            tokens_append = tokens.append
            space_join = " ".join

            for n in range(min_n,
                           min(max_n + 1, n_original_tokens + 1)):
                for i in range(n_original_tokens - n + 1):
                    # k-skip-n-grams: fix the head token, then draw the
                    # remaining n-1 tokens from the next n-1+k positions
                    head = [original_tokens[i]]
                    for skip_tail in combinations(original_tokens[i + 1:i + n + k], n - 1):
                        tokens_append(space_join(head + list(skip_tail)))
        return tokens

def test(text, ngram_range, k):
    vectorizer = SkipGramVectorizer(ngram_range=ngram_range, k=k)
    vectorizer.fit_transform(text)
    print(vectorizer.get_feature_names())

def main():
    text = ['Insurgents killed in ongoing fighting.']

    # 2-skip-bi-grams
    test(text, (2, 2), 2)
    # 2-skip-tri-grams
    test(text, (3, 3), 2)

if __name__ == '__main__':
    main()

This produces the following feature names:

[u'in fighting', u'in ongoing', u'insurgents in', u'insurgents killed', u'insurgents ongoing', u'killed fighting', u'killed in', u'killed ongoing', u'ongoing fighting'] 
[u'in ongoing fighting', u'insurgents in fighting', u'insurgents in ongoing', u'insurgents killed fighting', u'insurgents killed in', u'insurgents killed ongoing', u'insurgents ongoing fighting', u'killed in fighting', u'killed in ongoing', u'killed ongoing fighting'] 

Note that I basically took the _word_ngrams function from the VectorizerMixin class and replaced the line

tokens_append(space_join(original_tokens[i: i + n])) 

with the following:

head = [original_tokens[i]]      
for skip_tail in combinations(original_tokens[i+1:i+n+k], n-1): 
    tokens_append(space_join(head + list(skip_tail))) 
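
To see what that change does, here is the inner loop run by hand for the first head token of the example sentence (n=2, k=2):

from itertools import combinations

tokens = ['insurgents', 'killed', 'in', 'ongoing', 'fighting']
n, k, i = 2, 2, 0  # 2-skip-bi-grams, head token 'insurgents'

head = [tokens[i]]
# the tail may come from any of the next n-1+k = 3 positions instead of
# only the immediately following n-1 = 1 position
for skip_tail in combinations(tokens[i + 1:i + n + k], n - 1):
    print(' '.join(head + list(skip_tail)))
# insurgents killed
# insurgents in
# insurgents ongoing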