
Is there any way to implement skip-grams with the scikit-learn library? I have manually generated a list of k-skip-n-grams and passed it as the vocabulary to CountVectorizer(). Is there a way to have scikit-learn produce the skip-grams itself?

Unfortunately, its predictive performance is very poor: only 63% accuracy. However, I get 77-80% accuracy with the default n-grams from ngram_range(min, max).

Is there a better way to implement skip-grams in scikit?

Here is part of my code:

corpus = GetCorpus()  # reads the texts from a file into a list

vocabulary = list(GetVocabulary(corpus, k, n))
# returns the k-skip-n-grams of the corpus

vec = CountVectorizer(
    tokenizer=lambda x: x.split(),
    ngram_range=(2, 2),
    stop_words=stopWords,
    vocabulary=vocabulary)
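
For reference, a k-skip-n-gram vocabulary like the one GetVocabulary returns could be built along these lines (a minimal sketch with a hypothetical helper; the actual implementation is not shown above):

from itertools import combinations

def skip_ngrams(tokens, k, n):
    # hypothetical stand-in for GetVocabulary: fix a head token, then
    # choose the remaining n-1 tokens from the next n-1+k positions
    grams = set()
    for i in range(len(tokens) - n + 1):
        for tail in combinations(tokens[i + 1:i + n + k], n - 1):
            grams.add(' '.join([tokens[i]] + list(tail)))
    return grams

# 1-skip-2-grams of a toy sentence
print(sorted(skip_ngrams('the rain in spain'.split(), k=1, n=2)))
# ['in spain', 'rain in', 'rain spain', 'the in', 'the rain']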

Answers


To vectorize text with skip-grams in scikit-learn, simply passing the skip-gram tokens as the vocabulary to CountVectorizer will not work. You need to modify the way tokens are processed, which can be done with a custom analyzer. Below is an example vectorizer that produces 1-skip-2-grams,

from toolz import compose
from toolz.curried import map as cmap, sliding_window, pluck
from sklearn.feature_extraction.text import CountVectorizer

class SkipGramVectorizer(CountVectorizer):
    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stop_words = self.get_stop_words()
        tokenize = self.build_tokenizer()
        return lambda doc: self._word_skip_grams(
            compose(tokenize, preprocess, self.decode)(doc),
            stop_words)

    def _word_skip_grams(self, tokens, stop_words=None):
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        # 1-skip-2-grams: slide a window of 3 tokens over the text and
        # keep only the first and third token of each window
        return compose(cmap(' '.join), pluck([0, 2]), sliding_window(3))(tokens)

For example, given the text from this Wikipedia example,

text = ['the rain in Spain falls mainly on the plain'] 

vect = SkipGramVectorizer() 
vect.fit(text) 
vect.get_feature_names() 

this vectorizer will produce the following tokens,

['falls on', 'in falls', 'mainly the', 'on plain', 
'rain spain', 'spain mainly', 'the in'] 
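
Since vect has already been fitted above, transforming the same text gives an ordinary scikit-learn count matrix with one column per skip-gram; here each pair occurs exactly once:

X = vect.transform(text)
print(X.toarray())
# [[1 1 1 1 1 1 1]]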

Thanks for your reply, bro. I'll try it soon and let you know how it goes.


I came up with my own implementation of a skip-gram vectorizer. It is inspired by this post. To limit the feature space, I also restrict skip-grams from crossing sentence boundaries (using nltk.sent_tokenize). Here is my code:

import nltk
from itertools import combinations
from toolz import compose
from sklearn.feature_extraction.text import CountVectorizer

class SkipGramVectorizer(CountVectorizer):

    def __init__(self, k=1, **kwds):
        super(SkipGramVectorizer, self).__init__(**kwds)
        self.k = k

    def build_sent_analyzer(self, preprocess, stop_words, tokenize):
        return lambda sent: self._word_skip_grams(
            compose(tokenize, preprocess, self.decode)(sent),
            stop_words)

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stop_words = self.get_stop_words()
        tokenize = self.build_tokenizer()
        sent_analyze = self.build_sent_analyzer(preprocess, stop_words, tokenize)

        return lambda doc: self._sent_skip_grams(doc, sent_analyze)

    def _sent_skip_grams(self, doc, sent_analyze):
        # split the document into sentences first so that skip-grams
        # never cross sentence boundaries
        skip_grams = []
        for sent in nltk.sent_tokenize(doc):
            skip_grams.extend(sent_analyze(sent))
        return skip_grams

    def _word_skip_grams(self, tokens, stop_words=None):
        """Turn tokens into a sequence of k-skip-n-grams after stop words filtering"""
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        # handle token n-grams
        min_n, max_n = self.ngram_range
        k = self.k
        if max_n != 1:
            original_tokens = tokens
            if min_n == 1:
                # no need to do any slicing for unigrams
                # just iterate through the original tokens
                tokens = list(original_tokens)
                min_n += 1
            else:
                tokens = []

            n_original_tokens = len(original_tokens)

            # bind method outside of loop to reduce overhead
            tokens_append = tokens.append
            space_join = " ".join

            for n in range(min_n,
                           min(max_n + 1, n_original_tokens + 1)):
                for i in range(n_original_tokens - n + 1):
                    # k-skip-n-grams: fix the head token, then draw the
                    # remaining n-1 tokens from the next n-1+k positions
                    head = [original_tokens[i]]
                    for skip_tail in combinations(original_tokens[i + 1:i + n + k], n - 1):
                        tokens_append(space_join(head + list(skip_tail)))
        return tokens

def test(text, ngram_range, k):
    vectorizer = SkipGramVectorizer(ngram_range=ngram_range, k=k)
    vectorizer.fit_transform(text)
    print(vectorizer.get_feature_names())

def main():
    text = ['Insurgents killed in ongoing fighting.']

    # 2-skip-bi-grams
    test(text, (2, 2), 2)
    # 2-skip-tri-grams
    test(text, (3, 3), 2)

if __name__ == '__main__':
    main()

This produces the following feature names:

[u'in fighting', u'in ongoing', u'insurgents in', u'insurgents killed', u'insurgents ongoing', u'killed fighting', u'killed in', u'killed ongoing', u'ongoing fighting'] 
[u'in ongoing fighting', u'insurgents in fighting', u'insurgents in ongoing', u'insurgents killed fighting', u'insurgents killed in', u'insurgents killed ongoing', u'insurgents ongoing fighting', u'killed in fighting', u'killed in ongoing', u'killed ongoing fighting'] 

Note that I basically took the _word_ngrams function from the VectorizerMixin class and replaced the line

tokens_append(space_join(original_tokens[i: i + n])) 

with the following:

head = [original_tokens[i]]      
for skip_tail in combinations(original_tokens[i+1:i+n+k], n-1): 
    tokens_append(space_join(head + list(skip_tail))) 
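
To see what that change does, here is the inner loop run by hand for the first head token of the example sentence (n=2, k=2):

from itertools import combinations

tokens = ['insurgents', 'killed', 'in', 'ongoing', 'fighting']
n, k, i = 2, 2, 0  # 2-skip-bi-grams, head token 'insurgents'

head = [tokens[i]]
# the tail may come from any of the next n-1+k = 3 positions instead of
# only the immediately following n-1 = 1 position
for skip_tail in combinations(tokens[i + 1:i + n + k], n - 1):
    print(' '.join(head + list(skip_tail)))
# insurgents killed
# insurgents in
# insurgents ongoing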