
I am trying to use TfidfVectorizer with Russian stop words and get ValueError: not a built-in stop list: russian

Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='russian') 
Z = Tfidf.fit_transform(X) 

When applying the TfidfVectorizer, I get

ValueError: not a built-in stop list: russian 

When I use the English stop words, it works correctly:

Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='english') 
Z = Tfidf.fit_transform(X) 

How can I fix this? Full traceback:

<ipython-input-118-e787bf15d612> in <module>() 
     1 Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='russian') 
----> 2 Z = Tfidf.fit_transform(X) 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y) 
    1303    Tf-idf-weighted document-term matrix. 
    1304   """ 
-> 1305   X = super(TfidfVectorizer, self).fit_transform(raw_documents) 
    1306   self._tfidf.fit(X) 
    1307   # X is already a transformed view of raw_documents so 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y) 
    815 
    816   vocabulary, X = self._count_vocab(raw_documents, 
--> 817           self.fixed_vocabulary_) 
    818 
    819   if self.binary: 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab) 
    745    vocabulary.default_factory = vocabulary.__len__ 
    746 
--> 747   analyze = self.build_analyzer() 
    748   j_indices = _make_int_array() 
    749   indptr = _make_int_array() 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in build_analyzer(self) 
    232 
    233   elif self.analyzer == 'word': 
--> 234    stop_words = self.get_stop_words() 
    235    tokenize = self.build_tokenizer() 
    236 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in get_stop_words(self) 
    215  def get_stop_words(self): 
    216   """Build or fetch the effective stop words list""" 
--> 217   return _check_stop_list(self.stop_words) 
    218 
    219  def build_analyzer(self): 

C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _check_stop_list(stop) 
    88   return ENGLISH_STOP_WORDS 
    89  elif isinstance(stop, six.string_types): 
---> 90   raise ValueError("not a built-in stop list: %s" % stop) 
    91  elif stop is None: 
    92   return None 

ValueError: not a built-in stop list: russian 

Answer


Maybe look at the documentation first before posting?

stop_words : string {‘english’}, list, or None (default)

If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value.

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
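Since 'english' is the only supported string value, one way around this is to pass the Russian stop words explicitly as a list, for example the one shipped with NLTK's stopwords corpus. A minimal sketch, assuming NLTK is installed, its 'stopwords' corpus has been downloaded, and X is the document collection from the question:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# stopwords.words('russian') returns a plain Python list of words,
# which is exactly what the stop_words parameter accepts
russian_stopwords = stopwords.words('russian')

Tfidf = TfidfVectorizer(stop_words=russian_stopwords)
Z = Tfidf.fit_transform(X)  # X: iterable of raw documents, as in the question

Any other list of Russian stop words (e.g. a custom file read into a list) would work the same way, since the vectorizer only needs an iterable of tokens to exclude.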