2011-03-30 73 views
73

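How to remove stop words using NLTK or Python

I have a dataset from which I would like to remove stop words using

stopwords.words('english') 

I'm struggling with how to use this within my code to simply take these words out. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.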

+4

Where did you get the stop words from? Is this from NLTK? – 2014-04-07 22:15:14

+25

@MattO'Brien `from nltk.corpus import stopwords` for future googlers – danodonovan 2015-05-13 21:11:43

+11

To make the stop word dictionary available, you also need to run `nltk.download("stopwords")`. – sffc 2015-07-10 17:12:03

Answers

14

I think you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stop word
+3

This will be a lot slower than Daren Thomas's list comprehension... – drevicko 2016-08-26 10:54:01

147
from nltk.corpus import stopwords 
# ... 
filtered_words = [word for word in word_list if word not in stopwords.words('english')] 
+0

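Thanks for both answers, they both work, although it seems I have a flaw in my code that prevents the stop list from working correctly. Should this be a new question? Not sure how things work around here yet! – Alex 2011-03-30 14:29:58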

+29

To improve performance, consider `stops = set(stopwords.words("english"))` instead. – isakkarlsson 2013-09-07 22:04:31
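
A minimal sketch of the set-based approach suggested in the comment above. The function name `remove_stopwords` and the tiny `sample_stops` list are assumptions made for illustration so the snippet runs without NLTK data; with NLTK installed you would pass `stopwords.words('english')` as the second argument instead.

```python
def remove_stopwords(word_list, stop_words):
    """Return the words of word_list that are not in stop_words, keeping order."""
    # Build the set once, outside the loop: membership tests on a set are
    # constant time on average, while testing against a list scans the list.
    stops = set(stop_words)
    return [word for word in word_list if word not in stops]


# Hypothetical sample data standing in for stopwords.words('english').
sample_stops = ["a", "an", "the", "in", "is"]
words = ["the", "cat", "is", "in", "the", "garden"]

filtered = remove_stopwords(words, sample_stops)
print(filtered)
```

The point of the design is that the stop word list is converted to a set a single time, rather than calling `stopwords.words('english')` again for every word in the list.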

+1

>>> import nltk >>> nltk.download() [Source](http://www.nltk.org/data.html) – 2017-12-14 20:33:51

19

You could also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english'))) 
+6

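Note: this converts the sentence to a set, which removes all duplicate words, so you will not be able to use frequency counts on the result. – 2017-02-21 23:59:40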
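
To illustrate the note above, here is a small sketch contrasting the set difference with a filter that keeps duplicates, so that frequency counts still work. The token list and the stop word set are made up for the example, and `filter_tokens` is a hypothetical helper name, not part of NLTK.

```python
from collections import Counter


def filter_tokens(tokens, stop_words):
    """Drop stop words but keep duplicates and the original token order."""
    stops = set(stop_words)
    return [t for t in tokens if t not in stops]


# Hypothetical sample data; not taken from the NLTK stop word list.
sample_stops = {"the", "on", "a"}
tokens = ["the", "cat", "sat", "on", "the", "mat", "cat"]

unique_only = set(tokens) - sample_stops    # duplicates and order are lost
kept = filter_tokens(tokens, sample_stops)  # duplicates and order survive
counts = Counter(kept)

print(sorted(unique_only))
print(kept)
print(counts["cat"])
```

Here only the stop words are turned into a set; the tokens stay in a list, which is what makes the later `Counter` meaningful.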

0
# 1) Build a new list that contains only the words that are not stop words
print("Enter the string from which you want to remove stop words:")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]  # renamed from `list` to avoid shadowing the built-in
another_list = []
for x in userstring:
    if x not in stop_list:  # compare against the stop list and keep everything else
        another_list.append(x)
for x in another_list:
    print(x, end=' ')

# 2) Alternative using .remove on the original list
print("Enter the string from which you want to remove stop words:")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:  # iterate over a copy: removing from a list while iterating over it skips elements
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
0

You can use this function; note that you need to lowercase all the words.

from nltk.corpus import stopwords 

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
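
If you want the case-insensitive comparison but would rather keep the original capitalization in the output, a variation along these lines might work. This is only a sketch: `remove_stopwords_keep_case` and the sample data are assumptions for illustration, and in practice you would pass `stopwords.words("english")` as the stop word list.

```python
def remove_stopwords_keep_case(word_list, stop_words):
    """Compare in lowercase, but return the words with their original casing."""
    stops = {s.lower() for s in stop_words}
    return [word for word in word_list if word.lower() not in stops]


# Hypothetical sample data standing in for the NLTK stop word list.
sample_stops = ["the", "is"]
words = ["The", "Eiffel", "Tower", "is", "tall"]

result = remove_stopwords_keep_case(words, sample_stops)
print(result)
```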
1

Using filter:

from nltk.corpus import stopwords 
# ... 
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list)) 
4

To exclude all kinds of stop words, including the NLTK stop words, you could do something like this:

from many_stop_words import get_stop_words 
from nltk.corpus import stopwords 

stop_words = list(get_stop_words('en'))   #About 900 stopwords 
nltk_words = list(stopwords.words('english')) #About 150 stopwords 
stop_words.extend(nltk_words) 

output = [w for w in word_list if w not in stop_words]