Removing stop words and string.punctuation - VoidCC

I can't figure out why this doesn't work:

import nltk 
from nltk.corpus import stopwords 
import string 

with open('moby.txt', 'r') as f: 
    moby_raw = f.read() 
    stop = set(stopwords.words('english')) 
    moby_tokens = nltk.word_tokenize(moby_raw) 
    text_no_stop_words_punct = [t for t in moby_tokens if t not in stop or t not in string.punctuation] 

    print(text_no_stop_words_punct)

Looking at the output, I get this:

[...';', 'surging', 'from', 'side', 'to', 'side', ';', 'spasmodically', 'dilating', 'and', 'contracting',...]

It seems the punctuation is still there. What am I doing wrong?

Source

2017-08-04 Lime In The Coconut

The condition has to use and, not or:

if t not in stop and t not in string.punctuation

Or:

if not (t in stop or t in string.punctuation):

Or:

all_stops = stop | set(string.punctuation) 
if t not in all_stops:

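The last solution is the fastest: it does a single set lookup per token instead of two membership tests, one of which scans a string.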
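
As a rough sketch of that last variant, the snippet below builds the combined set once and filters against it. To stay runnable without the NLTK corpora, it uses a small hand-written stop-word set and an already tokenized list; both are stand-ins for illustration, not the real stopwords.words('english') or nltk.word_tokenize output.

```python
import string

# Stand-in for set(stopwords.words('english')); an assumption for illustration.
stop = {"from", "to", "and"}

# Stand-in for nltk.word_tokenize(moby_raw), based on the output shown in the question.
moby_tokens = [";", "surging", "from", "side", "to", "side", ";",
               "spasmodically", "dilating", "and", "contracting"]

# Build the combined set once, outside the comprehension.
all_stops = stop | set(string.punctuation)

# One set lookup per token.
text_no_stop_words_punct = [t for t in moby_tokens if t not in all_stops]

print(text_no_stop_words_punct)
# ['surging', 'side', 'side', 'spasmodically', 'dilating', 'contracting']
```

Note that set(string.punctuation) holds single characters only, so a multi-character token such as '--' is not filtered by this check and would need separate handling.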

Source

2017-08-04 22:21:23 DyZ

Try changing or to and in the line below, so that the list keeps only tokens that are neither stop words nor punctuation.

text_no_stop_words = [t for t in moby_tokens if t not in stop or t not in string.punctuation]

Source

2017-08-04 22:21:05 vealkind

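Close. You need to use and rather than or in the comparison. If the token is a punctuation mark like ";", it is not in stop, so the first condition is already true and Python never checks whether it is in string.punctuation.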

text_no_stop_words_punct = [t for t in moby_tokens if t not in stop and t not in string.punctuation]
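
To see why the or version keeps everything, it can help to evaluate both conditions for a few tokens. The sketch below uses a tiny stand-in stop-word set (an assumption, not the NLTK list) and two hypothetical helper functions. With or, a token would have to be in both collections to be dropped, which does not happen for any token here.

```python
import string

stop = {"from", "to", "and"}  # hypothetical stand-in for the NLTK stop-word set

def keep_with_or(t):
    # False only if t is in BOTH stop and string.punctuation.
    return t not in stop or t not in string.punctuation

def keep_with_and(t):
    # True only if t is in NEITHER stop nor string.punctuation.
    return t not in stop and t not in string.punctuation

for t in [";", "from", "surging"]:
    print(repr(t), keep_with_or(t), keep_with_and(t))
# ';' True False
# 'from' True False
# 'surging' True True
```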

Source

2017-08-04 22:24:20
