2017-04-23

I am using the stop-word list together with the sentence tokenizer, but when I print the filtered sentences, the result still includes the stop words. The problem is that the stop words are not ignored in the output. How can I remove stop words when using the sentence tokenizer?

import string
import nltk
from nltk import sent_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

userinput1 = input("Enter file name:")
myfile1 = open(userinput1).read()
stop_words = set(stopwords.words("english"))
word1 = nltk.sent_tokenize(myfile1)
filtration_sentence = []
for w in word1:
    word = sent_tokenize(myfile1)
    filtered_sentence = [w for w in word if not w in stop_words]
    print(filtered_sentence)

userinput2 = input("Enter file name:")
myfile2 = open(userinput2).read()
stop_words = set(stopwords.words("english"))
word2 = nltk.sent_tokenize(myfile2)
filtration_sentence = []
for w in word2:
    word = sent_tokenize(myfile2)
    filtered_sentence = [w for w in word if not w in stop_words]
    print(filtered_sentence)

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

'''remove punctuation, lowercase, stem'''
def normalize(text):
    return stem_tokens(nltk.sent_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(myfile1, myfile2):
    tfidf = vectorizer.fit_transform([myfile1, myfile2])
    return ((tfidf * tfidf.T).A)[0, 1]

print(cosine_sim(myfile1, myfile2))

Answer


I don't think you can remove stop words from a sentence directly. You first have to split each sentence into words, for example with nltk.word_tokenize. Then, for each word, you check whether it is in the stop-word list. Here is an example:

import nltk 
from nltk.corpus import stopwords 
stopwords_en = set(stopwords.words('english')) 

sents = nltk.sent_tokenize("This is an example sentence. We will remove stop words from this") 

sents_rm_stopwords = [] 
for sent in sents: 
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w.lower() not in stopwords_en)) 

Output

['example sentence .', 'remove stop words'] 

You can also remove punctuation using string.punctuation:

import string 
stopwords_punctuation = stopwords_en.union(string.punctuation) # merge set together 
How do I use string.punctuation? @titipata – Muhammad

'import string' and 'string.punctuation', then you can do 'stopwords_en.union(string.punctuation)'. – titipata

Okay, I am working on implementing this. One more question: my code above gives the cosine similarity between two files, but I want it to show the similar sentences between the two files. How can I print them? @titipata – Muhammad