2017-10-09 61 views
1

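I am currently reading text from an Excel file and applying bigrams to it. The list finalList used in the sample code below holds the input words read from the input Excel file. What is the best way to understand the input text before applying n-grams?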

Stop words are removed from the input with the help of the following library:

from nltk.corpus import stopwords 

The bigram logic is then applied to the list of input words:

from nltk.util import ngrams
bigram = ngrams(finalList, 2)

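Input text: I completed my end-to-end process.

Current output: completed end, end end, end process.

Desired output: completed end-to-end, end-to-end process.

This means that some word groups, such as (end-to-end), should be treated as one word.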
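For reference, the current output can be reproduced without NLTK. This is only a minimal sketch under two assumptions: that the tokenizer splits on hyphens, and that "I", "my" and "to" are in the stop word list. The small `STOP_WORDS` set is a hypothetical stand-in for the NLTK list, and `final_list` plays the role of finalList.

```python
import re

# Hypothetical stand-in for the NLTK stop word list; only the entries needed here.
STOP_WORDS = {"i", "my", "to"}

text = "I completed my end-to-end process."

# Assumed tokenization: every run of letters is a token, so hyphens split words.
tokens = re.findall(r"[A-Za-z]+", text)

# Drop stop words; "to" disappears, leaving two adjacent "end" tokens.
final_list = [t for t in tokens if t.lower() not in STOP_WORDS]

# Pair each word with its successor to form bigrams.
bigrams = [" ".join(pair) for pair in zip(final_list, final_list[1:])]
print(bigrams)
```

If those assumptions hold, "end-to-end" is broken into "end", "to", "end" before the bigrams are built, which would explain the "end end" pair.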

+2

Check your tokenizer? – alexis

+1

Use a proper tokenizer: http://nlp.cogcomp.org/ – Daniel

Answers

1

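To solve your problem, you need to clean the text with a regular expression that removes the unwanted trailing characters but keeps the hyphen. See this example: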

import re
text = 'I completed my end-to-end process..:?'
# Remove runs of periods, colons and question marks. The hyphen is deliberately
# left out of the character class so that 'end-to-end' stays in one piece.
pattern = re.compile(r"[.:?]+")
new_text = pattern.sub('', text)
print(new_text)
# Output: I completed my end-to-end process


# Now you can generate bigrams manually. 
# 1. Tokenize the new text
tok = new_text.split()
print(tok)  # If the token list is huge, print only the first five: print(tok[:5])
# Output: ['I', 'completed', 'my', 'end-to-end', 'process']

# 2. Loop over the list and generate bigrams, store them in a var called bigrams 
bigrams = [] 
for i in range(len(tok) - 1): # -1 to avoid index error 
    bigram = tok[i] + ' ' + tok[i + 1] 
    bigrams.append(bigram) 


# 3. Print your bigrams 
for bi in bigrams: 
    print(bi, end = ', ') 

# Output: I completed, completed my, my end-to-end, end-to-end process,
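The same steps can also be folded into one small helper. This is only a sketch: `hyphen_aware_bigrams` is a hypothetical name, the tokenizer is a plain regular expression rather than anything from NLTK, and the stop word set passed in the second call is a hand-picked assumption, not the NLTK list.

```python
import re

def hyphen_aware_bigrams(text, stop_words=frozenset()):
    """Return bigrams as strings, keeping hyphenated words as single tokens."""
    # A token is a run of word characters, optionally joined by single hyphens,
    # so 'end-to-end' is one token and trailing punctuation is ignored.
    tokens = re.findall(r"\w+(?:-\w+)*", text)
    # Filter stop words after tokenizing, so the 'to' inside 'end-to-end' is safe.
    kept = [t for t in tokens if t.lower() not in stop_words]
    # Pair each remaining token with the one that follows it.
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]

print(hyphen_aware_bigrams("I completed my end-to-end process..:?"))
# ['I completed', 'completed my', 'my end-to-end', 'end-to-end process']

print(hyphen_aware_bigrams("I completed my end-to-end process.", {"i", "my"}))
# ['completed end-to-end', 'end-to-end process']
```

Filtering stop words only after the hyphenated tokens have been formed is the design choice that matters here: a stop word such as "to" can no longer be removed from the middle of "end-to-end".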

I hope this helps!
