Best way to make sense of input text before applying n-grams

I am currently reading text from an Excel file and applying bigrams to it. finalList, the list used in the sample code below, holds the input words read from the Excel file.

Stop words are removed from the input with the help of the following library:

from nltk.corpus import stopwords

Bigram logic applied to the list of input words:

from nltk.util import ngrams
bigram = ngrams(finalList, 2)

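Input text: I completed my end-to-end process.

Current output: completed end, end end, end process.

Desired output: completed end-to-end, end-to-end process.

This means that some word groups, such as (end-to-end), should be treated as a single word.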

Source

2017-10-09 Madhuri

Check your tokenization? – alexis

Use a proper tokenizer: http://nlp.cogcomp.org/ – Daniel
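
The tokenization issue these comments point to can be reproduced with a small standard-library sketch. The asker's actual tokenizer is not shown, so everything here is an assumption for illustration: the tiny `STOP_WORDS` set stands in for NLTK's English stop-word list (which also contains "i", "my", and "to"), and `bigrams_after_cleanup` and the two regular expressions are hypothetical. Splitting on hyphens lets "to" be dropped as a stop word, which matches the current output; a pattern that keeps hyphen-joined words together gives the desired one.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
STOP_WORDS = {"i", "my", "to"}

def bigrams_after_cleanup(text, token_pattern):
    """Tokenize with token_pattern, drop stop words, return bigram strings."""
    tokens = re.findall(token_pattern, text)
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]

text = "I completed my end-to-end process."

# Splits "end-to-end" into "end", "to", "end"; "to" is then removed.
split_on_hyphen = bigrams_after_cleanup(text, r"[A-Za-z0-9]+")

# Keeps hyphen-joined words as single tokens.
keep_hyphen = bigrams_after_cleanup(text, r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

print(split_on_hyphen)  # ['completed end', 'end end', 'end process']
print(keep_hyphen)      # ['completed end-to-end', 'end-to-end process']
```

If the tokens arrive already split from the Excel file, the same idea applies one step earlier: the hyphenated phrase has to survive as one token before stop words are removed.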

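To solve your problem, clean the text with a regular expression that removes the unwanted characters (punctuation here, which this answer loosely calls stop words) but leaves hyphens alone. See this example: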

import re

text = 'I completed my end-to-end process..:?'
# Strip the trailing punctuation: any dots, a colon, then any question marks.
# The hyphen is not in the pattern, so 'end-to-end' stays intact.
pattern = re.compile(r"\.*:\?*")
new_text = re.sub(pattern, '', text)
print(new_text)
# Output: I completed my end-to-end process


# Now you can generate bigrams manually.
# 1. Tokenize the new text
tok = new_text.split()
print(tok)  # For a long token list, print only the first five: print(tok[:5])
# Output: ['I', 'completed', 'my', 'end-to-end', 'process']

# 2. Loop over the list and generate bigrams, store them in a var called bigrams 
bigrams = [] 
for i in range(len(tok) - 1): # -1 to avoid index error 
    bigram = tok[i] + ' ' + tok[i + 1] 
    bigrams.append(bigram) 


# 3. Print your bigrams
for bi in bigrams:
    print(bi, end=', ')

# Output: I completed, completed my, my end-to-end, end-to-end process,
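
The manual loop can also be written with `zip`, which extends to any n without index arithmetic. This is only a sketch, assuming the tokens have already been cleaned as in the steps above; `make_ngrams` is a hypothetical helper name, not an NLTK function, and it returns tuples rather than space-joined strings.

```python
def make_ngrams(tokens, n):
    """Return a list of n-grams (as tuples) over a list of tokens."""
    # zip stops at the shortest slice, so no "- 1" bookkeeping is needed.
    return list(zip(*(tokens[i:] for i in range(n))))

tok = ['I', 'completed', 'my', 'end-to-end', 'process']

bigrams = make_ngrams(tok, 2)
trigrams = make_ngrams(tok, 3)

print(', '.join(' '.join(gram) for gram in bigrams))
# I completed, completed my, my end-to-end, end-to-end process
```

Joining with `', '.join(...)` also avoids the trailing comma that `print(bi, end=', ')` leaves behind.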

I hope this helps!

Source

2017-10-12 22:43:10 Mohammed
