2016-03-02

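Finding term frequency and inverse document frequency using NLTK (Python 3.5)

I am trying to use NLTK to run term frequency (TF) and inverse document frequency (IDF) analysis on a batch of files (they happen to be corporate press releases from IBM). I know that whether NLTK has TF-IDF functionality has been disputed on SO before, but I found documentation indicating that the module does provide it: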

http://www.nltk.org/_modules/nltk/text.html

http://www.nltk.org/api/nltk.html#nltk.text.TextCollection

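I have never seen or used "self" or __init__ in code before. Here is what I have so far; any suggestions on how to modify it would be much appreciated. What I currently have returns nothing. I also do not really understand what "source", "self", "term", and "text" mean in the NLTK documentation.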

import nltk.corpus
from nltk.text import TextCollection
from nltk.corpus import gutenberg
gutenberg.fileids()

ibm1 = gutenberg.words('ibm-github.txt')
ibm2 = gutenberg.words('ibm-alior.txt')

mytexts = TextCollection([ibm1, ibm2])
term = 'software'

def __init__(self, source):
    if hasattr(source, 'words'):
        source = [source.words(f) for f in source.fileids()]

    self._texts = source
    Text.__init__(self, LazyConcatenation(source))
    self._idf_cache = {}

def tf(self, term, mytexts):
    result = mytexts.count(term)/len(mytexts)
    print(result)

Answer

from nltk.text import TextCollection 
from nltk.book import text1, text2, text3 

mytexts = TextCollection([text1, text2, text3]) 

# Print the IDF of a word 
print(mytexts.idf("Moby")) 

# tf_idf 
print(mytexts.tf_idf("Moby", text1))
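
To apply the same idea to your own documents rather than the nltk.book samples, something like the following may work. It is only a minimal sketch under a few assumptions: the two strings are made-up stand-ins for the press releases (not the real files), and tokenization is a plain lowercase str.split() rather than an NLTK tokenizer. The point is that __init__ and tf do not need to be copied out of the NLTK source. They are methods of the TextCollection class, so "self" is the collection object itself, "source" is the list of texts passed to the constructor, "term" is the word being looked up, and "text" is one document's token list.

```python
from nltk.text import TextCollection

# Hypothetical stand-ins for the two press releases; in practice, read each
# file from disk and tokenize its contents instead.
raw_docs = {
    "ibm-github.txt": "IBM and GitHub announce a software partnership for software developers",
    "ibm-alior.txt": "Alior Bank selects IBM cloud services",
}

# Each document becomes a list of lowercase tokens (simple whitespace split).
docs = {name: text.lower().split() for name, text in raw_docs.items()}

# TextCollection accepts a list of token lists; calling it runs __init__ for you.
collection = TextCollection(list(docs.values()))

term = "software"
github = docs["ibm-github.txt"]

tf = collection.tf(term, github)          # occurrences of term / tokens in this document
idf = collection.idf(term)                # log(number of documents / documents containing term)
tf_idf = collection.tf_idf(term, github)  # product of tf and idf

print(tf, idf, tf_idf)
```

Note that tf, idf, and tf_idf return their values instead of printing them, so the results have to be printed or stored explicitly. A term that occurs in every document, such as "ibm" in this toy data, gets an IDF of 0.0.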