2017-10-11 144 views
2

Does gensim.corpora.Dictionary save term frequencies?

With gensim.corpora.Dictionary, it is possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):

from nltk.corpus import brown 
from gensim.corpora import Dictionary 

documents = brown.sents() 
brown_dict = Dictionary(documents) 

# The 100th word in the dictionary: 'these' 
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents') 

[out]:

The word "these" appears in 1213 documents 

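There is also the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:

filter_n_most_frequent(remove_n) Filter out the 'remove_n' most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

Does filter_n_most_frequent remove the n most frequent tokens based on document frequency or on term frequency?

If it is the latter, is there some way to access the term frequency of words in the gensim.corpora.Dictionary object?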

Answers

2

No, gensim.corpora.Dictionary does not save term frequencies. You can see the source code here. The class only stores the following member variables:

self.token2id = {}  # token -> tokenId
self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

self.num_docs = 0  # number of documents processed
self.num_pos = 0  # total number of corpus positions
self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

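This means that everywhere in the class, "frequency" means document frequency, never term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.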
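To make the distinction concrete, here is a minimal sketch that uses only the standard library, not gensim itself. The toy `documents` list and the function names `document_frequencies` and `term_frequencies` are made up for illustration. Document frequency counts a token once per document; term frequency counts every occurrence.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each token, the number of documents that contain it."""
    dfs = Counter()
    for doc in documents:
        dfs.update(set(doc))  # set() so a token counts once per document
    return dfs


def term_frequencies(documents):
    """Count every occurrence of each token across all documents."""
    tfs = Counter()
    for doc in documents:
        tfs.update(doc)
    return tfs


# Hypothetical toy corpus: each document is a list of tokens
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "dog", "dog", "dog"],
    ["the", "bird"],
]

dfs = document_frequencies(documents)
tfs = term_frequencies(documents)

print("dog:", dfs["dog"], "documents,", tfs["dog"], "occurrences")
print("the:", dfs["the"], "documents,", tfs["the"], "occurrences")
```

In this toy corpus "the" ranks first by document frequency while "dog" ranks first by term frequency, so a filter based on document frequency would drop "the" before "dog".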

0

Could you do something like this?

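import pandas as pd
from gensim import corpora

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(sent) for sent in documents]
vocab = list(dictionary.values())  # list of terms in the dictionary
vocab_tf = [dict(i) for i in corpus]  # one {token_id: count} dict per document
# Sum each token's counts over all documents, ordered by token id
vocab_tf = list(pd.DataFrame(vocab_tf).sum(axis=0).sort_index())  # list of term frequencies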
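A DataFrame with one column per token can become large for a big corpus. As an alternative sketch, the same totals can be accumulated with collections.Counter. The function name `term_frequencies_from_bow` and the toy `corpus` are hypothetical; the input is assumed to have the shape that doc2bow returns, that is, a list of (token_id, count) pairs per document.

```python
from collections import Counter


def term_frequencies_from_bow(corpus):
    """Sum per-document token counts into corpus-wide term frequencies."""
    tf = Counter()
    for bow in corpus:
        for token_id, count in bow:
            tf[token_id] += count
    return tf


# Hypothetical bag-of-words corpus: two documents, token ids 0 to 2
corpus = [
    [(0, 1), (1, 2)],
    [(1, 1), (2, 3)],
]

tf = term_frequencies_from_bow(corpus)
print(sorted(tf.items()))
```

The resulting counter maps token ids to term frequencies, so each id can be looked up in the dictionary to recover the corresponding word.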