Does gensim.corpora.Dictionary have term frequency saved?


From gensim.corpora.Dictionary it is possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):

from nltk.corpus import brown 
from gensim.corpora import Dictionary 

documents = brown.sents() 
brown_dict = Dictionary(documents) 

# The 100th word in the dictionary: 'these' 
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')

[out]:

The word "these" appears in 1213 documents
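
To make the distinction behind this question concrete, here is a minimal sketch in plain Python (no gensim) of how document frequency and term frequency differ. The toy corpus and the variable names are made up for illustration and are not part of the Brown corpus example above.

```python
from collections import Counter

# Toy tokenized corpus (invented for illustration)
toy_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["cat", "and", "dog"],
]

# Term frequency: total number of occurrences across the whole corpus
term_freq = Counter(token for doc in toy_docs for token in doc)

# Document frequency: number of documents that contain the token at least once
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

print(term_freq["the"], doc_freq["the"])  # prints: 3 2
```

"the" occurs three times in total but in only two documents, so the two counts can rank tokens differently.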

And there is the filter_n_most_frequent(remove_n) function, which can remove the n most frequent tokens:

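filter_n_most_frequent(remove_n) Filter out the 'remove_n' most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!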

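Does the filter_n_most_frequent function remove the n most frequent tokens based on document frequency or on term frequency?

If it is the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?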

Source

2017-10-11 alvas

No, gensim.corpora.Dictionary does not save term frequency. You can see this in the source code. The class only stores the following member variables:

self.token2id = {}  # token -> tokenId
self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

self.num_docs = 0  # number of documents processed
self.num_pos = 0  # total number of corpus positions
self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

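This means that everywhere in the class, frequency means document frequency and never term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.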
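
As a rough illustration of what this implies, the sketch below ranks token ids by document frequency, drops the top n, and then reassigns consecutive ids to close the gaps. It is a plain-Python approximation written for this answer, not gensim's actual code; the function name drop_n_most_frequent and the toy token2id and dfs values are hypothetical.

```python
def drop_n_most_frequent(token2id, dfs, remove_n):
    """Illustrative stand-in for the behaviour described above; not gensim's code."""
    # Rank token ids by document frequency, highest first, and pick the top n
    worst_ids = set(sorted(dfs, key=dfs.get, reverse=True)[:remove_n])

    # Keep the surviving tokens in their original id order
    kept = [
        (token, old_id)
        for token, old_id in sorted(token2id.items(), key=lambda kv: kv[1])
        if old_id not in worst_ids
    ]

    # Reassign consecutive ids so that no gaps remain
    new_token2id = {}
    new_dfs = {}
    for new_id, (token, old_id) in enumerate(kept):
        new_token2id[token] = new_id
        new_dfs[new_id] = dfs[old_id]
    return new_token2id, new_dfs


# Toy data (invented): "the" appears in the most documents
token2id = {"the": 0, "cat": 1, "sat": 2, "dog": 3}
dfs = {0: 5, 1: 2, 2: 3, 3: 1}

new_token2id, new_dfs = drop_n_most_frequent(token2id, dfs, 1)
print(new_token2id)  # prints: {'cat': 0, 'sat': 1, 'dog': 2}
```

Note how "cat" moves from id 1 to id 0, which is the id change the docstring warns about.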

Source

2017-10-17 05:51:36 ubadub

Could you do something like this?

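from gensim import corpora
import pandas as pd

dictionary = corpora.Dictionary(documents)  # documents: tokenized texts, e.g. brown.sents()
corpus = [dictionary.doc2bow(sent) for sent in documents]
bow_dicts = [dict(bow) for bow in corpus]  # one {token_id: count} dict per document
# Sum the counts per token id; sort by id so the two lists below line up
tf_by_id = pd.DataFrame(bow_dicts).sum(axis=0).sort_index()
vocab = [dictionary[i] for i in tf_by_id.index]  # list of terms in the dictionary
vocab_tf = tf_by_id.tolist()  # list of term frequencies, aligned with vocab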
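
A lighter-weight variant of the same idea can be sketched with collections.Counter instead of a DataFrame. It assumes the corpus is a list of bag-of-words documents in the (token_id, count) form that doc2bow returns; the toy_corpus and toy_id2token values below are invented stand-ins so the snippet runs without gensim.

```python
from collections import Counter

# Hypothetical stand-ins for a dictionary's id -> token mapping and for doc2bow output
toy_id2token = {0: "cat", 1: "sat", 2: "the"}
toy_corpus = [
    [(0, 1), (1, 1), (2, 2)],  # first document: "the" occurs twice
    [(1, 1), (2, 1)],          # second document
]

# Accumulate the per-document counts into corpus-wide term frequencies
term_freq_by_id = Counter()
for bow in toy_corpus:
    for token_id, count in bow:
        term_freq_by_id[token_id] += count

# Map ids back to tokens for readability
term_freq = {toy_id2token[i]: n for i, n in term_freq_by_id.items()}
print(term_freq)  # prints: {'cat': 1, 'sat': 2, 'the': 3}
```

This avoids building a documents-by-vocabulary table, which can get large for a corpus with many sentences.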

Source

2017-12-28 17:01:34
