2013-02-22 147 views
12

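How to print the LDA topic models from gensim? Python

Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA model?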

Looping over lda.print_topics(10) raises the following error, because print_topics() returns None:

Traceback (most recent call last): 
    File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module> 
    for top in lda.print_topics(2): 
TypeError: 'NoneType' object is not iterable 

Code:

from gensim import corpora, models, similarities 
from gensim.models import hdpmodel, ldamodel 
from itertools import izip 

documents = ["Human machine interface for lab abc computer applications", 
       "A survey of user opinion of computer system response time", 
       "The EPS user interface management system", 
       "System and human system engineering testing of EPS", 
       "Relation of user perceived response time to error measurement", 
       "The generation of random binary unordered trees", 
       "The intersection graph of paths in trees", 
       "Graph minors IV Widths of trees and well quasi ordering", 
       "Graph minors A survey"] 

# remove common words and tokenize 
stoplist = set('for a of the and to in'.split()) 
texts = [[word for word in document.lower().split() if word not in stoplist] 
     for document in documents] 

# remove words that appear only once 
all_tokens = sum(texts, []) 
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1) 
texts = [[word for word in text if word not in tokens_once] 
     for text in texts] 

dictionary = corpora.Dictionary(texts) 
corpus = [dictionary.doc2bow(text) for text in texts] 

# I can print out the topics for LSA 
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) 
corpus_lsi = lsi[corpus] 

for l,t in izip(corpus_lsi,corpus): 
    print l,"#",t 
print 
for top in lsi.print_topics(2): 
    print top 

# I can print out the documents and which is the most probable topics for each doc. 
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50) 
corpus_lda = lda[corpus] 

for l,t in izip(corpus_lda,corpus): 
    print l,"#",t 
print 

# But I am unable to print out the topics, how should i do it? 
for top in lda.print_topics(10): 
    print top 
+0

Your code is missing a few things, namely the corpus_tfidf computation. Could you add the remaining parts? – mel 2015-02-05 16:38:38

Answers

14

After some messing around, it seems that print_topics(numoftopics) on the ldamodel has a bug. My workaround is to use print_topic(topicid) instead:

>>> print lda.print_topics() 
None 
>>> for i in range(lda.num_topics):
...     print lda.print_topic(i)
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system 
... 
+4

'print_topics' is an alias of 'show_topics' for the first five topics. Just write 'lda.show_topics()'; no 'print' is needed. – mac389 2013-02-24 21:35:28

6

Are you using any logging? As stated in the docs, print_topics prints to the log file.

As @mac389 says, lda.show_topics() is the way to print to the screen.
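The behaviour described in this answer can be reproduced with the standard logging module alone. The sketch below is not gensim code: `log_topics`, the logger name `demo.ldamodel` and the topic string are hypothetical stand-ins for a method that reports its result through logging and returns nothing.

```python
import io
import logging

# Hypothetical stand-in for a model method that reports topics only through
# logging (NOT gensim code; the logger name and topic string are made up).
logger = logging.getLogger("demo.ldamodel")


def log_topics(topics):
    """Log each topic at INFO level and return None."""
    for topic_id, topic in enumerate(topics):
        logger.info("topic #%i: %s", topic_id, topic)


topics = ["0.083*response + 0.083*interface + 0.083*time"]

# Route INFO-level records to an in-memory stream so the effect is visible.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s : %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

result = log_topics(topics)
print(result)             # the return value carries no topics
print(stream.getvalue())  # the topics went to the log handler instead
```

With gensim itself, the usual way to see such messages is to call `logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)` before training, as the gensim tutorials do. Without any logging configuration, INFO-level messages are dropped by Python's default settings.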

+0

I'm not using any logging, because I need to use the topics right away. You're right, 'lda.show_topics()' or 'lda.print_topic(i)' is the way to go. – alvas 2013-03-06 23:40:11

2

Here is sample code that prints the topics:

import pickle
from gensim import corpora, models

def ExtractTopics(filename, numTopics=5):
    # filename is a pickle file holding a list of lists, each a bag of words
    with open(filename, "rb") as f:
        texts = pickle.load(f)

    # generate dictionary (named so that it does not shadow the built-in dict)
    dictionary = corpora.Dictionary(texts)

    # remove words with low freq. 3 is an arbitrary number I have picked here
    low_occurrence_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq <= 3]
    dictionary.filter_tokens(low_occurrence_ids)
    dictionary.compactify()
    corpus = [dictionary.doc2bow(t) for t in texts]
    # Generate LDA Model (no id2word is passed, so words come back as token ids)
    lda = models.ldamodel.LdaModel(corpus, num_topics=numTopics)
    # We print the topics, numbering them from 1
    for i, topic in enumerate(lda.show_topics(num_topics=numTopics, formatted=False, num_words=20), 1):
        print "Topic #" + str(i) + ":",
        for p, token_id in topic:
            # map the token id back to its word
            print dictionary[int(token_id)],
        print ""
+0

I tried running your code, passing a list of lists containing the BOW as texts. I got the following error: TypeError: show_topics() got an unexpected keyword argument 'topics' – mribot 2015-02-09 21:31:05

+1

Try num_topics. I have corrected the code above. – 2015-02-10 18:22:23

7

I think the syntax of show_topics has changed over time:

show_topics(num_topics=10, num_words=10, log=False, formatted=True) 

For num_topics topics, return the num_words most significant words (10 words per topic, by default).

The topics are returned as a list: a list of strings if formatted is True, or a list of (probability, word) 2-tuples if it is False.

If log is True, the result is also written to the log.

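Unlike LSA, there is no natural ordering between the topics in LDA. The returned subset of num_topics <= self.num_topics topics is therefore arbitrary and may change between two LDA training runs.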
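To illustrate the two return shapes, here is a minimal sketch that turns the unformatted (probability, word) pairs into strings like the ones the formatted variant returns. `format_topic` is a hypothetical helper, not part of gensim, and the numbers are made up. The exact tuple layout has differed between gensim releases, so check what your installed version returns before relying on it.

```python
def format_topic(pairs):
    """Join (probability, word) pairs into a string such as '0.083*response + 0.083*time'."""
    return " + ".join("%.3f*%s" % (prob, word) for prob, word in pairs)


# Made-up values in the (probability, word) layout described above.
unformatted = [
    [(0.083, "response"), (0.083, "interface")],
    [(0.120, "graph"), (0.095, "trees")],
]

for topic_id, pairs in enumerate(unformatted):
    print("topic #%d: %s" % (topic_id, format_topic(pairs)))
```

Keeping the pairs unformatted until the last moment makes it easy to filter or sort words by probability before printing.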

3

You can use:

for i in lda_model.show_topics(): 
    print i[0], i[1] 
0

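I recently ran into a similar problem while working with Python 3 and Gensim 2.3.0. print_topics() and show_topics() did not raise any error, but they did not print anything either. It turns out that show_topics() returns a list, so one can simply do: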

topic_list = lda.show_topics()
print(topic_list)