Visualizing topics with Spark LDA

2017-05-29

I am using the pySpark ML LDA library to fit a topic model on sklearn's 20 newsgroups dataset. I perform the standard tokenization, stop-word removal, and TF-IDF transformation on the training corpus. In the end I can get the topics and print out the term indices and their weights:

topics = model.describeTopics() 
topics.show() 
+-----+--------------------+--------------------+ 
|topic|   termIndices|   termWeights| 
+-----+--------------------+--------------------+ 
| 0|[5456, 6894, 7878...|[0.03716766297248...| 
| 1|[5179, 3810, 1545...|[0.12236370744240...| 
| 2|[5653, 4248, 3655...|[1.90742686393836...| 
... 

However, how do I map from the term indices to the actual words in order to visualize the topics? I derived the term indices by applying HashingTF to the tokenized lists of strings. How do I generate a dictionary (a mapping from indices to words) for visualizing the topics?
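For context, a minimal plain-Python sketch of why a hashed index cannot simply be inverted back to a word (the toy hash below is an assumption for illustration, not Spark's actual MurmurHash3):

```python
# Feature hashing maps each term to hash(term) % numFeatures, so the
# index alone does not identify the term that produced it.
num_features = 16

def term_index(term, num_features=num_features):
    # Deterministic toy hash; Spark's HashingTF uses MurmurHash3 internally.
    return sum(ord(c) for c in term) % num_features

# Two different terms can land on the same index...
print(term_index("abc"), term_index("cba"))
# ...so recovering a word from an index requires a separate vocabulary.
```

This is the reason the answer below switches to a vectorizer that retains its vocabulary.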

Answer


An alternative to HashingTF is CountVectorizer, which produces a vocabulary:

from pyspark.ml.feature import CountVectorizer

count_vec = CountVectorizer(inputCol="tokens_filtered", outputCol="tf_features", vocabSize=num_features, minDF=2.0)
count_vec_model = count_vec.fit(newsgroups)
newsgroups = count_vec_model.transform(newsgroups)
vocab = count_vec_model.vocabulary  # list of words, ordered by descending corpus frequency
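The vocabulary is just a Python list in which position i holds the word for term index i, ordered by descending corpus frequency. A plain-Python sketch of that construction (the toy corpus and counting logic are stand-ins for illustration, not Spark internals):

```python
from collections import Counter

# Toy tokenized corpus standing in for the `tokens_filtered` column.
corpus = [
    ["space", "nasa", "launch"],
    ["space", "orbit"],
    ["nasa", "space", "orbit"],
]

# Count term frequency across all documents and keep the most common terms,
# mimicking how CountVectorizer builds model.vocabulary.
counts = Counter(token for doc in corpus for token in doc)
vocab_size = 3
vocab = [word for word, _ in counts.most_common(vocab_size)]

print(vocab)  # most frequent term first
```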

Given the vocabulary as a list of words, we can index into it to visualize the topics:

topics = model.describeTopics()
topics_rdd = topics.rdd

# Map each topic's term indices to the corresponding words in the vocabulary.
topics_words = topics_rdd\
    .map(lambda row: row['termIndices'])\
    .map(lambda idx_list: [vocab[idx] for idx in idx_list])\
    .collect()

for idx, topic in enumerate(topics_words):
    print("topic: ", idx)
    print("----------")
    for word in topic:
        print(word)
    print("----------")
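The index-to-word step itself is plain list indexing; a Spark-free sketch with a toy vocabulary and hand-written describeTopics-style rows (both are assumptions for illustration):

```python
# Toy stand-ins for model.describeTopics() output and the vocabulary.
vocab = ["space", "nasa", "orbit", "launch", "shuttle"]
topic_rows = [
    {"topic": 0, "termIndices": [0, 3, 4]},
    {"topic": 1, "termIndices": [1, 2, 0]},
]

# Same mapping as the RDD pipeline above: replace each index with vocab[index].
topics_words = [[vocab[i] for i in row["termIndices"]] for row in topic_rows]

for idx, topic in enumerate(topics_words):
    print("topic:", idx)
    for word in topic:
        print(" ", word)
```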