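Extracting the most important keywords from a set of documents

I have a set of 3000 text documents and I want to extract the top 300 keywords (each can be a single word or a multi-word phrase).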

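I have tried the following approaches:

RAKE: a Python-based keyword extraction library. It got me nowhere.

Tf-Idf: it gave me good keywords for each document, but I was not able to aggregate them and find keywords that represent the whole group of documents. Also, simply selecting the top k words from each document by Tf-Idf score won't help, right?

Word2vec: I was able to do some cool things such as finding similar words, but I am not sure how to use it to find important keywords.

Can you please suggest a good approach (or explain how to improve any of the three above) to solve this problem? Thanks :)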

Source

2017-08-24 Vijender

It is better to pick those 300 words manually (that is not so many, and it is a one-time task). Code written in Python 3:

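import os

# Hand-picked keywords (placeholders kept from the original answer)
topWords = ["word1", "word2.... etc"]
wordsCount = 0
# Only regular files; os.listdir() also returns directories
files = [f for f in os.listdir() if os.path.isfile(f)]
for file in files:
    # Read the file and split it into whitespace-separated words
    with open(file, "r", errors="ignore") as file_opened:
        words = set(file_opened.read().split())
    for word in topWords:
        if word in words and wordsCount < 300:
            print("I found %s" % word)
            wordsCount += 1
    # Stop scanning files once 300 matches have been reported
    if wordsCount >= 300:
        break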

Source

2017-08-24 12:21:41 durduliu2009


import os
import operator
from collections import defaultdict

# Only regular files; os.listdir() also returns directories
files = [f for f in os.listdir() if os.path.isfile(f)]
words = defaultdict(int)
for file in files:
    with open(file, "r", errors="ignore") as open_file:
        for line in open_file:
            # Count every whitespace-separated token
            for word in line.split():
                words[word] += 1
# Sort by count, most frequent first
sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the top 300 from the sorted words; those are the words you want.

Source

2017-08-24 13:13:42

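Thanks @Awaish, but I have tried this too. The results of this approach are poor, because the important terms appear only once or twice. If I try to sort and select the Tf-idf terms by frequency, many common and irrelevant terms show up. – Vijender

The simplest and most effective approach is to apply a TF-IDF implementation to find the most important words. If you have a stop-word list, you can filter out the stop words before applying this code. Hope this works for you.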

import java.util.List; 

/** 
* Class to calculate TfIdf of term. 
* @author Mubin Shrestha 
*/ 
public class TfIdf { 

    /** 
    * Calculates the tf of term termToCheck 
    * @param totalterms : Array of all the words under processing document 
    * @param termToCheck : term of which tf is to be calculated. 
    * @return tf(term frequency) of term termToCheck 
    */ 
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0; // number of occurrences of termToCheck in the document
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /** 
    * Calculates idf of term termToCheck 
    * @param allTerms : all the terms of all the documents 
    * @param termToCheck 
    * @return idf(inverse document frequency) score 
    */ 
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0; // number of documents that contain termToCheck
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break; // count each document at most once
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}
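
The stop-word filtering suggested before the class is not shown in the answer. A minimal Python sketch of that step might look like the following. It assumes a simple regex tokenizer and a tiny hand-written stop-word list; `STOP_WORDS`, `tokenize`, and `remove_stop_words` are illustrative names, not part of the answer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set, order preserved."""
    return [t for t in tokens if t not in stop_words]


tokens = remove_stop_words(tokenize("The cat is in the hat"))
print(tokens)  # ['cat', 'hat']
```

The filtered token lists can then be passed to whatever TF-IDF code is used afterwards.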

Source

2017-08-25 18:00:41 shiv

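Thanks @shiv. But I have already implemented Tf-Idf, and I use Lucene for it (for faster processing). The problem is that Tf-Idf gives the "important terms" for each document, not for the whole set of documents. – Vijender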
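
One possible way to turn per-document scores into a single ranking for the whole set is to sum each term's TF-IDF score over all documents and keep the highest totals. This is only a sketch, not something proposed in the thread. The function name `corpus_keywords`, the `log(N / df)` form of idf (without the `1 +` used in the Java class above), and the choice of summing rather than averaging or taking the maximum are all assumptions.

```python
import math
from collections import Counter, defaultdict


def corpus_keywords(docs, top_k=300):
    """Rank terms for a whole corpus by summing per-document TF-IDF scores.

    docs is a list of token lists, one list per document.
    Returns up to top_k (term, total_score) pairs, highest score first.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    totals = defaultdict(float)
    for tokens in docs:
        if not tokens:
            continue  # skip empty documents to avoid dividing by zero
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])  # assumed idf variant
            totals[term] += tf * idf

    # Highest total first; ties are broken alphabetically for stable output.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy corpus, already tokenized (made-up data for illustration only).
docs = [
    ["the", "solar", "panel"],
    ["the", "solar", "cell"],
    ["the", "wind", "turbine"],
]
ranked = corpus_keywords(docs, top_k=10)
for term, score in ranked:
    print(term, round(score, 3))
```

In this toy corpus "the" occurs in every document, so its idf and its total score are zero and it ranks last. Multi-word keywords could be handled the same way by adding n-grams to each token list before calling the function.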
